Thread: Limit vs setMaxRows issue

Limit vs setMaxRows issue

From: Sebastiaan van Erk

Hi,

When using the ps.setMaxRows() call on PreparedStatement, the JDBC
driver sets the row limit via the "raw" postgres protocol. However, in
the words of Tom Lane, "the row limit in the protocol only says how many
rows to deliver in the first batch. The presumption is that you'll
eventually grab the rest, and so the query is planned on that basis."

What this means is that when we run the following query:

select action_id from actions order by action_id

with a ps.setMaxRows(100), it takes about 1.8 seconds for the query to
complete. However, if we do the following query:

select action_id from actions order by action_id limit 100

without any ps.setMaxRows() the query only takes 0.156 seconds. This is
more than a factor of 10 faster.
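
For clarity, the two variants look like this (a sketch; conn is an open
JDBC connection to the same database):

    import java.sql.*;

    class MaxRowsVsLimit {
        static void compare(Connection conn) throws SQLException {
            // Variant 1: setMaxRows() only -- ~1.8s here; the server
            // still plans on the assumption that all rows will be fetched.
            PreparedStatement ps1 = conn.prepareStatement(
                "select action_id from actions order by action_id");
            ps1.setMaxRows(100);
            ResultSet rs1 = ps1.executeQuery();

            // Variant 2: LIMIT in the SQL -- ~0.156s; the planner knows
            // that only 100 rows are wanted.
            PreparedStatement ps2 = conn.prepareStatement(
                "select action_id from actions order by action_id limit 100");
            ResultSet rs2 = ps2.executeQuery();
        }
    }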

I'm not 100% sure of what JDBC says about setMaxRows (it's kind of
ambiguous in the Javadoc, as usual), but as far as I can tell, if you
call setMaxRows on the prepared statement there is no way to ever
retrieve more than that number of rows. If this is indeed the case, it
seems to me that there is currently a mismatch between the JDBC API
and the PostgreSQL API, and JDBC should somehow tell postgres that
this is a hard limit and it should not plan for a second batch.

Therefore, my question is: is this a bug? It is not feasible for me to
add LIMIT clauses to all the SQL queries in my code, so if this IS a
bug, I hope it can be fixed. If it is NOT a bug, is there an alternative
workaround that does not involve changing all of my SQL statements?

Thanks in advance,
Sebastiaan



Re: Limit vs setMaxRows issue

From: Dave Cramer

Sebastiaan,

I believe setMaxRows will use a cursor; because you have an ORDER BY
on the cursor, it will have to be fully materialized.

Try it without the ORDER BY.

Dave


Re: Limit vs setMaxRows issue

From: Sebastiaan van Erk

Hi,

I'm sorry, but I don't really know what you mean by setMaxRows using a
cursor.

Note that the *same query* (including the ORDER BY) is fast when it has
LIMIT 100 appended to it. So that query *also* does an ORDER BY. The
only difference between the two queries is that one uses the
setMaxRows() call to limit the number of rows in the result set, and the
other uses SQL.

Greetings,
Sebastiaan


Dave Cramer wrote:
> Sebastiaan,
>
> I believe setMaxRows will use a cursor; because you have an ORDER BY
> on the cursor, it will have to be fully materialized.
>
> Try it without the ORDER BY.
>
> Dave

Re: Limit vs setMaxRows issue

From: Kris Jurka

On Wed, 21 Jun 2006, Sebastiaan van Erk wrote:

> I'm not 100% sure of what JDBC says about setMaxRows (it's kind of
> ambiguous in the Javadoc, as usual), but as far as I can tell, if you
> call setMaxRows on the prepared statement there is no way to ever
> retrieve more than that number of rows. If this is indeed the case, it
> seems to me that there is currently a mismatch between the JDBC API
> and the PostgreSQL API, and JDBC should somehow tell postgres that
> this is a hard limit and it should not plan for a second batch.
>
> Therefore, my question is: is this a bug? It is not feasible for me to
> add LIMIT clauses to all the SQL queries in my code, so if this IS a
> bug, I hope it can be fixed. If it is NOT a bug, is there an
> alternative workaround that does not involve changing all of my SQL
> statements?
>

I'm not sure how you would like the driver to tell the server that it
doesn't want more than setMaxRows rows.  The defined API for this is
using LIMIT in your SQL query.  The driver cannot do this for you (at
least without parsing your query) because the query may already have a
LIMIT, or it may be something like an INSERT into a VIEW that has a DO
INSTEAD SELECT rule on it.  If you're suggesting that we extend the
frontend/backend protocol to include this extra information, then
that's definitely a feature request, not a bug report.
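
For example (the view name below is invented for illustration),
appending LIMIT to either of these would go wrong:

    import java.sql.*;

    class WhyNoAutoLimit {
        static void examples(Connection conn) throws SQLException {
            // Already has a LIMIT: appending another would yield
            // "... LIMIT 5 LIMIT 100", which is a syntax error.
            PreparedStatement ps1 = conn.prepareStatement(
                "SELECT action_id FROM actions ORDER BY action_id LIMIT 5");

            // INSERT into a view with a DO INSTEAD SELECT rule: it may
            // return rows, but "INSERT ... LIMIT 100" is not valid SQL.
            PreparedStatement ps2 = conn.prepareStatement(
                "INSERT INTO actions_view VALUES (42)");
        }
    }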

Kris Jurka

Re: Limit vs setMaxRows issue

From: "A.M."

On Wed, June 21, 2006 12:00 pm, Kris Jurka wrote:

>
> I'm not sure how you would like the driver to tell the server that it
> doesn't want more than setMaxRows rows.  The defined API for this is
> using LIMIT in your SQL query.  The driver cannot do this for you (at
> least without parsing your query) because the query may already have a
> LIMIT, or it may be something like an INSERT into a VIEW that has a DO
> INSTEAD SELECT rule on it.  If you're suggesting that we extend the
> frontend/backend protocol to include this extra information, then
> that's definitely a feature request, not a bug report.

The backend protocol already supports maximum row limit if you use the
extended protocol.

I guess it's a legacy thing in the driver, but there really is no reason
to use the simple query protocol at all on recent PostgreSQL releases.

-M


Re: Limit vs setMaxRows issue

From: Tom Lane

"A.M." <agentm@themactionfaction.com> writes:
> On Wed, June 21, 2006 12:00 pm, Kris Jurka wrote:
>> If you're suggesting that we extend the
>> frontend/backend protocol to include this extra information, then
>> that's definitely a feature request, not a bug report.

> The backend protocol already supports maximum row limit if you use the
> extended protocol.

No, it would take a protocol change to add such a thing out-of-line
(that is, not as a LIMIT clause in the query text).  The reason is that
the planning is done at PARSE time, or at the latest BIND time.  The
row limit field in the EXECUTE message comes far too late to affect the
query plan.  EXECUTE's row limit was not meant as anything except a way
to fetch a query result in segments, avoiding the grab-it-all-at-once,
run-out-of-memory syndrome.  It is definitely *not* meant to imply that
the client doesn't intend to fetch the whole query result eventually.

I don't have a lot of sympathy for the OP's position that he shouldn't
have to use a LIMIT clause for this ...

            regards, tom lane

Re: Limit vs setMaxRows issue

From: Oliver Jowett

Sebastiaan van Erk wrote:

> Therefore, my question is: is this a bug? It is not feasable for me to
> add LIMIT clauses to all the SQL queries in my code, so if this IS a
> bug, I hope it can be fixed. If it is NOT a bug, is there an alternative
> workaround that does not involve changing all of my sql statements?

It's not a bug. setMaxRows() is essentially a hint; there's certainly no
requirement that the driver go off and add LIMIT clauses to your query.
The minimal implementation won't change query execution at all and will
just limit rows coming back out of the ResultSet.. It might be nice to
add LIMIT, but that would require the driver to parse query strings,
which gets very complicated and isn't going to catch all the cases
anyway. You'll be getting at least some improvement with the existing
driver because the whole resultset isn't being transferred and
processed, even if the plan still assumes you will grab all the data.

If your queries need a LIMIT clause to get decent performance then your
safest bet is to add a LIMIT clause yourself. You can keep the
setMaxRows() as well if you like..
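
i.e. something like this (a sketch only, reusing the query from the
start of the thread):

    import java.sql.*;

    class LimitPlusMaxRows {
        static ResultSet first100(Connection conn) throws SQLException {
            // LIMIT tells the planner; setMaxRows() stays as a portable,
            // harmless client-side cap on top of it.
            PreparedStatement ps = conn.prepareStatement(
                "select action_id from actions order by action_id limit 100");
            ps.setMaxRows(100);
            return ps.executeQuery();
        }
    }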

-O

Re: Limit vs setMaxRows issue

From: Sebastiaan van Erk

Hi,

Thanks for the helpful replies (thanks also to Oliver Jowett).

Since a setMaxRows() call actually limits the data you can access to
that specific number of rows, and it is impossible to ever get more
rows, it seems to me a waste of time and a loss of performance that the
backend does not know this and prepares the result as if everything
will (eventually) be returned.

I am not suggesting that the driver parse queries and add a LIMIT clause
itself. That would make the driver exceedingly complex, would duplicate
logic that is already in postgres itself, and would be a maintenance
nightmare. Furthermore, it would probably introduce many new bugs and
cost lots of work, all for functionality (limiting the result set to n
rows) that *already exists* in postgres itself.

So I guess it is indeed a feature request then: that the backend
protocol support limiting the result set without having to alter the
query, and that this limit be a hard limit [I will never ask for more
rows] instead of a soft limit [I might ask for the other rows].
Considering that all the *functionality* is already implemented, this
should not be too much work, I imagine.

The reason I would like to see this feature (instead of adding the LIMIT
manually) is cross-database compatibility (which is the essence of
JDBC). Basically, setMaxRows is portable; LIMIT is not. Since I am part
of a team developing a cross-database application in which performance
is often important, this feature is quite important to us. Currently
postgres is slow for us on simple index queries on large data sets (1.8
seconds for the first 100 primary keys only of a table of 43000 rows);
and unfortunately, these kinds of queries are very common in our
application.

Regards,
Sebastiaan


Re: Limit vs setMaxRows issue

From: Mark Lewis

JDBC is a little too low-level to give true database independence; you
can write portable queries, but you're severely restricted when it comes
to functionality supported by most databases but not in a standardized
way, such as limits, locking, performance hinting, sequences/serials,
etc.

For simple, non-performance critical apps you can mostly get away with
it (as we did for a while with some of our products).  But for anything
more sophisticated, your application really needs a way to deal with
database-specific SQL.

On newer projects we use Hibernate HQL, which has been a major boon in
terms of database portability and performance.
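
For example (the entity name is made up), a one-liner like this lets
Hibernate emit the dialect-specific row limit for you:

    import java.util.List;
    import org.hibernate.Session;

    class HqlLimit {
        static List first100(Session session) {
            // Hibernate translates setMaxResults() into the dialect's
            // row-limiting SQL (LIMIT on PostgreSQL, ROWNUM on Oracle, ...)
            return session.createQuery("from Action a order by a.id")
                          .setMaxResults(100)
                          .list();
        }
    }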

-- Mark Lewis

Re: Limit vs setMaxRows issue

From: David Wall

What about the setFetchSize method?  My impression is that using
setFetchSize along with setMaxRows would do the trick on the back end,
since it results in the cursor mechanism that doesn't retrieve the
complete result set.  Is that not the case in PG 8.1 (at least with the
few caveats listed at http://jdbc.postgresql.org/documentation/81/query.html)?
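
i.e. something along these lines (a sketch, subject to the caveats on
that page: autocommit off, forward-only ResultSet):

    import java.sql.*;

    class CursorFetch {
        static void scan(Connection conn) throws SQLException {
            // Cursor-based fetching needs autocommit off per the 8.1 docs.
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                "select action_id from actions order by action_id");
            ps.setMaxRows(100);   // client-side cap
            ps.setFetchSize(50);  // 50 rows per round trip via a portal
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // at most 100 rows ever surface here
            }
            rs.close();
            ps.close();
        }
    }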

David





Re: Limit vs setMaxRows issue

From: Sebastiaan van Erk
Date:
Hi,

We actually do use part of Hibernate (the dialects) with some custom
additions, especially for creating tables, indexes, etc. However, since
the user of the application creates the data model (and changes it at
runtime), and since the user is actually working in a tables-and-columns
paradigm, we cannot really use the ORM or HQL parts of Hibernate.

Anyway, apart from the joys (i.e. difficulties) of writing portable SQL
(oh what fun ;-)), I still think it would be a nice improvement to the
JDBC driver if setMaxRows actually hard-limited the number of returned
rows, considering it hardly makes sense in this case for postgres to
prepare to return more rows than specified.

I'm willing to help or even code it if someone points me in the right
direction. The JDBC part should be easy; I'm not sure about the postgres
side, but I imagine it's not too difficult either. The hard part is
probably getting the protocol extended and getting people to agree that
it's a good idea to change the protocol, etc... ;-) (although it's a
pretty non-intrusive, backwards-compatible change).

Regards,
Sebastiaan


Re: Limit vs setMaxRows issue

From: Marc Herbert

Tom Lane <tgl@sss.pgh.pa.us> writes:

> No, it would take a protocol change to add such a thing out-of-line
> (that is, not as a LIMIT clause in the query text).  The reason is that
> the planning is done at PARSE time, or at the latest BIND time.  The
> row limit field in the EXECUTE message comes far too late to affect the
> query plan.


If planning is done at the time of creation of the PreparedStatement
object (reminder: the example given above has no parameters), then the
setMaxRows() call will come too late, whatever protocol change is made.
I mean: no protocol change can go back in time and "optimize" away
useless work already done.

Thanks in advance for pointing out my mistake(s) here.

Re: Limit vs setMaxRows issue

From: Marc Herbert

Mark Lewis <mark.lewis@mir3.com> writes:

> JDBC is a little too low-level to give true database independence; you
> can write portable queries, but you're severely restricted when it comes
> to functionality supported by most databases but not in a standardized
> way, such as limits, locking, performance hinting, sequences/serials,
> etc.
>
> For simple, non-performance critical apps you can mostly get away with
> it (as we did for a while with some of our products).  But for anything
> more sophisticated, your application really needs a way to deal with
> database-specific SQL.

Thanks a lot for sharing this very valuable experience from the
field. However, .setMaxRows() IS currently part of this portable but
severely restricted JDBC performance feature set, so I see no reason
not to try to implement it as well as possible.

The more performance you can get from JDBC, the less non-portable code
people will write.

And JDBC is evolving too.


Re: Limit vs setMaxRows issue

From: Marc Herbert

Oliver Jowett <oliver@opencloud.com> writes:

> It's not a bug. setMaxRows() is essentially a hint, there's certainly
> no requirement that the driver will go off and add LIMIT clauses to
> your query,

My reading of the JDBC javadoc and my excessive "pickiness" do not
agree with the word "hint":

    /**
     * Sets the limit for the maximum number of rows that any
     * <code>ResultSet</code> object can contain to the given number.
     * If the limit is exceeded, the excess
     * rows are silently dropped.


OK, nothing ever tells you that the server has to behave optimally and
never compute anything useless. However, in this case you know _for
sure_ that additional rows will be useless.


Re: Limit vs setMaxRows issue

From: Oliver Jowett

Marc Herbert wrote:
> My reading of the JDBC javadoc and my excessive "pickiness" do not
> agree with the word "hint":
[...]
> OK, nothing ever tells you that the server has to behave optimally and
> never compute anything useless. However, in this case you know _for
> sure_ that additional rows will be useless.

I would assume that if the intent of the spec was "put LIMIT on the
query", they would say that.

As it stands all it does from an API point of view is change the
behaviour of the ResultSet. Whether that does anything for efficiency is
an implementation decision. The driver certainly behaves correctly
according to the API, which was my point.

There are other optimizations you can make that don't affect query
execution at all -- for example, you know you only need to store the
first N rows returned, not all of them.
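
A sketch of that kind of optimization (hypothetical code, not the
actual driver internals):

    import java.util.*;

    class RowBuffer {
        // Hypothetical driver-side buffering: rows beyond maxRows are
        // still read off the wire but never stored.
        static List<Object[]> keepFirst(Iterator<Object[]> rowsFromServer,
                                        int maxRows) {
            List<Object[]> kept = new ArrayList<Object[]>();
            while (rowsFromServer.hasNext()) {
                Object[] row = rowsFromServer.next(); // always consumed
                if (maxRows == 0 || kept.size() < maxRows) {
                    kept.add(row);                    // only first N stored
                }
            }
            return kept;
        }
    }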

-O

Re: Limit vs setMaxRows issue

From: Oliver Jowett

Marc Herbert wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>
>>No, it would take a protocol change to add such a thing out-of-line
>>(that is, not as a LIMIT clause in the query text).  The reason is that
>>the planning is done at PARSE time, or at the latest BIND time.  The
>>row limit field in the EXECUTE message comes far too late to affect the
>>query plan.
>
>
>
> If planning is done at the time of creation of the PreparedStatement
> object (reminder: the example given above has no parameters), then the
> setMaxRows() call will come too late, whatever protocol change is made.
> I mean: no protocol change can go back in time and "optimize" away
> useless work already done.
>
> Thanks in advance for pointing out my mistake(s) here.

We do not special-case the no-parameters case, so it's handled just like
all the other cases: the query is parsed and planned immediately before
execution. We also avoid an extra round-trip by doing it at that point.

If you're interested in the details of this, the driver source code is
really your best reference..

-O

Re: Limit vs setMaxRows issue

From: Marc Herbert

Oliver Jowett <oliver@opencloud.com> writes:

> Marc Herbert wrote:

>> If planning is done at the time of creation of the PreparedStatement
>> object (reminder: the example given above has no parameters), then the
>> setMaxRows() call will come too late, whatever protocol change is made.
>> I mean: no protocol change can go back in time and "optimize" away
>> useless work already done.
>> Thanks in advance for pointing out my mistake(s) here.
>
> We do not special-case the no-parameters case, so it's handled just
> like all the other cases: the query is parsed and planned
> immediately before execution.

OK, I should not have put this "reminder" in; I thought it would
simplify the discussion, but it did not.

According to
http://www.postgresql.org/docs/8.1/interactive/sql-prepare.html, as
well as any other similar non-DB-specific documentation, the server is
able to plan the query at parse time, BEFORE receiving any actual
parameter. Connection.prepareStatement()'s Javadoc calls this
"pre-compilation" (and it's optional).


> the query is parsed and planned immediately before execution.

Hum, interesting. Looks like a "lazy prepared" statement, no
pre-compilation? If you delay parsing & planning, then of course you
would not need to go back in time to add late
optimizations...

However, what about subsequent executions of the same, now already
prepared and planned, statement? Now you would have to go back in time
to perform any .setMaxRows() optimization, right?

I assume here that it's the query planning phase that would benefit
from any .setMaxRows() optimization, correct me if I'm wrong.


> We also avoid an extra round-trip by doing it at
> that point.

Then you also avoid any latency benefit that _could_ arise thanks to
pre-compilation. At least for the first execution.



> If you're interested in the details of this, the driver source code is
> really your best reference..

Sure, but for such high-level architectural questions you can easily
understand that I prefer to read your enlightening explanations.


Re: Limit vs setMaxRows issue

From
Marc Herbert
Date:
>> My reading of the JDBC javadoc and my excessive "pickiness" do not
>> agree with the word "hint"
>>     /**
>>      * Sets the limit for the maximum number of rows that any
>>      * <code>ResultSet</code> object can contain to the given number.
>>      * If the limit is exceeded, the excess
>>      * rows are silently dropped.
>> OK nothing ever tells you that the server has to behave optimally and
>> never compute anything useless. However in this case you know _for
>> sure_ that additional rows will be useless.
>
> I would assume that if the intent of the spec was "put LIMIT on the
> query", they would say that.

I think they would never say that, because they are quite careful never
to try to enforce any kind of implementation/performance detail. They
just care about the interface and the external results, any timing
consideration always excluded.

So the way they seem to do it is just to open an eight-lane-wide road
to performance in front of you ("if you compute that, it WILL be
useless and dropped at some point. Now compute what you want."). My
poor English skills would not have called this a "hint". I guess this
is just a minor misunderstanding.

Of course I'm assuming here that computing those excess rows has no
side-effect. Hopefully this is always the case?!


> As it stands all it does from an API point of view is change the
> behaviour of the ResultSet. Whether that does anything for efficiency
> is an implementation decision. The driver certainly behaves correctly
> according to the API, which was my point.

Agreed for sure.


Re: Limit vs setMaxRows issue

From: Oliver Jowett

Marc Herbert wrote:
> Oliver Jowett <oliver@opencloud.com> writes:

>> the query is parsed and planned immediately before execution.
>
> Hum, interesting. Looks like a "lazy prepared" statement, no
> pre-compilation? If you delay parsing & planning, then of course you
> would not need to go back in time to add late
> optimizations...

I don't know what you mean by "lazy prepared" statements. We give the
statement to the server for parsing and planning at the point when we
know both the query and the parameter types -- which, because of the
JDBC API design, means just before execution. We retain the parse/plan
results on the server side when it looks like the statement will be
reused (using a simple "how many times has this statement already been
reused?" metric), otherwise we reparse/replan on each execution.

> However, what about subsequent executions of the same, now already
> prepared and planned, statement? Now you would have to go back in time
> to perform any .setMaxRows() optimization, right?

It's no different to the case where you change the parameter types really.

-O


Re: Limit vs setMaxRows issue

From: Marc Herbert

Oliver Jowett <oliver@opencloud.com> writes:

> Marc Herbert wrote:
>> Oliver Jowett <oliver@opencloud.com> writes:
>
>>> the query is parsed and planned immediately before execution.
>> Hum, interesting. Looks like a "lazy prepared" statement, no
>> pre-compilation? If you delay parsing & planning, then of course you
>> would not need to go back in time to add late
>> optimizations...
>
> I don't know what you mean by "lazy prepared" statements. We give the
> statement to the server for parsing and planning at the point when we
> know both the query and the parameter types -- which, because of the
> JDBC API design, means just before execution.

OK thanks, now I think I got it: it seems like the JDBC API does not
assume you need parameter types to plan the query.

Connection.prepareStatement()

     * If the driver supports precompilation,
     * the method <code>prepareStatement</code> will send
     * the statement to the database for precompilation.

No word about parameter types.  If you do not need the datatypes, then
you can send/plan the query way earlier (not "lazy"), at
PreparedStatement object construction time.

I should have noticed this issue from here
<http://www.postgresql.org/docs/8.1/interactive/sql-prepare.html>
Looks like we have an API mismatch here.

Is this parameter-types issue specific to PostgreSQL, or is the JDBC
API just badly designed on this point?

Another approach could be for the server to infer the types from the
query string?

If not 100% reliably, then in an optimistic way?


> We retain the parse/plan results on the server side when it looks
> like the statement will be reused (using a simple "how many times
> has this statement already been reused?" metric), otherwise we
> reparse/replan on each execution.

Could this be compatible with any setMaxRows() optimization? I guess
yes, provided the maxRows parameter is considered in prepared-statement
matching, just like the datatypes seem to be... Maybe this is already
the case for non-JDBC PREPARE + LIMIT?
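
i.e. something like this purely hypothetical cache key (nothing of the
sort exists today, as far as I know):

    import java.util.Arrays;

    // Purely hypothetical: a cache key that would only let a cached
    // server-side plan be reused when the row limit matches too.
    final class PlanKey {
        final String sql;
        final int[] paramTypeOids;
        final int maxRows; // 0 = unlimited

        PlanKey(String sql, int[] paramTypeOids, int maxRows) {
            this.sql = sql;
            this.paramTypeOids = paramTypeOids;
            this.maxRows = maxRows;
        }

        public boolean equals(Object o) {
            if (!(o instanceof PlanKey)) return false;
            PlanKey k = (PlanKey) o;
            return sql.equals(k.sql)
                && Arrays.equals(paramTypeOids, k.paramTypeOids)
                && maxRows == k.maxRows;
        }

        public int hashCode() {
            return sql.hashCode() ^ Arrays.hashCode(paramTypeOids) ^ maxRows;
        }
    }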


Thanks again for all these answers.


Re: Limit vs setMaxRows issue

From: Oliver Jowett

Marc Herbert wrote:

> OK thanks, now I think I got it: it seems like the JDBC API does not
> assume you need parameter types to plan the query.

The JDBC API doesn't say anything at all about query planning, AFAIK.

> Connection.prepareStatement()
>
>      * If the driver supports precompilation,
>      * the method <code>prepareStatement</code> will send
>      * the statement to the database for precompilation.

We don't support precompilation in the sense it's used here, then.

[...]

I am a bit confused about what this discussion is actually about .. do
you have a point to make here? What is it that you want to do that the
current driver doesn't do well? A fair amount of work has gone into
getting query execution working smoothly and efficiently within the
constraints of the API already.. Vague high-level handwaving, especially
without a clear target, doesn't get us anywhere. For example, it's
largely irrelevant to an application whether the query gets parsed by
the server at statement creation or statement execution .. at worst, it
means you see parse errors at a different point.

>>We retain the parse/plan results on the server side when it looks
>>like the statement will be reused (using a simple "how many times
>>has this statement already been reused?" metric), otherwise we
>>reparse/replan on each execution.
>
> Could this be compatible with any setMaxRows() optimization? I guess
> yes, provided the maxRows parameter is considered in preparedstatement
> matching, just like datatypes seem to be... Maybe this is already the
> case for non-JDBC PREPARE + LIMIT?

This is all very hypothetical without actual protocol support. I'd
suggest you start from a proposed protocol design and work back from
there. The limiting factor is likely to be what you can easily support
on the server side, not the driver implementation.

-O

Re: Limit vs setMaxRows issue

From
Marc Herbert
Date:
Oliver Jowett <oliver@opencloud.com> writes:

> Marc Herbert wrote:
>
>> OK thanks, now I think I got it: it seems like the JDBC API does not
>> assume you need parameter types to plan the query.
>
> The JDBC API doesn't say anything at all about query planning, AFAIK.
>
>> Connection.prepareStatement()
>>      * If the driver supports precompilation,
>>      * the method <code>prepareStatement</code> will send
>>      * the statement to the database for precompilation.
>
> We don't support precompilation in the sense it's used here, then.

Again here, I think the word "compilation" is purposely fuzzy to be
portable across DBMS. The message is:

 if  the driver supports <pre-compilation|pre-parsing|pre-planning|pre-whatever>
 then
    the PreparedStatement will be _sent to the database_ at
    object creation time, and then not sent again but only referred to.

By the way, that's precisely the reason why the SQL string of a
PreparedStatement is documented as not mutable: to allow this
optimization.

PreparedStatements are quite obviously meant to match the SQL "PREPARE"
statement or the ODBC SQLPrepare() function
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbc/htm/odbcprepared_execution.asp>.
The name re-use is not just a coincidence, don't you think?


> [...]
>
> I am a bit confused about what this discussion is actually about .. do
> you have a point to make here? What is it that you want to do that the
> current driver doesn't do well? A fair amount of work has gone into
> getting query execution working smoothly and efficiently within the
> constraints of the API already.. Vague high-level handwaving,
> especially without a clear target, doesn't get us anywhere.

OK, this discussion has slipped a bit into a general description of the
implementation of PreparedStatement, but the main questions have stayed
explicit, simple, and not so far away:

 1. is it possible to get the same level of performance achieved by
 LIMIT using .setMaxRows()? (the question still in the Subject:)

 2. when is the SQL query actually sent to the database and
 "prepared"? This question came up because it seems any .setMaxRows()
 coming AFTER this point will come too late to help the server,
 whatever the protocol change.

Then I digressed to other PreparedStatement performance questions. You
did not have to answer; thanks a lot for doing so.


> For example, it's largely irrelevant to an application whether the
> query gets parsed by the server at statement creation or statement
> execution .. at worst, it means you see parse errors at a different
> point.

It's functionally irrelevant, but this is a discussion about
performance. From what I gathered, calling .setMaxRows() before vs.
after query parsing greatly changes the range of possible
optimizations. This discussion is just about the ordering of events
(= "hand-waving"?). I'm not even discussing any protocol changes at
this point.

Even if you delay the parsing on the first execution in order to wait
for any helpful but late .setMaxRows(), then later .setMaxRows() on
the next re-executions will come after parsing anyway. Unless you
kinda re-parse every time...

I find it very interesting to discuss possible strategies to handle
this issue, but you do not have to be interested. If no one else is
interested, then I'll stop making noise on the list.



> The limiting factor is likely to be what you can easily support
> on the server side, not the driver implementation.

I can understand that. Looks like we already saw that with the
datatypes issue.

Re: Limit vs setMaxRows issue

From: Markus Schaber

Hi, Marc,

Marc Herbert wrote:

> Even if you delay the parsing on the first execution in order to wait
> for any helpful but late .setMaxRows(), then later .setMaxRows() on
> the next re-executions will come after parsing anyway. Unless you
> kinda re-parse every time...

Only when setMaxRows() actually changes.

And theoretically, only re-planning is required, not re-parsing, but
currently the server does not allow clients to control those steps
independently.

HTH,
Markus


--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

"prepared" statements (Re: Limit vs setMaxRows issue)

From: Marc Herbert

Oliver Jowett <oliver@opencloud.com> writes:

> I am a bit confused about what this discussion is actually about .. do
> you have a point to make here? What is it that you want to do that the
> current driver doesn't do well? A fair amount of work has gone into
> getting query execution working smoothly and efficiently within the
> constraints of the API already.. Vague high-level handwaving,
> especially without a clear target, doesn't get us anywhere. For
> example, it's largely irrelevant to an application whether the query
> gets parsed by the server at statement creation or statement execution
> .. at worst, it means you see parse errors at a different point.


- By the way:

<http://java.sun.com/j2se/1.4.2/docs/api/java/sql/PreparedStatement.html#getMetaData()>

  Retrieves a ResultSetMetaData object that contains information about
  the columns of the ResultSet object that will be returned when this
  PreparedStatement object is executed.

  Because a PreparedStatement object is precompiled, it is possible to
  know about the ResultSet object that it will return without having to
  execute it. Consequently, it is possible to invoke the method
  getMetaData on a PreparedStatement object rather than waiting to
  execute it and then invoking the ResultSet.getMetaData method on the
  ResultSet object that is returned.

Similar thing for #getParameterMetaData().
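
For instance (a sketch, reusing the table from the start of the
thread):

    import java.sql.*;

    class DescribeBeforeExecute {
        static void describe(Connection conn) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                "select action_id from actions where action_id > ?");
            // No execution needed: the driver can ask the server to
            // describe the statement instead.
            ResultSetMetaData md = ps.getMetaData();
            System.out.println(md.getColumnName(1));      // "action_id"
            ParameterMetaData pmd = ps.getParameterMetaData();
            System.out.println(pmd.getParameterCount());  // 1
        }
    }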


- However, in the 3rd edition of the JDBC book:

  23.1.3 Using parameter metadata wisely

  The purpose of a prepared statement is to allow the data source to
  "prepare" an SQL statement, that is, to compile an SQL statement
  before it is executed. [..] However, not all data sources prepare a
  statement the same way, and some do not prepare them at all. If a
  data source does not precompile SQL statements, the
  ParameterMetaData methods will not be able to return results until
  execution [this is a lie because it misses "lazy" implementations,
  see below].


According to my (limited) understanding of the code, pgjdbc sends a
special "describe" request to the server for #getMetaData() in case the
statement has not been executed yet, and systematically for
#getParameterMetaData().  Neither one seems to cache results, at least
not on the driver side.


As you said above, all this looks very well-crafted to be functionally
irrelevant to the application. It definitely looks relevant and good
to know concerning performance.