Prepared Statements and large where-id-in constant blocks?

From
James Robinson
Date:
Howdy:

    Java middlewares like JBossCMP issue many queries in the general form
of:

        SELECT t1.att1, t1.att2 ... t1.attN FROM t1 WHERE (t1.id = X) or
(t1.id = Y) ....

where there may be anywhere between 1 and thousands of "(id = N)"
blocks ORed together. These may be transformed to the "WHERE t1.id IN
(X, Y, ...)" form for possibly a little performance gain -- I've not
yet checked whether this plans better than the other, but I could
imagine the IN form being parsed into a hashjoin form as opposed to a
huge index filter form.
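For concreteness, the two textual forms differ only in how the id
constants are spliced into the WHERE clause. A minimal sketch of
building each (helper names are hypothetical, and the constants are
inlined purely for illustration -- real code would use bound
parameters):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: builds the two equivalent SQL text forms discussed
// above. Class and method names are assumptions, not JBossCMP code.
public class InListRewrite {

    // The "(t1.id = X) OR (t1.id = Y) ..." form emitted by the CMP layer.
    public static String orForm(String table, String idCol, List<Long> ids) {
        String preds = ids.stream()
                .map(id -> "(" + table + "." + idCol + " = " + id + ")")
                .collect(Collectors.joining(" OR "));
        return "SELECT * FROM " + table + " WHERE " + preds;
    }

    // The equivalent "t1.id IN (X, Y, ...)" form.
    public static String inForm(String table, String idCol, List<Long> ids) {
        String list = ids.stream().map(String::valueOf)
                .collect(Collectors.joining(", "));
        return "SELECT * FROM " + table + " WHERE " + table + "." + idCol
                + " IN (" + list + ")";
    }
}
```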

    Real performance gains, however, could be attained through being able
to ultimately use either the v2 PREPARE / EXECUTE statements or the
better v3 protocol's bind / execute commands, but only if the SQL-level
form of the query could better represent the fact there are not really
N params, but, rather, one single param of type Set (or, heck, Array?).
This would let all such queries map onto one single backend prepared
statement, regardless of the size of the id set being passed in.

    I guess that separate preparation for each different cardinality would
be okay performance-wise, but if there were some way to get all such
queries factored-down into one single planned statement, it could:

1) Make far better use of JBoss's connection-wide LRU cache of
PreparedStatements, since only one entry (with much higher cache hit
rate) could exist for the entire family of queries.

2) Make better use of backend memory, since it only needed to prepare
one such (generic) form, as opposed to one for each cardinality.


Problems in implementation:

    1) The JBoss query engine would need to be educated about the
potential to use this form as opposed to the "OR (t1.id=X)" form.
Likewise, JBoss could / should be educated about using the "WHERE
t1.id IN (X, Y, ...)" form for databases which support "WHERE .. IN (
.. )" -- probably an easier sell, since that form is most likely
supported by more DBs than just PG.

    2) Does the JDBC spec allow any hooks for passing in a set of ids as
one single param? We'd need the SQL-template to be prepared to look
something like:

    SELECT t1.attr1 FROM t1 where t1.id in ( ? )

 From memory, perhaps setArray() could be hacked for the job. I know
JBossCMP uses the setObject() call, so perhaps JDBC could be taught to
handle a java.util.Collection -- arguably a cleaner way to do it, with
no backward compatibility issues since it would be all-new
functionality. The JDBC driver could just iterate through the
collection contents, calling setObject accordingly. The harder part
would be educating JBoss to do this. The hardest part would be
convincing someone to commit it into JBoss.
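As a sketch of the driver-side trick (this is not the real PostgreSQL
JDBC driver internals -- the helper name and the single-placeholder
assumption are illustrative only), the driver could rewrite the lone
"?" inside the IN list into one placeholder per collection element,
then bind each element in turn via the ordinary setObject machinery:

```java
// Hypothetical sketch: expand one '?' placeholder into n placeholders.
// Assumes the template contains exactly one '?', as in
//   "SELECT t1.attr1 FROM t1 WHERE t1.id IN ( ? )".
public class PlaceholderExpansion {

    public static String expand(String sqlTemplate, int n) {
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) placeholders.append(", ");
            placeholders.append('?');
        }
        // String.replace substitutes every literal '?', so this relies
        // on the single-placeholder assumption stated above.
        return sqlTemplate.replace("?", placeholders.toString());
    }
}
```

The driver would then loop over the java.util.Collection, calling its
existing setObject logic once per element against the expanded
statement.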

    3) Can a cardinality-free plan even be made? I bet I'm assuming a
little too much in asserting all such plans are equal, but I suspect
that Tom is going to tell me that the query for just one id value would
and should be planned differently from the 800-value form, since the
800-value form might well prefer a full sequential scan, since the
table might only have 900 live rows in it.

Anyone have any insight or opinions?

[ crossposted to pgsql-sql for anyone's insight into the pure SQL /
planning matters. Apologies in advance ].

----
James Robinson
Socialserve.com


Re: Prepared Statements and large where-id-in constant blocks?

From
Oliver Jowett
Date:
James Robinson wrote:
> Howdy:
>
>     Java middlewares like JBossCMP issue many queries in the general
> form of:
>
>         SELECT t1.att1, t1.att2 ... t1.attN FROM t1 WHERE (t1.id = X) or
> (t1.id = Y) ....
>
> where there may be anywhere between 1 and thousands of "(id = N)" blocks
> ORed together.

>     2) Does the JDBC spec allow any hooks for passing in a set of ids as
> one single param? We'd need the SQL-template to be prepared to look
> something like:
>
>     SELECT t1.attr1 FROM t1 where t1.id in ( ? )

This was discussed a while ago when fixing some SQL escaping issues.
Previously you could use setObject to get around the usual string
escaping and do the above if you constructed the IN set string yourself,
but the driver lost a lot of type information, and if the input was
hostile you were in trouble.

The JDBC spec doesn't seem to provide a portable way to fill an IN
clause with a single parameter. An extension for supporting IN lists
safely was discussed at the time, but nothing concrete came out of it.

The CMP layer could perhaps use the = ANY array syntax and setArray()
(note that setArray() in the current driver needs fixing, I have an old
patch to do this):

   SELECT t1.attr1 FROM t1 where t1.id = ANY (?)

Alternatively, perhaps the CMP layer could generate a largish N-value IN
(?,?,?,...). Then it can reuse that single prepared query for all
queries with <= N values, filling the trailing parameters that aren't
needed with NULL or a dummy/duplicate value.
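A minimal sketch of that padding idea (hypothetical helper, not CMP or
driver code): always bind exactly N values, filling the unused trailing
slots with a dummy that can never match a real primary key.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: pad an id list out to the fixed cardinality N of a
// pre-prepared "IN (?,?,...,?)" statement.
public class PaddedInList {

    public static List<Long> pad(List<Long> ids, int n, long dummy) {
        if (ids.size() > n) {
            throw new IllegalArgumentException("too many ids for prepared form");
        }
        List<Long> out = new ArrayList<>(ids);
        while (out.size() < n) {
            out.add(dummy); // dummy must never occur as a real primary key
        }
        return out;
    }
}
```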

I don't know how these queries would perform compared to a
correctly-sized IN clause, though.

-O

Re: [SQL] Prepared Statements and large where-id-in constant blocks?

From
Tom Lane
Date:
James Robinson <jlrobins@socialserve.com> writes:
> where there may be anywhere between 1 and thousands of "(id = N)"
> blocks ORed together. These may be transformed to the "WHERE t1.id IN
> (X, Y, ...)" form for possibly a little performance gain (possibly --
> I've not yet checked to see if this plans better than the other, but I
> could imagine this form being parsed into the hashjoin form as opposed
> to a huge index filter form).

There is *no difference whatever*; in fact the PG parser expands an IN
clause into an OR'd list.

Possibly this is something to improve someday, but there's surely no
percentage in doing lots of work in the JDBC driver to prefer the IN
form at the moment.

            regards, tom lane

Re: Prepared Statements and large where-id-in constant blocks?

From
Oliver Jowett
Date:
Oliver Jowett wrote:

> The CMP layer could perhaps use the = ANY array syntax and setArray()
> (note that setArray() in the current driver needs fixing, I have an old
> patch to do this):
>
>   SELECT t1.attr1 FROM t1 where t1.id = ANY (?)

Unfortunately a bit of experimentation indicates that the planner
doesn't do anything clever with ANY + constant array values (at least in
7.4.1 which is what I have to hand):

> test=> explain select * from test_array where i in (1,2,3,4,5);
>                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
>  Index Scan using test_array_pkey, test_array_pkey, test_array_pkey, test_array_pkey, test_array_pkey on test_array  (cost=0.00..15.12 rows=5 width=4)
>    Index Cond: ((i = 1) OR (i = 2) OR (i = 3) OR (i = 4) OR (i = 5))
> (2 rows)
>
> test=> explain select * from test_array where i = any ('{1,2,3,4,5}'::integer[]);
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Seq Scan on test_array  (cost=0.00..807.80 rows=10240 width=4)
>    Filter: (i = ANY ('{1,2,3,4,5}'::integer[]))
> (2 rows)
>
> test=> select version();
>                                                      version
> ------------------------------------------------------------------------------------------------------------------
>  PostgreSQL 7.4.1 on i386-pc-linux-gnu, compiled by GCC i386-linux-gcc (GCC) 3.3.3 20040110 (prerelease) (Debian)
> (1 row)

-O


Re: Prepared Statements and large where-id-in constant blocks?

From
James Robinson
Date:
On Apr 19, 2004, at 10:57 PM, Oliver Jowett wrote:

> Unfortunately a bit of experimentation indicates that the planner
> doesn't do anything clever with ANY + constant array values (at least
> in 7.4.1 which is what I have to hand):

Not only that, but it seems to get planned only as a sequential scan.
Preparing the statement using "SELECT ... WHERE id = ANY (?)" plus a
call to a really-hacked-up version of setObject() would solve the
issue of getting better use out of fewer cached prepared statements,
but the sequential-scan-only planning would be a downer.

And since "OR (id=N)" plans exactly like "id IN (N)", there seems to
be nothing really worth doing there. I examined plans comparing a
6-way join used in our production code with 4 ids in the tail "OR
(id=N)" nodes, then with the full 723 ids, and the plans were markedly
different, preferring sequential scans for many of the intermediate
table joins in the 723-id case. The runtimes for the sequential scans
were faster than forcing index scans, so the planner's voodoo
definitely benefits from full knowledge of how many rows should be
expected (which appears crystal clear in hindsight). So, I doubt that
any single server-side preparation using a single parameter
representing an entire collection of ids could perform as well as it
does currently with full information.

Oliver, I tested your proposal of providing
more-id-params-than-necessary, passing in a dummy value (-1) which
will never be found as a pk in that table, and the planner handled it
efficiently. The repeated instances of "or u.id=-1::int8", when not
pre-planned using PREPARE, were nicely crunched down to a single index
condition clause of "OR (id = -1::BIGINT)". But when transforming this
to PREPARE and EXECUTE pairs, the planner cannot crunch the plan down,
since it has no idea that, say, 500 of the 700 params will all be the
same, so it cannot factor them out at planning time (hence, I guess,
the warning about constant values in the notes section of the manual
page for PREPARE).

All roads seem to lead to "don't attempt to change a thing" -- there
is no easy- or medium-difficulty way to solve this better.

In writing this, I went so far as to think about shunting the list of
ids into a temporary table to join against. Doing this at the SQL
level would be far too costly in round-trip times, but could the
"WHERE id in (A, B, C, ...)" form somehow be transformed into a
hashjoin operation on the backend when the size of the static set is
'high'? Could this not perform (theoretically) better than what
appears to be an O(N) index condition evaluation? I am asking only for
personal edification -- I have no live query where the index condition
solution performs too slowly. In Java-land, if we suppose that the
size of the set could potentially be 'large', we quickly defer to
containing the values in a HashSet if we're going to test for
membership, as opposed to performing linear searches on a list.
Probably a dead horse beaten elsewhere.

Many Thanks.

----
James Robinson
Socialserve.com