Thread: IN or ANY for batch queries

IN or ANY for batch queries

From
Alessandro Gherardi
Date:
Hi,
I'd like to understand if there are any benefits to using SELECT ... WHERE primary_key = ANY(large array of values) as opposed to SELECT ... WHERE primary_key IN(?,?,...) via JDBC.

I'm aware that to use ANY(array), I first have to call array = connection.createArrayOf(Java array), then preparedStatement.setArray(X, array).

I ran a few tests and the latency seems to be pretty much the same. However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Any thoughts?

Thank you in advance,
Alessandro

Re: IN or ANY for batch queries

From
Dave Cramer
Date:
Well we aren't going to cache the array so probably not.


On 25 May 2018 at 22:10, Alessandro Gherardi <alessandro.gherardi@yahoo.com> wrote:
Hi,
I'd like to understand if there are any benefits to using SELECT ... WHERE primary_key = ANY(large array of values) as opposed to SELECT ... WHERE primary_key IN(?,?,...) via JDBC.

I'm aware that to use ANY(array), I first have to call array = connection.createArrayOf(Java array), then preparedStatement.setArray(X, array).

I ran a few tests and the latency seems to be pretty much the same. However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Any thoughts?

Thank you in advance,
Alessandro

Re: IN or ANY for batch queries

From
Charles Pritchard
Date:
Go ahead an test the array length; if it's large, create a temporary table and run a batch insert.

That's been my experience in -everything- whether SQLite, PostgreSQL or HiveQL.


On Fri, May 25, 2018, 7:10 PM Alessandro Gherardi <alessandro.gherardi@yahoo.com> wrote:
Hi,
I'd like to understand if there are any benefits to using SELECT ... WHERE primary_key = ANY(large array of values) as opposed to SELECT ... WHERE primary_key IN(?,?,...) via JDBC.

I'm aware that to use ANY(array), I first have to call array = connection.createArrayOf(Java array), then preparedStatement.setArray(X, array).

I ran a few tests and the latency seems to be pretty much the same. However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Any thoughts?

Thank you in advance,
Alessandro

Re: IN or ANY for batch queries

From
Craig Ringer
Date:
PostgreSQL translates IN (...) to = ANY(ARRAY[...]) internally. They're largely the same thing.

In theory for small IN lists it could specialize the query into a union with specialized plans for different values, but it doesn't do that.

Charles is right. For smaller sets use either form, whatever is convenient. For larger sets create a temp table, COPY to populate it, create indexes on it if you need to, and join on it.

Re: IN or ANY for batch queries

From
Vladimir Sitnikov
Date:
Hi,

>However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Right you are. In case you use Array, the whole thing is passed as a single bind, and it results in a single prepared statement.
Naive use of (?, ?, ?, ?, ?) would result in multiple prepared statements that would consume memory at both client and server sides.

1) Note: sometimes you might want to use (?, ?, ?) kind of syntax (e.g. to support DBs that cannot process arrays), then you might want to use "power of two" placeholders, and fill the excessive ones with null value or with a repetition of the last value.
For instance, if you want to pass 5 values 1,2,3,4,5 you'd better pass it as
(1,2,3,4,5,5,5,5) (of course, use ?,?,?,?, ?,?,?,? )

I do not advice you to use that extensively, however it is a nice to know trick.

2) As "array-based-SQL" is the same statement, you will get pretty much the same execution plan no matter how many values you put there.
I find that a plus since I do not want my application to flip to some nasty "table seq scan" as the number of values flips from 21 to 22 or whatever.

The execution plan might vary in case you use variable number of "?".

3) Of course, if you plan to move lots of data, then COPY and/or insert batch (+ reWriteBatchedInserts=true) might be your friends there.

4) Note: array support is limited, so currently you might fail to pass "array of composite type".

Vladimir

Re: IN or ANY for batch queries

From
Craig Ringer
Date:
On 28 May 2018 at 18:34, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

>However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Right you are. In case you use Array, the whole thing is passed as a single bind, and it results in a single prepared statement.
Naive use of (?, ?, ?, ?, ?) would result in multiple prepared statements that would consume memory at both client and server sides.


Right. I didn't think you'd do that, I assumed you'd interpolate properly quoted SQL literals. Not only is that more vulnerable to potential issues if your quoting is flawed, it often uses more memory and CPU for the string processing, so an array bind is definitely better.

I had my "writing TAP tests in Perl" hat on, not my "developing sane SQL" hat on. Sorry.
 
1) Note: sometimes you might want to use (?, ?, ?) kind of syntax (e.g. to support DBs that cannot process arrays), then you might want to use "power of two" placeholders, and fill the excessive ones with null value or with a repetition of the last value.
For instance, if you want to pass 5 values 1,2,3,4,5 you'd better pass it as
(1,2,3,4,5,5,5,5) (of course, use ?,?,?,?, ?,?,?,? )

I do not advice you to use that extensively, however it is a nice to know trick.

... ugh. Cool, but gross.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: IN or ANY for batch queries

From
Alessandro Gherardi
Date:
Hi,
Thanks everyone for your feedback. I have one more question.

I'm working on enhancing Hibernate to take advantage of WHERE ... = ANY(?) for batch queries and .The enhancement is specific to postgres so it has to be added to one of the Hibernate postgres "dialect" classes.

Hibernate has separate dialect classes for postgres 8.1, 8.2, 9, 9.1, 9.2, 9.3, 9.4 and 9.5. Each class extends the previous release class - i.e., 9.5 extends 9.4 which extends 9.3, etc.

I'm trying to figure out to which class I should add this enhancement. Ideally, it should be the class associated with the earliest release of postgres that supports this functionality. I tried to find out in which release WHERE ... = ANY(?) was first implemented but have had no luck.

Does anyone have any idea?

Thanks again,
Alessandro


On Monday, May 28, 2018, 7:01:40 AM MDT, Craig Ringer <craig@2ndquadrant.com> wrote:


On 28 May 2018 at 18:34, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

>However, since in my application the number of values in the array can vary, I'm wondering if using ANY has the benefit of causing the driver/DB to cache a smaller number of prepared statement.

Right you are. In case you use Array, the whole thing is passed as a single bind, and it results in a single prepared statement.
Naive use of (?, ?, ?, ?, ?) would result in multiple prepared statements that would consume memory at both client and server sides.


Right. I didn't think you'd do that, I assumed you'd interpolate properly quoted SQL literals. Not only is that more vulnerable to potential issues if your quoting is flawed, it often uses more memory and CPU for the string processing, so an array bind is definitely better.

I had my "writing TAP tests in Perl" hat on, not my "developing sane SQL" hat on. Sorry.
 
1) Note: sometimes you might want to use (?, ?, ?) kind of syntax (e.g. to support DBs that cannot process arrays), then you might want to use "power of two" placeholders, and fill the excessive ones with null value or with a repetition of the last value.
For instance, if you want to pass 5 values 1,2,3,4,5 you'd better pass it as
(1,2,3,4,5,5,5,5) (of course, use ?,?,?,?, ?,?,?,? )

I do not advice you to use that extensively, however it is a nice to know trick.

... ugh. Cool, but gross.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: IN or ANY for batch queries

From
Tom Lane
Date:
Alessandro Gherardi <alessandro.gherardi@yahoo.com> writes:
> I'm trying to figure out to which class I should add this enhancement. Ideally, it should be the class associated
withthe earliest release of postgres that supports this functionality. I tried to find out in which release WHERE ... =
ANY(?)was first implemented but have had no luck. 

Digging in the git history says it was 7.4:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL7_4_BR [bee217924] 2003-06-29 00:33:44 +0000

    Support expressions of the form 'scalar op ANY (array)' and
    'scalar op ALL (array)', where the operator is applied between the
    lefthand scalar and each element of the array.  The operator must
    yield boolean; the result of the construct is the OR or AND of the
    per-element results, respectively.

    Original coding by Joe Conway, after an idea of Peter's.  Rewritten
    by Tom to keep the implementation strictly separate from subqueries.

            regards, tom lane