Thread: optimizing large query with IN (...)

optimizing large query with IN (...)

From: "Marcus Andree S. Magalhaes"
Date: Wed, 10 Mar 2004 00:35:15 -0300
Guys,

I've got a Java program to tune. It connects via JDBC to a PostgreSQL
7.4.1 server running on Linux.

The program needs to update a counter on a somewhat large number of
rows, about 1200 in a ~130k-row table. The query is something like
the following:

UPDATE table SET par = par + 1
WHERE table.key IN ('value1', 'value2', ... , 'value1200' )

This query runs in a transaction (opened by a call to
setAutoCommit(false)), with a commit() issued right after the query
is sent to the backend.

The process of committing and updating the values is painfully slow
(no surprises here). Any ideas?

Thanks.



Re: optimizing large query with IN (...)

From: "Eric Jain"
> UPDATE table SET par = par + 1
> WHERE table.key IN ('value1', 'value2', ... , 'value1200' )

How fast is the query alone, i.e.

  SELECT * FROM table
  WHERE table.key IN ('value1', 'value2', ... , 'value1200' )


Re: optimizing large query with IN (...)

From: Christopher Kings-Lynne
>>UPDATE table SET par = par + 1
>>WHERE table.key IN ('value1', 'value2', ... , 'value1200' )
>
>
> How fast is the query alone, i.e.
>
>   SELECT * FROM table
>   WHERE table.key IN ('value1', 'value2', ... , 'value1200' )

Also, post the output of '\d table' and EXPLAIN ANALYZE UPDATE...
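
If it's easier to capture the plan from the Java side, a minimal sketch
along these lines should work (assuming the standard PostgreSQL JDBC
driver; note that EXPLAIN ANALYZE actually executes the UPDATE, so the
sketch rolls it back afterwards):

  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class ExplainUpdate {
      // Print the EXPLAIN ANALYZE output for an UPDATE without keeping
      // its effects: EXPLAIN ANALYZE really runs the statement, so we
      // wrap it in a transaction and roll back.
      static void printPlan(Connection conn, String updateSql)
              throws SQLException {
          conn.setAutoCommit(false);
          try (Statement st = conn.createStatement();
               ResultSet rs = st.executeQuery("EXPLAIN ANALYZE " + updateSql)) {
              while (rs.next()) {
                  System.out.println(rs.getString(1)); // one plan line per row
              }
          } finally {
              conn.rollback();
          }
      }
  }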

Chris


Re: optimizing large query with IN (...)

From: Steve Atkins
On Wed, Mar 10, 2004 at 12:35:15AM -0300, Marcus Andree S. Magalhaes wrote:
> Guys,
>
> I've got a Java program to tune. It connects via JDBC to a PostgreSQL
> 7.4.1 server running on Linux.
>
> The program needs to update a counter on a somewhat large number of
> rows, about 1200 in a ~130k-row table. The query is something like
> the following:
>
> UPDATE table SET par = par + 1
> WHERE table.key IN ('value1', 'value2', ... , 'value1200' )
>
> This query runs in a transaction (opened by a call to
> setAutoCommit(false)), with a commit() issued right after the query
> is sent to the backend.
>
> The process of committing and updating the values is painfully slow
> (no surprises here). Any ideas?

I posted an analysis of use of IN () like this a few weeks ago on
pgsql-general.

The approach you're using is optimal for < 3 values.

For any more than that, insert value1 ... value1200 into a temporary
table, then do

 UPDATE table SET par = par + 1
 WHERE table.key IN (SELECT value FROM temp_table);

Indexing the temporary table increases the speed, but only marginally.
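
A minimal JDBC sketch of this approach (the table mytable(key text,
par int) is a hypothetical stand-in for the placeholder names above,
and the standard PostgreSQL JDBC driver is assumed):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;
  import java.sql.Statement;
  import java.util.List;

  public class CounterUpdate {
      // Load the ~1200 keys into a temp table with a single batch,
      // then run one UPDATE driven by the subselect.
      static void bumpCounters(Connection conn, List<String> keys)
              throws SQLException {
          conn.setAutoCommit(false);
          try {
              Statement st = conn.createStatement();
              // Temp table is session-local; dropped explicitly below.
              st.execute("CREATE TEMP TABLE t_keys (value text)");

              PreparedStatement ps = conn.prepareStatement(
                      "INSERT INTO t_keys (value) VALUES (?)");
              for (String k : keys) {
                  ps.setString(1, k);
                  ps.addBatch();
              }
              ps.executeBatch(); // far fewer round trips than ~1200 inserts

              st.executeUpdate("UPDATE mytable SET par = par + 1"
                      + " WHERE key IN (SELECT value FROM t_keys)");
              st.execute("DROP TABLE t_keys");
              conn.commit();
          } catch (SQLException e) {
              conn.rollback();
              throw e;
          }
      }
  }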

Cheers,
  Steve



Re: optimizing large query with IN (...)

From: "Marcus Andree S. Magalhaes"
Date: Wed, 10 Mar 2004 14:02:23 -0300
Hmm... from the 'performance' point of view, since the data comes from
a quite complex select statement, isn't it better/quicker to have this
select replaced by a select into, creating a temporary database?



> The problem, as I understand it, is that 7.4 introduced massive
> improvements in handling moderately large in() clauses, as long as they
> can fit in sort_mem, and are provided by a subselect.
>
> So, creating a temp table with all the values in it and using in() on
> the temp table may be a win:
>
> begin;
> create temp table t_ids(id int);
> insert into t_ids(id) values (123); -- repeat a few hundred times
> select * from maintable where id in (select id from t_ids);
> ...




Re: optimizing large query with IN (...)

From: "scott.marlowe"
I'm not sure exactly what you're saying here. If the data in the in()
clause comes from a complex select, then just use that select in there
and bypass the temporary table idea.

I'm not sure what a temporary database is; did you mean temporary table?
If so, then my above comment addresses that point.
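
For example, with the select inlined directly (the inner select and the
table names here are purely hypothetical, standing in for the "quite
complex" one):

  import java.sql.Connection;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class InlineUpdate {
      // The complex select that produces the keys is inlined straight
      // into the IN () clause, so no values round-trip through Java.
      static int bumpCounters(Connection conn) throws SQLException {
          try (Statement st = conn.createStatement()) {
              return st.executeUpdate(
                      "UPDATE mytable SET par = par + 1 WHERE key IN"
                      + " (SELECT key FROM events WHERE hit_date = CURRENT_DATE)");
          }
      }
  }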

On Wed, 10 Mar 2004, Marcus Andree S. Magalhaes wrote:

>
> Hmm... from the 'performance' point of view, since the data comes from
> a quite complex select statement, isn't it better/quicker to have this
> select replaced by a select into, creating a temporary database?
>
>
>
> > The problem, as I understand it, is that 7.4 introduced massive
> > improvements in handling moderately large in() clauses, as long as they
> > can fit in sort_mem, and are provided by a subselect.
> >
> > So, creating a temp table with all the values in it and using in() on
> > the temp table may be a win:
> >
> > begin;
> > create temp table t_ids(id int);
> > insert into t_ids(id) values (123); -- repeat a few hundred times
> > select * from maintable where id in (select id from t_ids);
> > ...
>
>
>
>


Re: optimizing large query with IN (...)

From: Steve Atkins
On Wed, Mar 10, 2004 at 02:02:23PM -0300, Marcus Andree S. Magalhaes wrote:

> Hmm... from the 'performance' point of view, since the data comes from
> a quite complex select statement, isn't it better/quicker to have this
> select replaced by a select into, creating a temporary database?

Definitely - why loop the data into the application and back out again
if you don't need to?

> > The problem, as I understand it, is that 7.4 introduced massive
> > improvements in handling moderately large in() clauses, as long as they
> > can fit in sort_mem, and are provided by a subselect.
> >
> > So, creating a temp table with all the values in it and using in() on
> > the temp table may be a win:
> >
> > begin;
> > create temp table t_ids(id int);
> > insert into t_ids(id) values (123); -- repeat a few hundred times
> > select * from maintable where id in (select id from t_ids);
> > ...

Cheers,
  Steve

Re: optimizing large query with IN (...)

From: "scott.marlowe"
On Wed, 10 Mar 2004, Marcus Andree S. Magalhaes wrote:

>
> Guys,
>
> I've got a Java program to tune. It connects via JDBC to a PostgreSQL
> 7.4.1 server running on Linux.
>
> The program needs to update a counter on a somewhat large number of
> rows, about 1200 in a ~130k-row table. The query is something like
> the following:
>
> UPDATE table SET par = par + 1
> WHERE table.key IN ('value1', 'value2', ... , 'value1200' )
>
> This query runs in a transaction (opened by a call to
> setAutoCommit(false)), with a commit() issued right after the query
> is sent to the backend.
>
> The process of committing and updating the values is painfully slow
> (no surprises here). Any ideas?

The problem, as I understand it, is that 7.4 introduced massive
improvements in handling moderately large in() clauses, as long as they
can fit in sort_mem, and are provided by a subselect.

So, creating a temp table with all the values in it and using in() on the
temp table may be a win:

begin;
create temp table t_ids(id int);
insert into t_ids(id) values (123); -- repeat a few hundred times
select * from maintable where id in (select id from t_ids);
...
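
Since the win depends on the in() list fitting in sort_mem, it may also
be worth raising that for the session first. A small sketch (the 8192
value is purely illustrative):

  import java.sql.Connection;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class SortMemHelper {
      // sort_mem (7.x setting, in kilobytes) bounds the in-memory work
      // area used for the hashed IN() subselect; raise it for this
      // session only rather than server-wide.
      static void raiseSortMem(Connection conn) throws SQLException {
          try (Statement st = conn.createStatement()) {
              st.execute("SET sort_mem = 8192"); // illustrative value
          }
      }
  }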



Re: optimizing large query with IN (...)

From: Josh Berkus
Marcus,

> The problem, as I understand it, is that 7.4 introduced massive
> improvements in handling moderately large in() clauses, as long as they
> can fit in sort_mem, and are provided by a subselect.

Also, this problem may be fixed in 7.5, when it comes out.  It's a known
issue.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: optimizing large query with IN (...)

From: Joseph Shraibman
Marcus Andree S. Magalhaes wrote:
> Guys,
>
> I've got a Java program to tune. It connects via JDBC to a PostgreSQL
> 7.4.1 server running on Linux.
>
> The program needs to update a counter on a somewhat large number of
> rows, about 1200 in a ~130k-row table. The query is something like
> the following:
>
> UPDATE table SET par = par + 1
> WHERE table.key IN ('value1', 'value2', ... , 'value1200' )
>

How often do you update this counter? Under PostgreSQL's MVCC, each
update writes a new row version and invalidates the old one; the dead
versions then stick around until the next vacuum.
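
If the counters are updated frequently, vacuuming the table between
batches keeps the dead versions from piling up. A minimal sketch (the
table name is hypothetical); note that VACUUM cannot run inside a
transaction block, so autocommit has to be back on:

  import java.sql.Connection;
  import java.sql.SQLException;
  import java.sql.Statement;

  public class VacuumHelper {
      // Reclaim the dead row versions left behind by the bulk updates.
      // VACUUM cannot run inside a transaction block, so turn
      // autocommit back on before issuing it.
      static void vacuumCounters(Connection conn) throws SQLException {
          conn.setAutoCommit(true);
          try (Statement st = conn.createStatement()) {
              st.execute("VACUUM ANALYZE mytable"); // mytable is hypothetical
          }
      }
  }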