Thread: "SELECT IN" Still Broken in 7.4b

"SELECT IN" Still Broken in 7.4b

From

Mike Winter

Date:

20 August 2003, 13:19:29

I'm sure many on this list are sick of hearing about this problem, but it
was on the fix list for 7.4, but doesn't appear to have been changed.

You can see one of the many threads on the problem at:
http://archives.postgresql.org/pgsql-sql/2003-05/msg00352.php

Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN
(<list of values>) take forever for high numbers of rows in the IN clause.
We've done timing on 7.3 and 7.4b and there is no speed improvement on
these queries.

Does anyone know what the status of this bug is?
-- 
_______________________________________________________________________
Front Logic Inc.                                  Tel: 306.653.2725 x14
226 Pacific Ave                                   or 1.800.521.4510 x14
Saskatoon, SK                                     Fax: 306.653.0972
S7K 1N9  Canada                      Cell: 306.717.2550
http://www.frontlogic.com                   mike.winter@frontlogic.com

Re: "SELECT IN" Still Broken in 7.4b

From

Stephan Szabo

Date:

20 August 2003, 13:27:47

On Wed, 20 Aug 2003, Mike Winter wrote:

> I'm sure many on this list are sick of hearing about this problem, but it
> was on the fix list for 7.4, but doesn't appear to have been changed.

IN (subselect) was changed for 7.4 (although I'm not sure of the list
mentions the difference). I don't know of any major changes to IN
(valuelist) though.

Re: "SELECT IN" Still Broken in 7.4b

From

Mike Winter

Date:

20 August 2003, 13:32:19

On Wed, 20 Aug 2003, Stephan Szabo wrote:

>
> On Wed, 20 Aug 2003, Mike Winter wrote:
>
> > I'm sure many on this list are sick of hearing about this problem, but it
> > was on the fix list for 7.4, but doesn't appear to have been changed.
>
> IN (subselect) was changed for 7.4 (although I'm not sure of the list
> mentions the difference). I don't know of any major changes to IN
> (valuelist) though.

Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
be changed at the same time, because it really is unusable for anything
over a couple of thousand values.
-- 
_______________________________________________________________________
Front Logic Inc.                                  Tel: 306.653.2725 x14
226 Pacific Ave                                   or 1.800.521.4510 x14
Saskatoon, SK                                     Fax: 306.653.0972
S7K 1N9  Canada                      Cell: 306.717.2550
http://www.frontlogic.com                   mike.winter@frontlogic.com

Re: "SELECT IN" Still Broken in 7.4b

From

Rod Taylor

Date:

20 August 2003, 14:10:05

> Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
> be changed at the same time, because it really is unusable for anything
> over a couple of thousand values.

Changed to do what?

I suppose that the ability to combine several index scans via a bitmap
would help to linearize those, but that is far from an IN(valuelist)
specific enhancement.

Re: "SELECT IN" Still Broken in 7.4b

From

Rod Taylor

Date:

20 August 2003, 14:10:56

Ensure your IN list is unique.  You might find better times by through
an indexed temp table.

On Wed, 2003-08-20 at 16:32, Mike Winter wrote:
> I'm sure many on this list are sick of hearing about this problem, but it
> was on the fix list for 7.4, but doesn't appear to have been changed.
>
> You can see one of the many threads on the problem at:
> http://archives.postgresql.org/pgsql-sql/2003-05/msg00352.php
>
> Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN
> (<list of values>) take forever for high numbers of rows in the IN clause.
> We've done timing on 7.3 and 7.4b and there is no speed improvement on
> these queries.
>
> Does anyone know what the status of this bug is?

Re: "SELECT IN" Still Broken in 7.4b

From

Stephan Szabo

Date:

20 August 2003, 14:41:22

On Wed, 20 Aug 2003, Rod Taylor wrote:

> > Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
> > be changed at the same time, because it really is unusable for anything
> > over a couple of thousand values.
>
> Changed to do what?

One possibility might be to act as if the valuelist was a table and do the
IN as if it were that way, rather than treating it as a set of ORs.  That
would be basically like doing the temporary table solution, but without
requiring the user to do it.

Re: "SELECT IN" Still Broken in 7.4b

From

Mike Winter

Date:

20 August 2003, 15:37:57

On Wed, 20 Aug 2003, Rod Taylor wrote:

> Ensure your IN list is unique.  You might find better times by through
> an indexed temp table.

That is what I ended up doing, but it's not a very elegant solution.
MySQL does queries of this type orders of magnitudes faster than Postgres
on large value lists, although I have no specific algorithmic solutions to
offer for how to make it faster.  I don't believe making the IN lists
unique has any affect on performance.

-- 
_______________________________________________________________________
Front Logic Inc.                                  Tel: 306.653.2725 x14
226 Pacific Ave                                   or 1.800.521.4510 x14
Saskatoon, SK                                     Fax: 306.653.0972
S7K 1N9  Canada                      Cell: 306.717.2550
http://www.frontlogic.com                   mike.winter@frontlogic.com

Re: "SELECT IN" Still Broken in 7.4b

From

Joe Conway

Date:

20 August 2003, 16:31:24

Mike Winter wrote:
> On Wed, 20 Aug 2003, Rod Taylor wrote:
> 
>>Ensure your IN list is unique.  You might find better times by through
>>an indexed temp table.
> 
> That is what I ended up doing, but it's not a very elegant solution.
> MySQL does queries of this type orders of magnitudes faster than Postgres
> on large value lists, although I have no specific algorithmic solutions to
> offer for how to make it faster.  I don't believe making the IN lists
> unique has any affect on performance.
> 

I have no idea whether it will be better or worse performance, but in 
7.4 you could do:

select blah from foo where id = any (ARRAY[list_of_literals]);

Joe

Re: "SELECT IN" Still Broken in 7.4b

From

Tom Lane

Date:

20 August 2003, 17:32:35

Mike Winter <mike.winter@frontlogic.com> writes:
> MySQL does queries of this type orders of magnitudes faster than Postgres
> on large value lists, although I have no specific algorithmic solutions to
> offer for how to make it faster.

How large is "large", and what plan type are you getting (seq scan or
multiple index scan)?  Is it possible that the cost comes from planner
overhead and not execution?  Checking EXPLAIN ANALYZE reported time
against actual elapsed time (cf psql's \timing option) would tell.
        regards, tom lane

Re: "SELECT IN" Still Broken in 7.4b

From

Rod Taylor

Date:

20 August 2003, 17:41:34

On Wed, 2003-08-20 at 17:41, Stephan Szabo wrote:
> On Wed, 20 Aug 2003, Rod Taylor wrote:
>
> > > Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
> > > be changed at the same time, because it really is unusable for anything
> > > over a couple of thousand values.
> >
> > Changed to do what?
>
> One possibility might be to act as if the valuelist was a table and do the
> IN as if it were that way, rather than treating it as a set of ORs.  That
> would be basically like doing the temporary table solution, but without
> requiring the user to do it.

Is the temp table version any faster?  I realize it has a higher limit
to the number of items you can have in the list.

Re: "SELECT IN" Still Broken in 7.4b

From

Stephan Szabo

Date:

20 August 2003, 18:55:33

On Wed, 20 Aug 2003, Rod Taylor wrote:

> On Wed, 2003-08-20 at 17:41, Stephan Szabo wrote:
> > On Wed, 20 Aug 2003, Rod Taylor wrote:
> >
> > > > Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
> > > > be changed at the same time, because it really is unusable for anything
> > > > over a couple of thousand values.
> > >
> > > Changed to do what?
> >
> > One possibility might be to act as if the valuelist was a table and do the
> > IN as if it were that way, rather than treating it as a set of ORs.  That
> > would be basically like doing the temporary table solution, but without
> > requiring the user to do it.
>
> Is the temp table version any faster?  I realize it has a higher limit
> to the number of items you can have in the list.

Within the scope of the new hashed IN stuff I believe so in at least some
cases.  I have a few million row table of integers where searching for
values IN (~10000 values) takes longer than creating the temp table,
copying into it and doing the in subquery.  That's not a particularly
meaningful test case, but sending the psql output to /dev/null gives me:
create temp table/copy 10001 entries/select in subquery - .8 secselect in (value list 9998 entries) - ~ 2min 19
secexplainselect in (value list) - ~ 4.8 sec

Re: "SELECT IN" Still Broken in 7.4b

From

Oleg Bartunov

Date:

21 August 2003, 04:39:36

On Wed, 20 Aug 2003, Stephan Szabo wrote:

>
> On Wed, 20 Aug 2003, Rod Taylor wrote:
>
> > > Thanks, Stephan.  I was really hoping that the IN(valuelist) was going to
> > > be changed at the same time, because it really is unusable for anything
> > > over a couple of thousand values.
> >
> > Changed to do what?
>
> One possibility might be to act as if the valuelist was a table and do the
> IN as if it were that way, rather than treating it as a set of ORs.  That
> would be basically like doing the temporary table solution, but without
> requiring the user to do it.
>

for integers we use contrib/intarray as a workaround. In principle,
it's possible to extend intarray to general array.

>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: "SELECT IN" Still Broken in 7.4b

From

Dani Oderbolz

Date:

21 August 2003, 07:05:36

Stephan Szabo wrote:

>On Wed, 20 Aug 2003, Rod Taylor wrote:
>...
>  
>
>>Is the temp table version any faster?  I realize it has a higher limit
>>to the number of items you can have in the list.
>>    
>>
>
>Within the scope of the new hashed IN stuff I believe so in at least some
>cases.  I have a few million row table of integers where searching for
>values IN (~10000 values) takes longer than creating the temp table,
>copying into it and doing the in subquery.  That's not a particularly
>meaningful test case, but sending the psql output to /dev/null gives me: ...
>
But where do your values come from in the first place?
Couldn't you optimize your model so that you don't have to copy around
such amounts of data?

Regards,
Dani

Re: "SELECT IN" Still Broken in 7.4b

From

Stephan Szabo

Date:

21 August 2003, 14:29:27

On Thu, 21 Aug 2003, Dani Oderbolz wrote:

> Stephan Szabo wrote:
>
> >On Wed, 20 Aug 2003, Rod Taylor wrote:
> >...
> >
> >
> >>Is the temp table version any faster?  I realize it has a higher limit
> >>to the number of items you can have in the list.
> >>
> >>
> >
> >Within the scope of the new hashed IN stuff I believe so in at least some
> >cases.  I have a few million row table of integers where searching for
> >values IN (~10000 values) takes longer than creating the temp table,
> >copying into it and doing the in subquery.  That's not a particularly
> >meaningful test case, but sending the psql output to /dev/null gives me: ...
> >
> But where do your values come from in the first place?
> Couldn't you optimize your model so that you don't have to copy around
> such amounts of data?

I wasn't the OP, I was doing a simple test as a comparison between
the two forms of the queries to see if making a temp table and populating
it and then doing the subselect form could ever be faster than the current
conversion for valuelists to a sequence of or conditions.  In 7.3, it
probably was not possible for a conversion to in subselect to be faster,
but with the new hash subquery stuff it was worth trying again.

Re: "SELECT IN" Still Broken in 7.4b

From

Bertrand Petit

Date:

22 August 2003, 03:37:46

On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote:
>
> against actual elapsed time (cf psql's \timing option) would tell.
What is measured by the \timing option? The figures reported
are slightly larger than those loged when the log_duration parameter
is true.

=> select count(*) from foo;count  
--------362921
(1 row)

Time: 5500.88 ms

while in syslog, the timing was "LOG: duration: 5.499783 sec". There
is a difference of 1.097 ms.
In both cases what is this duration? Real time or CPU time?

-- 
%!PS
297.6 420.9 translate 90 rotate 0 setgray gsave 0 1 1{pop 0 180 moveto 100
180 170 100 170 -10 curveto 180 -9 180 -9 190 -10 curveto 190 100 100 180
0 180 curveto fill 180 rotate}for grestore/Bookman-LightItalic findfont
240 scalefont setfont -151.536392 -63.7998886 moveto (bp)show showpage

Re: "SELECT IN" Still Broken in 7.4b

From

Rod Taylor

Date:

22 August 2003, 09:38:16

>     What is measured by the \timing option? The figures reported
> are slightly larger than those loged when the log_duration parameter
> is true.

The time of the psql client.  It will include round trip activity
including network overhead.

Re: "SELECT IN" Still Broken in 7.4b

From

Tom Lane

Date:

22 August 2003, 09:50:48

Bertrand Petit <pgsql-sql@phoe.frmug.org> writes:
> On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote:
>> against actual elapsed time (cf psql's \timing option) would tell.

>     What is measured by the \timing option?

Elapsed time ... as seen by the client, of course.

> The figures reported
> are slightly larger than those loged when the log_duration parameter
> is true.

Yeah, that's what I'd expect, especially if there is a network hop
involved.
        regards, tom lane

Re: "SELECT IN" Still Broken in 7.4b

From

Bertrand Petit

Date:

22 August 2003, 10:11:44

On Fri, Aug 22, 2003 at 08:50:15AM -0400, Tom Lane wrote:
> Bertrand Petit <pgsql-sql@phoe.frmug.org> writes:
> > On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote:
> >> against actual elapsed time (cf psql's \timing option) would tell.
> 
> >     What is measured by the \timing option?
> 
> Elapsed time ... as seen by the client, of course.
Is there a way to obtain the CPU time consumed by a query,
including a breakdown between user- and kernel-land?

-- 
%!PS
297.6 420.9 translate 90 rotate 0 setgray gsave 0 1 1{pop 0 180 moveto 100
180 170 100 170 -10 curveto 180 -9 180 -9 190 -10 curveto 190 100 100 180
0 180 curveto fill 180 rotate}for grestore/Bookman-LightItalic findfont
240 scalefont setfont -151.536392 -63.7998886 moveto (bp)show showpage

Re: "SELECT IN" Still Broken in 7.4b

From

Tom Lane

Date:

24 August 2003, 18:39:20

Mike Winter <mike.winter@frontlogic.com> writes:
> Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN
> (<list of values>) take forever for high numbers of rows in the IN clause.
> We've done timing on 7.3 and 7.4b and there is no speed improvement on
> these queries.

> Does anyone know what the status of this bug is?

Try it in CVS tip ...
        regards, tom lane