Thread: "SELECT IN" Still Broken in 7.4b
I'm sure many on this list are sick of hearing about this problem, but it was on the fix list for 7.4, but doesn't appear to have been changed. You can see one of the many threads on the problem at: http://archives.postgresql.org/pgsql-sql/2003-05/msg00352.php Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN (<list of values>) take forever for high numbers of rows in the IN clause. We've done timing on 7.3 and 7.4b and there is no speed improvement on these queries. Does anyone know what the status of this bug is? -- _______________________________________________________________________ Front Logic Inc. Tel: 306.653.2725 x14 226 Pacific Ave or 1.800.521.4510 x14 Saskatoon, SK Fax: 306.653.0972 S7K 1N9 Canada Cell: 306.717.2550 http://www.frontlogic.com mike.winter@frontlogic.com
On Wed, 20 Aug 2003, Mike Winter wrote: > I'm sure many on this list are sick of hearing about this problem, but it > was on the fix list for 7.4, but doesn't appear to have been changed. IN (subselect) was changed for 7.4 (although I'm not sure of the list mentions the difference). I don't know of any major changes to IN (valuelist) though.
On Wed, 20 Aug 2003, Stephan Szabo wrote: > > On Wed, 20 Aug 2003, Mike Winter wrote: > > > I'm sure many on this list are sick of hearing about this problem, but it > > was on the fix list for 7.4, but doesn't appear to have been changed. > > IN (subselect) was changed for 7.4 (although I'm not sure of the list > mentions the difference). I don't know of any major changes to IN > (valuelist) though. Thanks, Stephan. I was really hoping that the IN(valuelist) was going to be changed at the same time, because it really is unusable for anything over a couple of thousand values. -- _______________________________________________________________________ Front Logic Inc. Tel: 306.653.2725 x14 226 Pacific Ave or 1.800.521.4510 x14 Saskatoon, SK Fax: 306.653.0972 S7K 1N9 Canada Cell: 306.717.2550 http://www.frontlogic.com mike.winter@frontlogic.com
> Thanks, Stephan. I was really hoping that the IN(valuelist) was going to > be changed at the same time, because it really is unusable for anything > over a couple of thousand values. Changed to do what? I suppose that the ability to combine several index scans via a bitmap would help to linearize those, but that is far from an IN(valuelist) specific enhancement.
Ensure your IN list is unique. You might find better times by through an indexed temp table. On Wed, 2003-08-20 at 16:32, Mike Winter wrote: > I'm sure many on this list are sick of hearing about this problem, but it > was on the fix list for 7.4, but doesn't appear to have been changed. > > You can see one of the many threads on the problem at: > http://archives.postgresql.org/pgsql-sql/2003-05/msg00352.php > > Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN > (<list of values>) take forever for high numbers of rows in the IN clause. > We've done timing on 7.3 and 7.4b and there is no speed improvement on > these queries. > > Does anyone know what the status of this bug is?
On Wed, 20 Aug 2003, Rod Taylor wrote: > > Thanks, Stephan. I was really hoping that the IN(valuelist) was going to > > be changed at the same time, because it really is unusable for anything > > over a couple of thousand values. > > Changed to do what? One possibility might be to act as if the valuelist was a table and do the IN as if it were that way, rather than treating it as a set of ORs. That would be basically like doing the temporary table solution, but without requiring the user to do it.
On Wed, 20 Aug 2003, Rod Taylor wrote: > Ensure your IN list is unique. You might find better times by through > an indexed temp table. That is what I ended up doing, but it's not a very elegant solution. MySQL does queries of this type orders of magnitudes faster than Postgres on large value lists, although I have no specific algorithmic solutions to offer for how to make it faster. I don't believe making the IN lists unique has any affect on performance. -- _______________________________________________________________________ Front Logic Inc. Tel: 306.653.2725 x14 226 Pacific Ave or 1.800.521.4510 x14 Saskatoon, SK Fax: 306.653.0972 S7K 1N9 Canada Cell: 306.717.2550 http://www.frontlogic.com mike.winter@frontlogic.com
Mike Winter wrote: > On Wed, 20 Aug 2003, Rod Taylor wrote: > >>Ensure your IN list is unique. You might find better times by through >>an indexed temp table. > > That is what I ended up doing, but it's not a very elegant solution. > MySQL does queries of this type orders of magnitudes faster than Postgres > on large value lists, although I have no specific algorithmic solutions to > offer for how to make it faster. I don't believe making the IN lists > unique has any affect on performance. > I have no idea whether it will be better or worse performance, but in 7.4 you could do: select blah from foo where id = any (ARRAY[list_of_literals]); Joe
Mike Winter <mike.winter@frontlogic.com> writes: > MySQL does queries of this type orders of magnitudes faster than Postgres > on large value lists, although I have no specific algorithmic solutions to > offer for how to make it faster. How large is "large", and what plan type are you getting (seq scan or multiple index scan)? Is it possible that the cost comes from planner overhead and not execution? Checking EXPLAIN ANALYZE reported time against actual elapsed time (cf psql's \timing option) would tell. regards, tom lane
On Wed, 2003-08-20 at 17:41, Stephan Szabo wrote: > On Wed, 20 Aug 2003, Rod Taylor wrote: > > > > Thanks, Stephan. I was really hoping that the IN(valuelist) was going to > > > be changed at the same time, because it really is unusable for anything > > > over a couple of thousand values. > > > > Changed to do what? > > One possibility might be to act as if the valuelist was a table and do the > IN as if it were that way, rather than treating it as a set of ORs. That > would be basically like doing the temporary table solution, but without > requiring the user to do it. Is the temp table version any faster? I realize it has a higher limit to the number of items you can have in the list.
On Wed, 20 Aug 2003, Rod Taylor wrote: > On Wed, 2003-08-20 at 17:41, Stephan Szabo wrote: > > On Wed, 20 Aug 2003, Rod Taylor wrote: > > > > > > Thanks, Stephan. I was really hoping that the IN(valuelist) was going to > > > > be changed at the same time, because it really is unusable for anything > > > > over a couple of thousand values. > > > > > > Changed to do what? > > > > One possibility might be to act as if the valuelist was a table and do the > > IN as if it were that way, rather than treating it as a set of ORs. That > > would be basically like doing the temporary table solution, but without > > requiring the user to do it. > > Is the temp table version any faster? I realize it has a higher limit > to the number of items you can have in the list. Within the scope of the new hashed IN stuff I believe so in at least some cases. I have a few million row table of integers where searching for values IN (~10000 values) takes longer than creating the temp table, copying into it and doing the in subquery. That's not a particularly meaningful test case, but sending the psql output to /dev/null gives me: create temp table/copy 10001 entries/select in subquery - .8 secselect in (value list 9998 entries) - ~ 2min 19 secexplainselect in (value list) - ~ 4.8 sec
On Wed, 20 Aug 2003, Stephan Szabo wrote: > > On Wed, 20 Aug 2003, Rod Taylor wrote: > > > > Thanks, Stephan. I was really hoping that the IN(valuelist) was going to > > > be changed at the same time, because it really is unusable for anything > > > over a couple of thousand values. > > > > Changed to do what? > > One possibility might be to act as if the valuelist was a table and do the > IN as if it were that way, rather than treating it as a set of ORs. That > would be basically like doing the temporary table solution, but without > requiring the user to do it. > for integers we use contrib/intarray as a workaround. In principle, it's possible to extend intarray to general array. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Stephan Szabo wrote: >On Wed, 20 Aug 2003, Rod Taylor wrote: >... > > >>Is the temp table version any faster? I realize it has a higher limit >>to the number of items you can have in the list. >> >> > >Within the scope of the new hashed IN stuff I believe so in at least some >cases. I have a few million row table of integers where searching for >values IN (~10000 values) takes longer than creating the temp table, >copying into it and doing the in subquery. That's not a particularly >meaningful test case, but sending the psql output to /dev/null gives me: ... > But where do your values come from in the first place? Couldn't you optimize your model so that you don't have to copy around such amounts of data? Regards, Dani
On Thu, 21 Aug 2003, Dani Oderbolz wrote: > Stephan Szabo wrote: > > >On Wed, 20 Aug 2003, Rod Taylor wrote: > >... > > > > > >>Is the temp table version any faster? I realize it has a higher limit > >>to the number of items you can have in the list. > >> > >> > > > >Within the scope of the new hashed IN stuff I believe so in at least some > >cases. I have a few million row table of integers where searching for > >values IN (~10000 values) takes longer than creating the temp table, > >copying into it and doing the in subquery. That's not a particularly > >meaningful test case, but sending the psql output to /dev/null gives me: ... > > > But where do your values come from in the first place? > Couldn't you optimize your model so that you don't have to copy around > such amounts of data? I wasn't the OP, I was doing a simple test as a comparison between the two forms of the queries to see if making a temp table and populating it and then doing the subselect form could ever be faster than the current conversion for valuelists to a sequence of or conditions. In 7.3, it probably was not possible for a conversion to in subselect to be faster, but with the new hash subquery stuff it was worth trying again.
On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote: > > against actual elapsed time (cf psql's \timing option) would tell. What is measured by the \timing option? The figures reported are slightly larger than those loged when the log_duration parameter is true. => select count(*) from foo;count --------362921 (1 row) Time: 5500.88 ms while in syslog, the timing was "LOG: duration: 5.499783 sec". There is a difference of 1.097 ms. In both cases what is this duration? Real time or CPU time? -- %!PS 297.6 420.9 translate 90 rotate 0 setgray gsave 0 1 1{pop 0 180 moveto 100 180 170 100 170 -10 curveto 180 -9 180 -9 190 -10 curveto 190 100 100 180 0 180 curveto fill 180 rotate}for grestore/Bookman-LightItalic findfont 240 scalefont setfont -151.536392 -63.7998886 moveto (bp)show showpage
> What is measured by the \timing option? The figures reported > are slightly larger than those loged when the log_duration parameter > is true. The time of the psql client. It will include round trip activity including network overhead.
Bertrand Petit <pgsql-sql@phoe.frmug.org> writes: > On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote: >> against actual elapsed time (cf psql's \timing option) would tell. > What is measured by the \timing option? Elapsed time ... as seen by the client, of course. > The figures reported > are slightly larger than those loged when the log_duration parameter > is true. Yeah, that's what I'd expect, especially if there is a network hop involved. regards, tom lane
On Fri, Aug 22, 2003 at 08:50:15AM -0400, Tom Lane wrote: > Bertrand Petit <pgsql-sql@phoe.frmug.org> writes: > > On Wed, Aug 20, 2003 at 04:32:19PM -0400, Tom Lane wrote: > >> against actual elapsed time (cf psql's \timing option) would tell. > > > What is measured by the \timing option? > > Elapsed time ... as seen by the client, of course. Is there a way to obtain the CPU time consumed by a query, including a breakdown between user- and kernel-land? -- %!PS 297.6 420.9 translate 90 rotate 0 setgray gsave 0 1 1{pop 0 180 moveto 100 180 170 100 170 -10 curveto 180 -9 180 -9 190 -10 curveto 190 100 100 180 0 180 curveto fill 180 rotate}for grestore/Bookman-LightItalic findfont 240 scalefont setfont -151.536392 -63.7998886 moveto (bp)show showpage
Mike Winter <mike.winter@frontlogic.com> writes: > Basically, queries of the form SELECT <rows> FROM <table> WHERE <row> IN > (<list of values>) take forever for high numbers of rows in the IN clause. > We've done timing on 7.3 and 7.4b and there is no speed improvement on > these queries. > Does anyone know what the status of this bug is? Try it in CVS tip ... regards, tom lane