Thread: Select where id in (LARGE LIST) ?
I am working on a GUI application which has a list of ids of selected items. To perform an operation on the selected items I do a "select/update ... where id in (...)". There seems to be a limit of 100-200 items in the list. Is there a way of doing this with large lists, maybe 10,000 or 100,000 items long?

The best solution I have so far is to create a selection table, write the selection out to it and perform a join, but the 100,000 inserts are slow.

The ids are int8 (long), so we are not talking about large amounts of data, a couple of MB at most. Database and client are running on the same machine over localhost.

Many Thanks

Jasper
Jasper,

You can chunk your operation: use only 100 entries in one run, and repeat it until all ids have been processed. Use a prepared statement; that will save you some overhead. The last chunk will have fewer entries than there are parameter placeholders, so you will have to build a special last statement, or set the superfluous parameters to NULL or to one of the values from the last chunk, depending on what kind of query you have. We do all our data import/export this way, and it works fine.

Cheers,
Csaba.

On Fri, 2005-07-29 at 17:30, Jasper Potts wrote:
> To perform an operation on the selected items I do a
> "select/update ... where id in (...)". There seems to be a limit of
> 100-200 items in the list. Is there a way of doing this with large
> lists, maybe 10,000 or 100,000 items long?
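A minimal sketch of the chunking approach described above, assuming a main table named "items" with an int8 "id" column and a summed "amount" column, a fixed chunk size of 100, and padding of the last chunk by repeating one of its values (duplicates in an IN list are harmless); all names are illustrative, not from the original posts:

```java
import java.sql.*;
import java.util.List;

public class ChunkedSelect {
    private static final int CHUNK = 100;

    // Sums the "amount" column over the selected ids, 100 at a time,
    // reusing one prepared statement with 100 placeholders.
    static long sumSelected(Connection conn, List<Long> ids) throws SQLException {
        StringBuilder sql = new StringBuilder("SELECT sum(amount) FROM items WHERE id IN (");
        for (int i = 0; i < CHUNK; i++) sql.append(i == 0 ? "?" : ",?");
        sql.append(")");

        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int start = 0; start < ids.size(); start += CHUNK) {
                int end = Math.min(start + CHUNK, ids.size());
                for (int i = 0; i < CHUNK; i++) {
                    // For the last, short chunk, repeat the chunk's final value
                    // in the superfluous placeholders.
                    long value = ids.get(start + Math.min(i, end - start - 1));
                    ps.setLong(i + 1, value);
                }
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) total += rs.getLong(1);
                }
            }
        }
        return total;
    }
}
```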
I have been trying the chunk method, and it wins in cases where the number of ids is not too high. I was comparing:

(1) joining the main table with a table that has a column of ids and a boolean column for selected; both id columns have indexes.
(2) select over the main table where id in (...), done in chunks of 100 with a prepared statement, with the results accumulated in Java.

With a select that did a sum() of another column, the results were:

No. of Selected Items | Join Select Time (sec) | Chunk Select Time (sec)
                   30 |                    0.4 |                   0.007
                4,000 |                    0.5 |                    0.24
               30,000 |                    0.7 |                    1.12

All of these were with 30,000 rows in the main table, measured from JDBC.

These results don't take into account the time it takes to clear and then repopulate the select table with the selection data from the client. At the moment that can take 1-2 seconds, which is far from interactive for the user, and is why I am looking for a better method.

Any other suggestions? :-)

I would like to push the data size up to a million, which means the user could in theory select a million rows and apply an operation. This is going to get very slow with the chunk method.

Many Thanks

Jasper

Csaba Nagy wrote:
> You can chunk your operation: use only 100 entries in one run, and
> repeat it until all ids have been processed. Use a prepared statement;
> that will save you some overhead.
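For reference, the join variant (1) measured above would look roughly like this from JDBC; the table names "items" and "selection", the "selected" flag and the summed "amount" column are illustrative names, not taken from the thread:

```java
import java.sql.*;

public class JoinSelect {
    // Sum over the rows flagged in the selection table (variant 1).
    static long sumSelected(Connection conn) throws SQLException {
        String sql = "SELECT sum(i.amount) FROM items i "
                   + "JOIN selection s ON s.id = i.id WHERE s.selected";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }
}
```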
Jasper Potts wrote:
> The best solution I have so far is to create a selection table, write
> the selection out to it and perform a join, but the 100,000 inserts are slow.

How are you constructing this table? If you use batched inserts with autocommit off, it should be fairly fast; I get 8000+ inserts per second over larger data sets.

Can you move the selection list into the DB semi-permanently? Really, the underlying problem is that you need to move a 100,000-item list to the database on every SELECT. If you can have the list "already there", it's going to be faster.

-O
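A minimal sketch of the batched-insert approach Oliver suggests, with autocommit off; the "selection" table name and the flush size of 1000 rows are assumptions:

```java
import java.sql.*;
import java.util.List;

public class SelectionLoader {
    // Replaces the contents of the selection table with the given ids,
    // using JDBC batching inside a single transaction.
    static void load(Connection conn, List<Long> ids) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("DELETE FROM selection"); // clear the previous selection
        }
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO selection (id) VALUES (?)")) {
            int pending = 0;
            for (long id : ids) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++pending == 1000) { // flush every 1000 rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch();
        }
        conn.commit();
        conn.setAutoCommit(true);
    }
}
```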
I don't quite understand what you're doing here, but it smells to me like something which shouldn't be interactive in the first place. If some query is expected to exceed a few seconds, we make the operation asynchronous, i.e. the user starts it, gets a page saying the thing is currently being done, and then receives some kind of notification when it is finished, either on the web page (by periodically reloading) or by email for really long-lasting things.

Now, the chunking method does have an overhead, but it has a set of good properties too: it's easily interruptible, it won't bog your database down (it will let some other things go through between the chunks), and you won't get connection timeouts for really long operations... it's just more manageable in many respects.

And if you do have some long-lasting things, make sure you don't let the user hit reload 50 times... an asynchronous way of doing it will help in this regard too, because then you know what the user did and a reload will only show the status.

HTH,
Csaba.

On Fri, 2005-07-29 at 21:18, Jasper Potts wrote:
> I have been trying the chunk method, and it wins in cases where the
> number of ids is not too high.
> [...]
> I would like to push the data size up to a million, which means the user
> could in theory select a million rows and apply an operation. This is
> going to get very slow with the chunk method.
We are working on a desktop (Swing) application; postgresql is running embedded in the application and is hence single-user. We have a scrollable view of a grid of icons, much like Explorer. The data set is a list of item ids resulting from a complex query that can take 30 seconds. The results of this query are stored in a table in the database. The user then needs to be able to browse (scroll) through the list of items and make a selection, and can then perform operations on that selection. Those operations require a way for the selection to be used in a database query. The options are:

(1) Store the selection in memory on the client.
(2) Store the selection as a column in the query results table (boolean maybe).
(3) A hybrid solution where the selection is both in memory and in the database.

The problem with (1) is how to transfer the selection to the db if it gets large; this is where the chunking comes in. The problem with the pure selection-in-the-database route (2) is that simple selection operations, like clicking on an item to select only it, are too slow. (3) could be good but is complex to implement with threading/locking etc.

My current plan is to go with (1) but change the storage model for the selection. The first idea was just a set of ids of selected items; the problem is that if that set holds 100,000 ids it is hard to transfer to the WHERE part of a query. The new idea is to store it as a list of ranges, e.g. (26-32, 143-198, 10922-10923), which could then be sent as "WHERE (row >= 26 AND row <= 32) OR (row >= 143 AND row <= 198) OR (row >= 10922 AND row <= 10923)". As the user has to choose each range by hand, it is unlikely there would be more than 100 ranges.

The big question here is: what is the max length of a query sent through JDBC to PostgreSQL?

Many Thanks

Jasper

Csaba Nagy wrote:
> I don't quite understand what you're doing here, but it smells to me like
> something which shouldn't be interactive in the first place.
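A minimal sketch of building the range-based WHERE fragment described above; the column name "row" follows the example in the post, while the Range class and method names are illustrative:

```java
import java.util.List;

public class RangeFilter {
    // A simple inclusive range of row numbers.
    static class Range {
        final long lo, hi;
        Range(long lo, long hi) { this.lo = lo; this.hi = hi; }
    }

    // Builds e.g. "(row >= 26 AND row <= 32) OR (row >= 143 AND row <= 198)".
    static String buildWhere(List<Range> ranges) {
        StringBuilder sb = new StringBuilder();
        for (Range r : ranges) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append("(row >= ").append(r.lo)
              .append(" AND row <= ").append(r.hi).append(")");
        }
        return sb.toString();
    }
}
```

With at most a hundred hand-chosen ranges the resulting query text stays tiny, well below any of the limits discussed later in the thread.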
Jasper Potts wrote:
> The big question here is: what is the max length of a query sent through
> JDBC to PostgreSQL?

You'll hit backend limits (I'm not sure if there are any explicit limits there other than available memory) before you hit any protocol limits; the protocol limits kick in at around 1GB.

-O
In this case I wouldn't go with a boolean flag: setting it on a row will actually insert a new row version into your table (that's what updates really do in PostgreSQL). So you're better off with the range approach, keeping your ranges in memory.

I have no idea how big the query can get; I guess it should be somewhere in the documentation, in some chapter dealing with limitations...

Cheers,
Csaba.

On Tue, 2005-08-02 at 11:23, Jasper Potts wrote:
> My current plan is to go with (1) but change the storage model for the
> selection. [...] The new idea is to store it as a list of ranges, e.g.
> (26-32, 143-198, 10922-10923).
>
> The big question here is: what is the max length of a query sent through
> JDBC to PostgreSQL?
I know from when I was testing "SELECT ... WHERE id IN (...)" that I got a buffer overflow or similar exception from the postgresql native process at about 200-300 longs in the IN clause. I could do with finding out what the real limits are so I can protect against them. I will have a dig through the documentation later and see what I can find :-)

Thanks

Jasper

Oliver Jowett wrote:
> You'll hit backend limits (I'm not sure if there are any explicit limits
> there other than available memory) before you hit any protocol limits;
> the protocol limits kick in at around 1GB.
Jasper Potts wrote:
> I know from when I was testing "SELECT ... WHERE id IN (...)" that I got
> a buffer overflow or similar exception from the postgresql native
> process at about 200-300 longs in the IN clause.

Well, I'd need to see the actual exception, not a vague description of it, to comment further.

> I could do with finding out what the real limits are so I can protect
> against them. I will have a dig through the documentation later and see
> what I can find :-)

There may be a (backend) limit on the number of parameters in a query, not sure.

-O
Oliver Jowett <oliver@opencloud.com> writes:
> There may be a (backend) limit on the number of parameters in a query,
> not sure.

I'm pretty certain there's no hard limit. However, I doubt anyone has tested with thousands of parameters ... there might well be performance issues, such as loops that are O(N^2) in the number of parameters. Feel free to send along problem cases.

			regards, tom lane
Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
>> There may be a (backend) limit on the number of parameters in a query,
>> not sure.
>
> I'm pretty certain there's no hard limit. However, I doubt anyone has
> tested with thousands of parameters ... there might well be performance
> issues, such as loops that are O(N^2) in the number of parameters.
> Feel free to send along problem cases.

With a really big IN list (16384 entries triggered it, 8192 didn't) I get "ERROR: stack depth limit exceeded", though I guess this is more a limit in the handling of IN than a number-of-parameters limit per se.

-O
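For anyone wanting to reproduce this kind of test, a rough sketch follows. It builds the IN list from literal values rather than bound parameters, assumes a table named "items" with an int8 id column, and the point at which the error appears will depend on the backend's max_stack_depth setting:

```java
import java.sql.*;

public class BigInListTest {
    // Runs "SELECT count(*) ... WHERE id IN (0,1,...,n-1)" and prints the result,
    // or lets the SQLException (e.g. "stack depth limit exceeded") propagate.
    static void run(Connection conn, int n) throws SQLException {
        StringBuilder sql = new StringBuilder("SELECT count(*) FROM items WHERE id IN (");
        for (int i = 0; i < n; i++) {
            if (i > 0) sql.append(',');
            sql.append(i);
        }
        sql.append(')');
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql.toString())) {
            if (rs.next()) System.out.println(n + " values -> " + rs.getLong(1) + " rows matched");
        }
    }
}
```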
Oliver Jowett wrote:
> With a really big IN list (16384 entries triggered it, 8192 didn't) I get
> "ERROR: stack depth limit exceeded", though I guess this is more a limit
> in the handling of IN than a number-of-parameters limit per se.

And the end of the stack trace is this:

(... lots of transformExpr() recursion ...)
#11909 0x000000000048d4d2 in transformExpr ()
#11910 0x000000000048d4d2 in transformExpr ()
#11911 0x000000000048d4d2 in transformExpr ()
#11912 0x000000000048aa0e in transformWhereClause ()
#11913 0x0000000000479efc in transformStmt ()
#11914 0x000000000047b596 in do_parse_analyze ()
#11915 0x000000000047cad9 in parse_analyze_varparams ()
#11916 0x00000000005446f5 in exec_parse_message ()
#11917 0x0000000000545fc3 in PostgresMain ()
#11918 0x000000000051da0d in ServerLoop ()
#11919 0x000000000051ee6f in PostmasterMain ()
#11920 0x00000000004ee0af in main ()

-O