Thread: IN list processing performance (yet again)
Having grepped the web, it's clear that this isn't the first or last time this issue will be raised. My application relies heavily on IN lists. The lists are primarily constant integers, so queries look like:

SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)

Performance is critical, and the size of these lists depends a lot on how the larger 3-tier application is used, but it wouldn't be out of the question to retrieve 3000-10000 items. PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

I ran an experiment with queries on a table of two integers (ID, VAL), where ID is a primary key and the subject of the IN list predicates. The test used a table with one million rows. ID is appropriately indexed, and I VACUUM/analyzed the database after the table load. I ran tests on IN lists from about 100 to 100,000 entries. I also ran tests where I picked the rows out one by one using parameterized statements, e.g.:

SELECT val FROM table WHERE id = ?

I'm trying to decide how much I should use parameterized statements and when to work around buffer size limitations in JDBC transport, general query processing, etc. So here are my questions as a prelude to the data noted below:

1) What is the acceptable limit of JDBC Statement buffer sizes for executeQuery()? (Presumably it's not really a JDBC question, but a PostgreSQL query/buffering/transport question.)

2) What is the expected acceptable limit for the number of items in an IN list predicate such as those used here? (Lists of constants, not subselects.)

3) What should I expect for performance capabilities/limitations from PostgreSQL on this type of problem? (Set my expectations; perhaps they're unrealistic.)

4) What are my alternatives for picking specific rows for thousands of elements with sub-second response times, if not IN lists? (This is crucial!)
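For concreteness, here is roughly what the two access patterns under test look like in JDBC. This is a minimal sketch, not the exact test code (the full program is attached below); it uses my test table's names (MILLION, ID, VAL).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class InListQueryShapes
{
    // Pattern 1: one big IN list of integer constants, one round trip.
    static String buildInListQuery(int[] ids)
    {
        StringBuilder sb = new StringBuilder("SELECT VAL FROM MILLION WHERE ID IN (");
        for (int i = 0; i < ids.length; i++)
        {
            if (i > 0) sb.append(',');
            sb.append(ids[i]);
        }
        return sb.append(')').toString();
    }

    // Pattern 2: one parameterized statement, executed once per ID
    // (N round trips, but the plan is reused).
    static void fetchOneByOne(Connection conn, int[] ids) throws Exception
    {
        PreparedStatement ps = conn.prepareStatement("SELECT VAL FROM MILLION WHERE ID = ?");
        for (int id : ids)
        {
            ps.setInt(1, id);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) { /* consume rs.getInt(1) */ }
            rs.close();
        }
        ps.close();
    }
}
```

buildInListQuery(new int[]{43, 49, 1001}) yields "SELECT VAL FROM MILLION WHERE ID IN (43,49,1001)".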
---------------------------------------------
Here is a summary of my observations of query times; I've attached the test program (more on that below).

1) PostgreSQL exhibits worse-than-linear performance behavior with respect to IN list size. This is bad.

2) Parameterized statements exhibit the expected linear performance characteristics.

3) The break-even point for using IN lists vs. parameterized statements in my environment (RedHat Linux 9.0, PostgreSQL 7.3.2, 512MB memory, 7200RPM 100UDMA IDE disks, AMD 1600MHz) is about 700 items in the IN list. Beyond that, the IN list scalability curve makes it impractical.

4) God help you if you haven't vacuum/analyzed the newly loaded table. Without this, IN list processing is even worse! For just 10 elements in the IN list:

*Without* VACUUMDB, IN lists suck beyond measure:
Elapsed time for 10 IN list elements: 2638 ms
Elapsed time for 10 parameterized elements: 9 ms

*With* VACUUMDB, IN lists recover a bit:
Elapsed time for 10 IN list elements: 10 ms
Elapsed time for 10 parameterized elements: 24 ms

However, it's VERY interesting to note that parameterized statements worked well either way, which implies a probable disparity in plan generation. It's worth noting that it didn't *feel* like I was getting the same delay when I ran the query from the 'psql' client, but since it doesn't report times I can't be sure.

5) The rest of my results are with a vacuumed database (and are listed in the attached program in detail). The interesting points are:

For an IN list of 700 elements:
MySQL 3.23.56 (with INNODB tables) takes 19 ms, 73 ms with parameterized statements.
PostgreSQL takes 269 ms, 263 ms with parameterized statements.

For larger lists, MySQL happily processed a 90,000-element IN list in 1449 ms, 9283 ms using parameterized statements. PostgreSQL craps out trying to process 8000 elements with the error:

out of free buffers: time to abort!

PostgreSQL takes 45,566 ms for 7000 elements in an IN list (ouch!), and 2027 ms for a parameterized statement.
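As one possible partial answer to my own question 4: since the curve only goes bad past the break-even point, a workaround I'm considering is to split a large ID set into chunks below that point and issue one IN-list query per chunk. A sketch, untested against the numbers above (the 500-element chunk size is just a guess comfortably under the ~700 break-even):

```java
import java.util.ArrayList;
import java.util.List;

class InListChunker
{
    // Split a large ID set into pieces of at most maxPerQuery elements, so
    // each generated IN list stays on the linear part of the performance
    // curve. The caller issues one IN-list query per returned chunk.
    static List<int[]> chunk(int[] ids, int maxPerQuery)
    {
        List<int[]> chunks = new ArrayList<int[]>();
        for (int start = 0; start < ids.length; start += maxPerQuery)
        {
            int len = Math.min(maxPerQuery, ids.length - start);
            int[] piece = new int[len];
            System.arraycopy(ids, start, piece, 0, len);
            chunks.add(piece);
        }
        return chunks;
    }
}
```

E.g., 10,000 IDs with a 500-element chunk size become 20 queries; whether 20 round trips of 500 beats 10,000 parameterized executions is exactly the kind of thing the attached program could be extended to measure.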
MySQL easily beats those PostgreSQL times with 10 times the data.

6) Using a remote client on the LAN (10/100) to run the client piece on a separate machine from the database server yielded little change in the results. Prepared statements worked pretty well even with actual wire latency, surprise! (Of course there's very little latency on my LAN; it's not like, say, the database server running in a different city, which customers have been known to do.)

The MySQL and PostgreSQL installations are the default RedHat 9.0 distribution packages; I haven't tweaked either's installation parameters. (Though MySQL was updated by RedHat from 3.23.54a to 3.23.56 as part of an automatic upgrade.)

My goal here isn't to say "why aren't you like MySQL". I've been using PostgreSQL for a year as the development database of choice with satisfactory results. But I won't be able to support customers who want to use PostgreSQL on large deployments of my products if I can't selectively retrieve data for several thousand elements in sub-second times. (PostgreSQL devos, if you want a feel-good bullet, note that I don't support MySQL at all, since lack of MVCC transactions is a showstopper from a multi-user performance standpoint.)

So I'm looking for (a) solutions and answers to my questions above, and (b) a statement of "we're on this" or "you're stuck with it" from PostgreSQL devos who know.

----------------------------------------------------------------
On the attached program (a Java JDBC program): sorry, you can't just run it "as is". Search for formfeeds (^L) if you want to skip all the result data I logged. Compilation is straightforward; simply supply the location of your JDBC jar file for compiling and running (mine is noted in the program). First, move the "if (false)" element below the table creation statements and run the program to create the table. Then VACUUM/analyze your database (suuuuure would be nice not to have to vacuum).
Then move the "if (false)" element above the table creation so you won't have to do it every time. Then move past the formfeed and adjust the 'tryAmount' for loop to test the amounts you're interested in; 100 to 1000 by 100's is a good starting point. Ignore the section (part of the "if (false)" logic) that attempts to figure out what the largest query size is, unless you want to reproduce a hung postmaster in a CPU loop, which is what I got when I tried to run that logic. (Though when I ran it I was missing the closing ')' in my IN list, which I've since added to that code.)

Thanks for any help!

Dave

import java.sql.* ;

/*
Query huge in-lists; test for buffer overflow and performance using in-lists
vs. parameterized statements, and perhaps batched statements (not yet).

Results:

MySQL blows PostgreSQL away here, and that's AFTER doing a VACUUMDB in
PostgreSQL. PostgreSQL does OK with VACUUMDB up to 700 entries; beyond that
the prepared statements are better.
MySQL: IN lists are always superior, to any practical limit (tested to
100,000 elements).

Local-to-local on AMD 1600MHz, MySQL 3.23.56 with JDBC driver 3.0.8
[MySQL was much better here...]
Table update time: 109927 ms For 10,000 -> 100,000 by 10,000 Elapsed time for 90000 IN list elements: 1449 ms 62 elements/ms Elapsed time for 90000 parameterized elements: 9283 ms 9 elements/ms For 100 -> 10,000 by 100 Last received via IN list: 99 Elapsed time for 100 IN list elements: 6 ms 16 elements/ms Elapsed time for 100 parameterized elements: 20 ms 5 elements/ms Last received via parameterized elements: 99 Last received via IN list: 199 Elapsed time for 200 IN list elements: 7 ms 28 elements/ms Elapsed time for 200 parameterized elements: 45 ms 4 elements/ms Last received via parameterized elements: 199 Last received via IN list: 299 Elapsed time for 300 IN list elements: 22 ms 13 elements/ms Elapsed time for 300 parameterized elements: 64 ms 4 elements/ms Last received via parameterized elements: 299 Last received via IN list: 399 Elapsed time for 400 IN list elements: 39 ms 10 elements/ms Elapsed time for 400 parameterized elements: 63 ms 6 elements/ms Last received via parameterized elements: 399 Last received via IN list: 499 Elapsed time for 500 IN list elements: 12 ms 41 elements/ms Elapsed time for 500 parameterized elements: 116 ms 4 elements/ms Last received via parameterized elements: 499 Last received via IN list: 599 Elapsed time for 600 IN list elements: 8 ms 75 elements/ms Elapsed time for 600 parameterized elements: 65 ms 9 elements/ms Last received via parameterized elements: 599 Last received via IN list: 699 Elapsed time for 700 IN list elements: 19 ms 36 elements/ms Elapsed time for 700 parameterized elements: 73 ms 9 elements/ms Last received via parameterized elements: 699 Last received via IN list: 799 Elapsed time for 800 IN list elements: 12 ms 66 elements/ms Elapsed time for 800 parameterized elements: 84 ms 9 elements/ms Last received via parameterized elements: 799 Last received via IN list: 899 Elapsed time for 900 IN list elements: 12 ms 75 elements/ms Elapsed time for 900 parameterized elements: 97 ms 9 elements/ms Last 
received via parameterized elements: 899 Last received via IN list: 999 Elapsed time for 1000 IN list elements: 15 ms 66 elements/ms Elapsed time for 1000 parameterized elements: 109 ms 9 elements/ms Last received via parameterized elements: 999 Last received via IN list: 1099 Elapsed time for 1100 IN list elements: 16 ms 68 elements/ms Elapsed time for 1100 parameterized elements: 115 ms 9 elements/ms Last received via parameterized elements: 1099 Last received via IN list: 1199 Elapsed time for 1200 IN list elements: 16 ms 75 elements/ms Elapsed time for 1200 parameterized elements: 138 ms 8 elements/ms Last received via parameterized elements: 1199 Last received via IN list: 1299 Elapsed time for 1300 IN list elements: 19 ms 68 elements/ms Elapsed time for 1300 parameterized elements: 136 ms 9 elements/ms Last received via parameterized elements: 1299 Last received via IN list: 1399 Elapsed time for 1400 IN list elements: 21 ms 66 elements/ms Elapsed time for 1400 parameterized elements: 150 ms 9 elements/ms Last received via parameterized elements: 1399 Last received via IN list: 1499 Elapsed time for 1500 IN list elements: 21 ms 71 elements/ms Elapsed time for 1500 parameterized elements: 158 ms 9 elements/ms Last received via parameterized elements: 1499 Last received via IN list: 1599 Elapsed time for 1600 IN list elements: 22 ms 72 elements/ms Elapsed time for 1600 parameterized elements: 178 ms 8 elements/ms Last received via parameterized elements: 1599 Last received via IN list: 1699 Elapsed time for 1700 IN list elements: 22 ms 77 elements/ms Elapsed time for 1700 parameterized elements: 180 ms 9 elements/ms Last received via parameterized elements: 1699 Last received via IN list: 1799 Elapsed time for 1800 IN list elements: 26 ms 69 elements/ms Elapsed time for 1800 parameterized elements: 205 ms 8 elements/ms Last received via parameterized elements: 1799 Last received via IN list: 1899 Elapsed time for 1900 IN list elements: 57 ms 33 elements/ms 
Elapsed time for 1900 parameterized elements: 198 ms 9 elements/ms Last received via parameterized elements: 1899 Last received via IN list: 1999 Elapsed time for 2000 IN list elements: 29 ms 68 elements/ms Elapsed time for 2000 parameterized elements: 215 ms 9 elements/ms Last received via parameterized elements: 1999 Last received via IN list: 2099 Elapsed time for 2100 IN list elements: 30 ms 70 elements/ms Elapsed time for 2100 parameterized elements: 222 ms 9 elements/ms Last received via parameterized elements: 2099 Last received via IN list: 2199 Elapsed time for 2200 IN list elements: 32 ms 68 elements/ms Elapsed time for 2200 parameterized elements: 243 ms 9 elements/ms Last received via parameterized elements: 2199 Last received via IN list: 2299 Elapsed time for 2300 IN list elements: 32 ms 71 elements/ms Elapsed time for 2300 parameterized elements: 241 ms 9 elements/ms Last received via parameterized elements: 2299 Last received via IN list: 2399 Elapsed time for 2400 IN list elements: 33 ms 72 elements/ms Elapsed time for 2400 parameterized elements: 260 ms 9 elements/ms Last received via parameterized elements: 2399 Last received via IN list: 2499 Elapsed time for 2500 IN list elements: 64 ms 39 elements/ms Elapsed time for 2500 parameterized elements: 275 ms 9 elements/ms Last received via parameterized elements: 2499 Last received via IN list: 2599 Elapsed time for 2600 IN list elements: 35 ms 74 elements/ms Elapsed time for 2600 parameterized elements: 275 ms 9 elements/ms Last received via parameterized elements: 2599 Last received via IN list: 2699 Elapsed time for 2700 IN list elements: 36 ms 75 elements/ms Elapsed time for 2700 parameterized elements: 292 ms 9 elements/ms Last received via parameterized elements: 2699 Last received via IN list: 2799 Elapsed time for 2800 IN list elements: 41 ms 68 elements/ms Elapsed time for 2800 parameterized elements: 308 ms 9 elements/ms Last received via parameterized elements: 2799 Last received via IN 
list: 2899 Elapsed time for 2900 IN list elements: 40 ms 72 elements/ms Elapsed time for 2900 parameterized elements: 341 ms 8 elements/ms Last received via parameterized elements: 2899 Last received via IN list: 2999 Elapsed time for 3000 IN list elements: 42 ms 71 elements/ms Elapsed time for 3000 parameterized elements: 356 ms 8 elements/ms Last received via parameterized elements: 2999 Last received via IN list: 3099 Elapsed time for 3100 IN list elements: 42 ms 73 elements/ms Elapsed time for 3100 parameterized elements: 342 ms 9 elements/ms Last received via parameterized elements: 3099 Last received via IN list: 3199 Elapsed time for 3200 IN list elements: 43 ms 74 elements/ms Elapsed time for 3200 parameterized elements: 345 ms 9 elements/ms Last received via parameterized elements: 3199 Last received via IN list: 3299 Elapsed time for 3300 IN list elements: 43 ms 76 elements/ms Elapsed time for 3300 parameterized elements: 359 ms 9 elements/ms Last received via parameterized elements: 3299 Last received via IN list: 3399 Elapsed time for 3400 IN list elements: 77 ms 44 elements/ms Elapsed time for 3400 parameterized elements: 356 ms 9 elements/ms Last received via parameterized elements: 3399 Last received via IN list: 3499 Elapsed time for 3500 IN list elements: 50 ms 70 elements/ms Elapsed time for 3500 parameterized elements: 370 ms 9 elements/ms Last received via parameterized elements: 3499 Last received via IN list: 3599 Elapsed time for 3600 IN list elements: 84 ms 42 elements/ms Elapsed time for 3600 parameterized elements: 387 ms 9 elements/ms Last received via parameterized elements: 3599 Last received via IN list: 3699 Elapsed time for 3700 IN list elements: 52 ms 71 elements/ms Elapsed time for 3700 parameterized elements: 390 ms 9 elements/ms Last received via parameterized elements: 3699 Last received via IN list: 3799 Elapsed time for 3800 IN list elements: 54 ms 70 elements/ms Elapsed time for 3800 parameterized elements: 407 ms 9 
elements/ms Last received via parameterized elements: 3799 Last received via IN list: 3899 Elapsed time for 3900 IN list elements: 84 ms 46 elements/ms Elapsed time for 3900 parameterized elements: 422 ms 9 elements/ms Last received via parameterized elements: 3899 Last received via IN list: 3999 Elapsed time for 4000 IN list elements: 58 ms 68 elements/ms Elapsed time for 4000 parameterized elements: 434 ms 9 elements/ms Last received via parameterized elements: 3999 Last received via IN list: 4099 Elapsed time for 4100 IN list elements: 58 ms 70 elements/ms Elapsed time for 4100 parameterized elements: 446 ms 9 elements/ms Last received via parameterized elements: 4099 Last received via IN list: 4199 Elapsed time for 4200 IN list elements: 61 ms 68 elements/ms Elapsed time for 4200 parameterized elements: 453 ms 9 elements/ms Last received via parameterized elements: 4199 Last received via IN list: 4299 Elapsed time for 4300 IN list elements: 63 ms 68 elements/ms Elapsed time for 4300 parameterized elements: 457 ms 9 elements/ms Last received via parameterized elements: 4299 Last received via IN list: 4399 Elapsed time for 4400 IN list elements: 94 ms 46 elements/ms Elapsed time for 4400 parameterized elements: 473 ms 9 elements/ms Last received via parameterized elements: 4399 Last received via IN list: 4499 Elapsed time for 4500 IN list elements: 64 ms 70 elements/ms Elapsed time for 4500 parameterized elements: 506 ms 8 elements/ms Last received via parameterized elements: 4499 Last received via IN list: 4599 Elapsed time for 4600 IN list elements: 72 ms 63 elements/ms Elapsed time for 4600 parameterized elements: 527 ms 8 elements/ms Last received via parameterized elements: 4599 Last received via IN list: 4699 Elapsed time for 4700 IN list elements: 68 ms 69 elements/ms Elapsed time for 4700 parameterized elements: 499 ms 9 elements/ms Last received via parameterized elements: 4699 Last received via IN list: 4799 Elapsed time for 4800 IN list elements: 70 ms 
68 elements/ms Elapsed time for 4800 parameterized elements: 539 ms 8 elements/ms Last received via parameterized elements: 4799 Last received via IN list: 4899 Elapsed time for 4900 IN list elements: 100 ms 49 elements/ms Elapsed time for 4900 parameterized elements: 590 ms 8 elements/ms Last received via parameterized elements: 4899 Last received via IN list: 4999 Elapsed time for 5000 IN list elements: 111 ms 45 elements/ms Elapsed time for 5000 parameterized elements: 531 ms 9 elements/ms Last received via parameterized elements: 4999 Last received via IN list: 5099 Elapsed time for 5100 IN list elements: 79 ms 64 elements/ms Elapsed time for 5100 parameterized elements: 542 ms 9 elements/ms Last received via parameterized elements: 5099 Last received via IN list: 5199 Elapsed time for 5200 IN list elements: 108 ms 48 elements/ms Elapsed time for 5200 parameterized elements: 560 ms 9 elements/ms Last received via parameterized elements: 5199 Last received via IN list: 5299 Elapsed time for 5300 IN list elements: 78 ms 67 elements/ms Elapsed time for 5300 parameterized elements: 560 ms 9 elements/ms Last received via parameterized elements: 5299 Last received via IN list: 5399 Elapsed time for 5400 IN list elements: 81 ms 66 elements/ms Elapsed time for 5400 parameterized elements: 572 ms 9 elements/ms Last received via parameterized elements: 5399 Last received via IN list: 5499 Elapsed time for 5500 IN list elements: 80 ms 68 elements/ms Elapsed time for 5500 parameterized elements: 627 ms 8 elements/ms Last received via parameterized elements: 5499 Last received via IN list: 5599 Elapsed time for 5600 IN list elements: 81 ms 69 elements/ms Elapsed time for 5600 parameterized elements: 590 ms 9 elements/ms Last received via parameterized elements: 5599 Last received via IN list: 5699 Elapsed time for 5700 IN list elements: 86 ms 66 elements/ms Elapsed time for 5700 parameterized elements: 606 ms 9 elements/ms Last received via parameterized elements: 5699 Last 
received via IN list: 5799 Elapsed time for 5800 IN list elements: 85 ms 68 elements/ms Elapsed time for 5800 parameterized elements: 663 ms 8 elements/ms Last received via parameterized elements: 5799 Last received via IN list: 5899 Elapsed time for 5900 IN list elements: 83 ms 71 elements/ms Elapsed time for 5900 parameterized elements: 629 ms 9 elements/ms Last received via parameterized elements: 5899 Last received via IN list: 5999 Elapsed time for 6000 IN list elements: 91 ms 65 elements/ms Elapsed time for 6000 parameterized elements: 658 ms 9 elements/ms Last received via parameterized elements: 5999 Last received via IN list: 6099 Elapsed time for 6100 IN list elements: 87 ms 70 elements/ms Elapsed time for 6100 parameterized elements: 702 ms 8 elements/ms Last received via parameterized elements: 6099 Last received via IN list: 6199 Elapsed time for 6200 IN list elements: 88 ms 70 elements/ms Elapsed time for 6200 parameterized elements: 667 ms 9 elements/ms Last received via parameterized elements: 6199 Last received via IN list: 6299 Elapsed time for 6300 IN list elements: 91 ms 69 elements/ms Elapsed time for 6300 parameterized elements: 670 ms 9 elements/ms Last received via parameterized elements: 6299 Last received via IN list: 6399 Elapsed time for 6400 IN list elements: 91 ms 70 elements/ms Elapsed time for 6400 parameterized elements: 708 ms 9 elements/ms Last received via parameterized elements: 6399 Last received via IN list: 6499 Elapsed time for 6500 IN list elements: 96 ms 67 elements/ms Elapsed time for 6500 parameterized elements: 687 ms 9 elements/ms Last received via parameterized elements: 6499 Last received via IN list: 6599 Elapsed time for 6600 IN list elements: 95 ms 69 elements/ms Elapsed time for 6600 parameterized elements: 701 ms 9 elements/ms Last received via parameterized elements: 6599 Last received via IN list: 6699 Elapsed time for 6700 IN list elements: 134 ms 50 elements/ms Elapsed time for 6700 parameterized elements: 
711 ms 9 elements/ms Last received via parameterized elements: 6699 Last received via IN list: 6799 Elapsed time for 6800 IN list elements: 98 ms 69 elements/ms Elapsed time for 6800 parameterized elements: 723 ms 9 elements/ms Last received via parameterized elements: 6799 Last received via IN list: 6899 Elapsed time for 6900 IN list elements: 99 ms 69 elements/ms Elapsed time for 6900 parameterized elements: 728 ms 9 elements/ms Last received via parameterized elements: 6899 Last received via IN list: 6999 Elapsed time for 7000 IN list elements: 141 ms 49 elements/ms Elapsed time for 7000 parameterized elements: 733 ms 9 elements/ms Last received via parameterized elements: 6999 Last received via IN list: 7099 Elapsed time for 7100 IN list elements: 108 ms 65 elements/ms Elapsed time for 7100 parameterized elements: 752 ms 9 elements/ms Last received via parameterized elements: 7099 Last received via IN list: 7199 Elapsed time for 7200 IN list elements: 104 ms 69 elements/ms Elapsed time for 7200 parameterized elements: 769 ms 9 elements/ms Last received via parameterized elements: 7199 Last received via IN list: 7299 Elapsed time for 7300 IN list elements: 142 ms 51 elements/ms Elapsed time for 7300 parameterized elements: 778 ms 9 elements/ms Last received via parameterized elements: 7299 Last received via IN list: 7399 Elapsed time for 7400 IN list elements: 127 ms 58 elements/ms Elapsed time for 7400 parameterized elements: 796 ms 9 elements/ms Last received via parameterized elements: 7399 Last received via IN list: 7499 Elapsed time for 7500 IN list elements: 110 ms 68 elements/ms Elapsed time for 7500 parameterized elements: 796 ms 9 elements/ms Last received via parameterized elements: 7499 Last received via IN list: 7599 Elapsed time for 7600 IN list elements: 144 ms 52 elements/ms Elapsed time for 7600 parameterized elements: 994 ms 7 elements/ms Last received via parameterized elements: 7599 Last received via IN list: 7699 Elapsed time for 7700 IN list 
elements: 122 ms 63 elements/ms Elapsed time for 7700 parameterized elements: 819 ms 9 elements/ms Last received via parameterized elements: 7699 Last received via IN list: 7799 Elapsed time for 7800 IN list elements: 114 ms 68 elements/ms Elapsed time for 7800 parameterized elements: 893 ms 8 elements/ms Last received via parameterized elements: 7799 Last received via IN list: 7899 Elapsed time for 7900 IN list elements: 152 ms 51 elements/ms Elapsed time for 7900 parameterized elements: 841 ms 9 elements/ms Last received via parameterized elements: 7899 Last received via IN list: 7999 Elapsed time for 8000 IN list elements: 119 ms 67 elements/ms Elapsed time for 8000 parameterized elements: 853 ms 9 elements/ms Last received via parameterized elements: 7999 Last received via IN list: 8099 Elapsed time for 8100 IN list elements: 121 ms 66 elements/ms Elapsed time for 8100 parameterized elements: 858 ms 9 elements/ms Last received via parameterized elements: 8099 Last received via IN list: 8199 Elapsed time for 8200 IN list elements: 157 ms 52 elements/ms Elapsed time for 8200 parameterized elements: 871 ms 9 elements/ms Last received via parameterized elements: 8199 Last received via IN list: 8299 Elapsed time for 8300 IN list elements: 131 ms 63 elements/ms Elapsed time for 8300 parameterized elements: 879 ms 9 elements/ms Last received via parameterized elements: 8299 Last received via IN list: 8399 Elapsed time for 8400 IN list elements: 124 ms 67 elements/ms Elapsed time for 8400 parameterized elements: 890 ms 9 elements/ms Last received via parameterized elements: 8399 Last received via IN list: 8499 Elapsed time for 8500 IN list elements: 163 ms 52 elements/ms Elapsed time for 8500 parameterized elements: 904 ms 9 elements/ms Last received via parameterized elements: 8499 Last received via IN list: 8599 Elapsed time for 8600 IN list elements: 131 ms 65 elements/ms Elapsed time for 8600 parameterized elements: 907 ms 9 elements/ms Last received via 
parameterized elements: 8599 Last received via IN list: 8699 Elapsed time for 8700 IN list elements: 127 ms 68 elements/ms Elapsed time for 8700 parameterized elements: 947 ms 9 elements/ms Last received via parameterized elements: 8699 Last received via IN list: 8799 Elapsed time for 8800 IN list elements: 166 ms 53 elements/ms Elapsed time for 8800 parameterized elements: 933 ms 9 elements/ms Last received via parameterized elements: 8799 Last received via IN list: 8899 Elapsed time for 8900 IN list elements: 133 ms 66 elements/ms Elapsed time for 8900 parameterized elements: 947 ms 9 elements/ms Last received via parameterized elements: 8899 Last received via IN list: 8999 Elapsed time for 9000 IN list elements: 142 ms 63 elements/ms Elapsed time for 9000 parameterized elements: 951 ms 9 elements/ms Last received via parameterized elements: 8999 Last received via IN list: 9099 Elapsed time for 9100 IN list elements: 173 ms 52 elements/ms Elapsed time for 9100 parameterized elements: 961 ms 9 elements/ms Last received via parameterized elements: 9099 Last received via IN list: 9199 Elapsed time for 9200 IN list elements: 138 ms 66 elements/ms Elapsed time for 9200 parameterized elements: 979 ms 9 elements/ms Last received via parameterized elements: 9199 Last received via IN list: 9299 Elapsed time for 9300 IN list elements: 135 ms 68 elements/ms Elapsed time for 9300 parameterized elements: 985 ms 9 elements/ms Last received via parameterized elements: 9299 Last received via IN list: 9399 Elapsed time for 9400 IN list elements: 175 ms 53 elements/ms Elapsed time for 9400 parameterized elements: 1000 ms 9 elements/ms Last received via parameterized elements: 9399 Last received via IN list: 9499 Elapsed time for 9500 IN list elements: 136 ms 69 elements/ms Elapsed time for 9500 parameterized elements: 1010 ms 9 elements/ms Last received via parameterized elements: 9499 Last received via IN list: 9599 Elapsed time for 9600 IN list elements: 144 ms 66 elements/ms 
Elapsed time for 9600 parameterized elements: 1017 ms 9 elements/ms
Last received via parameterized elements: 9599
Last received via IN list: 9699
Elapsed time for 9700 IN list elements: 186 ms 52 elements/ms
Elapsed time for 9700 parameterized elements: 1021 ms 9 elements/ms
Last received via parameterized elements: 9699
Last received via IN list: 9799
Elapsed time for 9800 IN list elements: 142 ms 69 elements/ms
Elapsed time for 9800 parameterized elements: 1048 ms 9 elements/ms
Last received via parameterized elements: 9799
Last received via IN list: 9899
Elapsed time for 9900 IN list elements: 145 ms 68 elements/ms
Elapsed time for 9900 parameterized elements: 1054 ms 9 elements/ms
Last received via parameterized elements: 9899

Local-to-local on AMD 1600MHz, PostgreSQL 7.3.2, RH9.0

Table update time: 169xxx ms

*Without* VACUUMDB, IN lists suck beyond measure. For just 10 elements in the IN list:
Elapsed time for 10 IN list elements: 2638 ms 0 elements/ms
Elapsed time for 10 parameterized elements: 9 ms 1 elements/ms

*With* VACUUMDB, IN lists recover a bit:
Elapsed time for 10 IN list elements: 10 ms 1 elements/ms
Elapsed time for 10 parameterized elements: 24 ms 0 elements/ms
[...]
Elapsed time for 99 IN list elements: 16 ms 6 elements/ms
Elapsed time for 99 parameterized elements: 36 ms 2 elements/ms
Last received via parameterized elements: 98

Testing for some larger numbers of elements, however, reveals another story.

Elapsed time for 1000 IN list elements: 676 ms 1 elements/ms
Elapsed time for 1000 parameterized elements: 351 ms 2 elements/ms
Elapsed time for 2000 IN list elements: 3197 ms 0 elements/ms
Elapsed time for 2000 parameterized elements: 713 ms 2 elements/ms
Elapsed time for 3000 IN list elements: 7266 ms 0 elements/ms
Elapsed time for 3000 parameterized elements: 1080 ms 2 elements/ms
Elapsed time for 4000 IN list elements: 13341 ms 0 elements/ms
Elapsed time for 4000 parameterized elements: 1449 ms 2 elements/ms

Based on the by-the-1000's approach, IN lists are on a seriously bad curve, while parameterized statements are linear. The cutoff on my AMD 1600MHz machine is at about 700 elements:

Elapsed time for 600 IN list elements: 157 ms 3 elements/ms
Elapsed time for 600 parameterized elements: 246 ms 2 elements/ms
Elapsed time for 700 IN list elements: 269 ms 2 elements/ms
Elapsed time for 700 parameterized elements: 263 ms 2 elements/ms

Local (Intel PIII-M 1GHz) to REMOTE (AMD 1600MHz), PostgreSQL 7.3.2, 100 -> 1000 by 100's.
IN List Query string length is 637 Last received via IN list: 99 Elapsed time for 100 IN list elements: 22 ms 4 elements/ms Elapsed time for 100 parameterized elements: 117 ms 0 elements/ms Last received via parameterized elements: 99 IN List Query string length is 1237 Last received via IN list: 199 Elapsed time for 200 IN list elements: 39 ms 5 elements/ms Elapsed time for 200 parameterized elements: 169 ms 1 elements/ms Last received via parameterized elements: 199 IN List Query string length is 1837 Last received via IN list: 299 Elapsed time for 300 IN list elements: 81 ms 3 elements/ms Elapsed time for 300 parameterized elements: 136 ms 2 elements/ms Last received via parameterized elements: 299 IN List Query string length is 2437 Last received via IN list: 399 Elapsed time for 400 IN list elements: 108 ms 3 elements/ms Elapsed time for 400 parameterized elements: 157 ms 2 elements/ms Last received via parameterized elements: 399 IN List Query string length is 3037 Last received via IN list: 499 Elapsed time for 500 IN list elements: 99 ms 5 elements/ms Elapsed time for 500 parameterized elements: 232 ms 2 elements/ms Last received via parameterized elements: 499 IN List Query string length is 3637 Last received via IN list: 599 Elapsed time for 600 IN list elements: 150 ms 4 elements/ms Elapsed time for 600 parameterized elements: 182 ms 3 elements/ms Last received via parameterized elements: 599 IN List Query string length is 4237 Last received via IN list: 699 Elapsed time for 700 IN list elements: 256 ms 2 elements/ms Elapsed time for 700 parameterized elements: 219 ms 3 elements/ms Last received via parameterized elements: 699 IN List Query string length is 4837 Last received via IN list: 799 Elapsed time for 800 IN list elements: 370 ms 2 elements/ms Elapsed time for 800 parameterized elements: 245 ms 3 elements/ms Last received via parameterized elements: 799 IN List Query string length is 5437 Last received via IN list: 899 Elapsed time for 900 IN 
list elements: 511 ms 1 elements/ms Elapsed time for 900 parameterized elements: 265 ms 3 elements/ms Last received via parameterized elements: 899 Curiously, the difference between local->local and local->remote here is small. The breakeven point was about 650 elements, versus slightly closer to 700 in local->local. Local(Intel PIII-M 1Ghz)-to-REMOTE(AMD1600Mhz), PostgreSQL 7.3.2 1000 -> 10000 by 1000's. IN List Query string length is 6037 Last received via IN list: 999 Elapsed time for 1000 IN list elements: 772 ms 1 elements/ms Elapsed time for 1000 parameterized elements: 495 ms 2 elements/ms Last received via parameterized elements: 999 IN List Query string length is 12037 Last received via IN list: 1999 Elapsed time for 2000 IN list elements: 3055 ms 0 elements/ms Elapsed time for 2000 parameterized elements: 681 ms 2 elements/ms Last received via parameterized elements: 1999 IN List Query string length is 18037 Last received via IN list: 2999 Elapsed time for 3000 IN list elements: 7205 ms 0 elements/ms Elapsed time for 3000 parameterized elements: 888 ms 3 elements/ms Last received via parameterized elements: 2999 IN List Query string length is 24037 Last received via IN list: 3999 Elapsed time for 4000 IN list elements: 12912 ms 0 elements/ms Elapsed time for 4000 parameterized elements: 1162 ms 3 elements/ms Last received via parameterized elements: 3999 IN List Query string length is 30037 Last received via IN list: 4999 Elapsed time for 5000 IN list elements: 21844 ms 0 elements/ms Elapsed time for 5000 parameterized elements: 1461 ms 3 elements/ms Last received via parameterized elements: 4999 IN List Query string length is 36037 Last received via IN list: 5999 Elapsed time for 6000 IN list elements: 32595 ms 0 elements/ms Elapsed time for 6000 parameterized elements: 1780 ms 3 elements/ms Last received via parameterized elements: 5999 IN List Query string length is 42037 Last received via IN list: 6999 Elapsed time for 7000 IN list elements: 46566 ms 
0 elements/ms
Elapsed time for 7000 parameterized elements: 2027 ms  3 elements/ms
Last received via parameterized elements: 6999

IN List Query string length is 48037
java.sql.SQLException: ERROR:  out of free buffers: time to abort!
        at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:126)
        at org.postgresql.jdbc1.AbstractJdbc1Connection.ExecSQL(AbstractJdbc1Connection.java:451)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.execute(AbstractJdbc1Statement.java:281)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:48)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.executeQuery(AbstractJdbc1Statement.java:144)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.executeQuery(AbstractJdbc1Statement.java:132)
        at INList.main(INList.java:775)
*/

import java.sql.*;

class INList {
    public static void main(String[] args) {
        try {
            final int NROWS = 1000000;   // one million
            final int IDSTART = 10000;   // first ID value

            // ---------------- MySql driver -----------------
            // -classpath .:/usr/java/mysql-connector-java-2.0.14/mysql-connector-java-2.0.14-bin.jar
            // RH9: -classpath .:/home/dave/mysql-connector-java-3.0.8-stable/mysql-connector-java-3.0.8-stable-bin.jar
            //Class.forName("com.mysql.jdbc.Driver");
            //Connection conn = DriverManager.getConnection("jdbc:mysql:///test", "dave", "");
            //String innodb = " TYPE=INNODB";

            // --------------- postgres 7.3.2 distribution JDBC2 driver linked as /usr/java/postgresql-jdbc.jar
            // -classpath .:/usr/java/postgresql-jdbc.jar
            Class.forName("org.postgresql.Driver");
            Connection conn = DriverManager.getConnection("jdbc:postgresql:test", "dave", "");
            String innodb = "";

            conn.setAutoCommit(false);

            long oldmillis, newmillis, elapsed;
            Statement statement = conn.createStatement();
            PreparedStatement pstatement;
            ResultSet rs;
            int lastReceived;

            if (false) {
                statement.executeUpdate("CREATE TABLE MILLION (ID INTEGER PRIMARY KEY, VAL INTEGER)" + innodb);
                conn.commit(); // Read that "all statements of a connection commit/rollback as a group", API certainly implies this
                // conn.commit() ; pstatement should potentially fail because table isn't there without this!

                // Create the rows
                pstatement = conn.prepareStatement("INSERT INTO MILLION (ID, VAL) VALUES(?, ?)");
                oldmillis = System.currentTimeMillis();
                for (int i = 0; i < NROWS; i++) {
                    pstatement.setInt(1, IDSTART + i);
                    pstatement.setInt(2, i);
                    pstatement.executeUpdate();
                }
                conn.commit();
                newmillis = System.currentTimeMillis();
                pstatement.close();
                System.out.println("Table update time: " + (newmillis - oldmillis) + " ms");

                rs = statement.executeQuery("SELECT COUNT(*) FROM MILLION");
                rs.first();
                if (rs.getInt(1) != NROWS)
                    System.out.println("**** Expected " + (NROWS) + " rows, got " + rs.getInt(1));
                conn.commit();

                // Determine largest IN LIST
                StringBuffer sb = new StringBuffer(8000 * 1024);
                boolean successful = false;
                int tryAmount = NROWS;
                while (!successful) {
                    sb.setLength(0);
                    sb.append("SELECT MAX(VAL) FROM MILLION WHERE ID IN (");
                    for (int i = 0; i < tryAmount; i++) {
                        sb.append(i + IDSTART);
                        if (i < tryAmount - 1)
                            sb.append(',');
                    }
                    sb.append(')');
                    try {
                        rs = statement.executeQuery(sb.toString());
                        successful = true;
                    } catch (SQLException e) {
                        System.out.println("IN list failed for " + tryAmount +
                                           " entries, buffer size was " + sb.length());
                        tryAmount -= 10000;
                        if (tryAmount <= 0)
                            break; // be done with the while loop
                    }
                } // while (!successful)
                conn.commit(); // more for the next txn than this one, eliminate txn caching
                System.out.println("Largest IN list: " + tryAmount + " elements.");
            } // if (false)
            else {
                for (int tryAmount = 100; tryAmount < 1000; tryAmount += 100) {
                    StringBuffer sb = new StringBuffer(tryAmount * 20);
                    sb.append("SELECT VAL FROM MILLION WHERE ID IN (");
                    for (int i = 0; i < tryAmount; i++) {
                        sb.append(i + IDSTART);
                        if (i + 1 < tryAmount)
                            sb.append(',');
                    }
                    sb.append(')');

                    // Try query with largest inlist, getting resultset with all values
                    String query = sb.toString();
                    System.out.println("IN List Query string length is " + query.length());
                    //query = query.replaceFirst("MAX(VAL)", "VAL");
                    oldmillis = System.currentTimeMillis();
                    rs = statement.executeQuery(query);
                    lastReceived = -1;
                    while (rs.next())
                        lastReceived = rs.getInt(1);
                    System.out.println("Last received via IN list: " + lastReceived);
                    conn.commit();
                    newmillis = System.currentTimeMillis();
                    elapsed = newmillis - oldmillis;
                    System.out.println("Elapsed time for " + tryAmount + " IN list elements: " + elapsed + " ms");
                    System.out.println((tryAmount / elapsed) + " elements/ms");

                    // Try the query with prepared statements
                    lastReceived = -1;
                    pstatement = conn.prepareStatement("SELECT VAL FROM MILLION WHERE ID = ?");
                    oldmillis = System.currentTimeMillis();
                    for (int i = 0; i < tryAmount; i++) {
                        pstatement.setInt(1, i + IDSTART);
                        rs = pstatement.executeQuery();
                        if (!rs.first())
                            System.out.println("Pstatement query failed");
                        else
                            lastReceived = rs.getInt(1);
                    }
                    conn.commit();
                    newmillis = System.currentTimeMillis();
                    pstatement.close();
                    elapsed = newmillis - oldmillis;
                    System.out.println("Elapsed time for " + tryAmount + " parameterized elements: " + elapsed + " ms");
                    System.out.println((tryAmount / elapsed) + " elements/ms");
                    System.out.println("Last received via parameterized elements: " + lastReceived);
                } // for (tryAmount ...)
            } // else (if (false))

            // Done, delete our table
            // statement.executeUpdate("DROP TABLE MILLION");
            conn.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    } // main()
} // class INList
On Wednesday 28 May 2003 18:21, Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)

How do you derive this list of numbers? If it is from the same database, can you rewrite the query using a join statement?

HTH
Shridhar
Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists. I ran
> an experiment that ran queries on a table of two integers (ID, VAL),
> where ID is a primary key and the subject of IN list predicates. The
> test used a table with one million rows. ID is appropriately indexed,
> and I have VACUUMED/analyzed the database after table load.
>
> I ran tests on in-lists from about 100 to 100,000 entries.

Hi Dave,

it sounds as if that IN-list is created by the application. I wonder if there are really so many variances and combinations of it or whether you could invent an additional column, which groups all those individual values. If possible, you could reduce your IN list to much fewer values, and probably would get better performance (using an index on that col, of course).

Regards,
Andreas
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

You should rewrite your query if the query is created from an application:

SELECT val FROM table
WHERE id BETWEEN 43 AND 100002
  AND id IN (43, 49, 1001, 100002, ...)

where 43 is the min and 100002 the max of all values. I had this case with postgresql 7.2 and the planner made much smarter choices in my case.

Regards,
Mario Weilguni
On Wed, 28 May 2003, Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

It gets converted into a sequence like col=list[0] or col=list[1], and it seems the planner/optimizer is taking at least a large amount of time for me, given that explain takes just over 80 seconds for a 9900-item list on my machine (I don't have a data-filled table to run the actual query against). The best plan may be generated right now from making a temporary table, copying the values into it, and joining.

> 2) What is the expected acceptable limit for the number of items in an
> IN list predicate such as those used here. (List of constants, not
> subselects).

As a note, 7.4 by default seems to limit it to 10000 unless you up max_expr_depth afaics.
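The temporary-table idea above can be sketched concretely. The following is a hypothetical client-side helper, not code from the thread: the `temp_ids` table name is invented, the `million` table follows the attached test program, and on 7.3-era servers each row needs its own INSERT (COPY would be faster for bulk loads). It only builds the SQL a client would send:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the temp-table workaround: load the constants
// into a scratch table and join against it instead of using an IN list.
// temp_ids is an invented name; million follows the test program.
public class TempTableSketch {
    // Returns the sequence of statements a client would issue.
    public static List<String> buildStatements(int[] ids) {
        List<String> stmts = new ArrayList<String>();
        stmts.add("CREATE TEMP TABLE temp_ids (id INTEGER)");
        for (int id : ids) {
            // One INSERT per value; multi-row VALUES lists arrived much later.
            stmts.add("INSERT INTO temp_ids VALUES (" + id + ")");
        }
        stmts.add("SELECT m.val FROM million m, temp_ids t WHERE m.id = t.id");
        stmts.add("DROP TABLE temp_ids");
        return stmts;
    }
}
```

Whether the round trips for the inserts outweigh the planner savings is exactly the cost question Dave raises later in the thread.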
A join isn't an option; these elements come as a selection of entity IDs that are specific to some client context.
Some other people suggested joins too.
Consider it something like this: say there's a database that represents the entire file system content
of a set of machines, hundreds of thousands of files. A single user wants to do something
related to the IDs of 3000 files. The requests for those 3000 files can be built up in a number of ways,
not all of which rely on data in the database. So I need to retrieve data on those 3000 files using IN lists or some alternative.
Dave
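Given the breakeven of roughly 700 elements measured above, one hypothetical workaround (not proposed in the thread) would be to split the 3000 IDs into fixed-size chunks, issue one bounded IN-list query per chunk, and merge the results client-side. Table and column names follow the attached test program; the chunk size would need tuning per installation. A sketch of the query generation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep each IN list under the measured breakeven by
// chunking the ID set into several smaller queries. Names follow the
// test program (million / id / val); chunkSize is a tuning assumption.
public class ChunkedInList {
    public static List<String> buildQueries(int[] ids, int chunkSize) {
        List<String> queries = new ArrayList<String>();
        for (int start = 0; start < ids.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, ids.length);
            StringBuilder sb = new StringBuilder("SELECT val FROM million WHERE id IN (");
            for (int i = start; i < end; i++) {
                if (i > start) sb.append(',');
                sb.append(ids[i]);
            }
            sb.append(')');
            queries.add(sb.toString());
        }
        return queries;
    }
}
```

This trades a few extra round trips for staying on the linear part of the planner's cost curve, and also sidesteps any query-buffer size limits.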
Shridhar Daithankar wrote:
> How do you derive this list of numbers? If it is from the same
> database, can you rewrite the query using a join statement?

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Andreas Pflug wrote:
> it sounds as if that IN-list is created by the application. I wonder
> if there are really so many variances and combinations of it or
> whether you could invent an additional column, which groups all those
> individual values. If possible, you could reduce your IN list to much
> fewer values, and probably would get better performance (using an
> index on that col, of course).

There are over 50 tables in the schema, and dozens of client commands that manipulate the schema in a persistent-checkout kind of way over time, as well as spurious reporting requests that require incredibly complex filtering and combination of data from many tables. I'm pretty much up to my keister in data (and am resisting impulses for denormalization), so this approach probably isn't viable for me.

Now I *could* create a temporary table with the group of values, but I suspect the cost of that substantially outweighs the negative performance of current IN lists or parameterized statements.
I'm reminded to relay to the PostgreSQL devs that I might be able to do more in the join or subquery department if PostgreSQL had better performing MAX functions and a FIRST function for selecting rows from groups. ("Performing" being the operative word here, since the extensible architecture of PostgreSQL currently makes for poorly performing MAX capabilities and presumably similar user-defined aggregate functions.)
Mario Weilguni wrote:
> You should rewrite your query if the query is created from an application:
>
> SELECT val FROM table
> WHERE id BETWEEN 43 AND 100002
>   AND id IN (43, 49, 1001, 100002, ...)
>
> where 43 is the min and 100002 the max of all values.

Very interesting! I tried it out, but it didn't appreciably change the thresholds in my results for IN list
sizes 100 - 1000. It's also likely to be of use only if the range for the between is fairly restricted,
which isn't necessarily characteristic of my data.

Dave
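Mario's rewrite can be generated mechanically from the constant list. A minimal sketch (table and column names follow the attached test program; this only builds the query text):

```java
// Sketch of Mario's suggestion: prepend a BETWEEN on the min and max of
// the constants so the planner can range-restrict the index before the
// IN list is checked. Assumes a non-empty id array.
public class BetweenInList {
    public static String buildQuery(int[] ids) {
        int min = ids[0], max = ids[0];
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < ids.length; i++) {
            min = Math.min(min, ids[i]);
            max = Math.max(max, ids[i]);
            if (i > 0) in.append(',');
            in.append(ids[i]);
        }
        return "SELECT val FROM million WHERE id BETWEEN " + min + " AND " + max
             + " AND id IN (" + in + ")";
    }
}
```

As Dave notes above, this only helps when the constants cluster into a fairly narrow range.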
> I'm reminded to relay to the PostgreSQL devs that I might be able to do
> more in the join or subquery department if PostgreSQL had better
> performing MAX functions and a FIRST function for selecting rows from
> groups.

MIN/MAX is almost in every case replaceable:

select bar from foo order by bar limit 1;

instead of

select max(bar) from foo;
On Wed, May 28, 2003 at 14:08:02 -0400, Dave Tenny <tenny@attbi.com> wrote: > Andreas Pflug wrote: > > I'm reminded to relay to the PostgreSQL devos that I might be able to do > more in the join or subquery department if > PostgreSQL had better performing MAX functions and a FIRST function for > selecting rows from groups. > ("Performing" being the operative word here, since the extensible > architecture of PostgreSQL currently makes for poorly > performing MAX capabilities and presumably similar user defined > aggregate functions). Have you tried replacing max with a subselect that uses order by and limit?
On Wed, May 28, 2003 at 13:58:14 -0400, Dave Tenny <tenny@attbi.com> wrote:
> A join isn't an option; these elements come as a selection of entity IDs
> that are specific to some client context.
> Some other people suggested joins too.

You can union the values together and then join (or use where exists) with the result. This may not be faster and you may not be able to union several thousand selects together in a single statement. But it shouldn't be too much work to test it out.
Mario Weilguni wrote:
> MIN/MAX is almost in every case replaceable:
>
> select bar from foo order by bar limit 1;
>
> instead of
>
> select max(bar) from foo;

Yup, been there, done that, but thanks, it's a good tidbit for the PostgreSQL unaware. There are some places however where it doesn't work well in query logic, though I don't have an example off the top of my head since I've worked around it in all my queries.
Bruno Wolff III wrote:
> Have you tried replacing max with a subselect that uses order by and
> limit?

I'm uncertain how that would work, since somewhere in there I still need to elaborate on the 1000 items I want, and they're not necessarily in any particular range, nor do they bear any contiguous group nature.

Also, IN (subquery) is a known performance problem in PGSQL, at least if the subquery is going to return many rows. It's too bad, since I'm rather fond of subqueries, but I avoid them like the plague in PostgreSQL.

Perhaps I don't understand what you had in mind.
Bruno Wolff III wrote:
> You can union the values together and then join (or use where exists)
> with the result. This may not be faster and you may not be able to
> union several thousand selects together in a single statement. But it
> shouldn't be too much work to test it out.

I assume you mean something like:

test=# select million.id, million.val from million, (select 10000 as a union select 20000 as a) t2 where million.id = t2.a;
  id   |  val
-------+-------
 10000 |     0
 20000 | 10000
(2 rows)

Ouch! That's deviant. Haven't tried it yet and I cringe at the thought of it, but I might take a run at it. However that's going to run up the buffer space quickly. That was one of my as yet unanswered questions: what is the pragmatic buffer size limit for queries?

I'm really hoping we'll come up with something better, like an understanding of why IN lists are non-linear in the first place when the column is indexed, and whether it's fixable through some other means or whether it's a bug that should be fixed. After all, I'm trying to support multiple databases, and other databases kick butt on this. It's just PostgreSQL that's having difficulty.

(Btw, I've also tried statement batching, but that's a loss for now, at least with the current JDBC drivers and 7.3.2.)
On Wed, May 28, 2003 at 16:01:34 -0400, Dave Tenny <tenny@attbi.com> wrote:
> Perhaps I don't understand what you had in mind.

I was referring to your comment about max causing problems. But it seems you are aware of the standard workaround.
On Wed, May 28, 2003 at 16:13:17 -0400, Dave Tenny <tenny@attbi.com> wrote:
> Ouch! That's deviant. Haven't tried it yet and I cringe at the thought
> of it, but I might take a run at it. However that's going to run up the
> buffer space quickly. That was one of my as yet unanswered questions:
> what is the pragmatic buffer size limit for queries?

That is what I was referring to. I have used this in some cases where I knew the list was small and I wanted to do a set difference without loading a temporary table, or to do an insert of multiple rows with one insert statement.

> I'm /really/ hoping we'll come up with something better, like an
> understanding of why IN lists are non-linear in the first place when
> the column is indexed, and whether it's fixable through some other
> means or whether it's a bug that should be fixed.

It also might be worth seeing if the development version is going to speed things up for you. Beta is one month away. My guess is that the production release will be in September.
Dave Tenny <tenny@attbi.com> writes:
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
> 1) PostgreSQL exhibits worse-than-linear performance behavior with
> respect to IN list size.

Yeah. There are a couple of places in the planner that have O(N^2) behavior on sufficiently large WHERE clauses, due to building lists in a naive way (repeated lappend() operations). The inner loop is very tight, but nonetheless when you do it tens of millions of times it adds up :-(

I have just committed some fixes into CVS tip for this --- I see about a 10x speedup in planning time on test cases involving 10000-OR-item WHERE clauses. We looked at this once before; the test cases I'm using actually date back to Jan 2000. But it seems some slowness has crept in due to subsequent planning improvements.

> 4) God help you if you haven't vacuum/analyzed the newly loaded table.

Without knowledge that the id field is unique, the planner is likely to tilt away from an indexscan with not too many IN items. I don't consider this a bug.

> PostgreSQL craps out trying to process 8000 elements with the error:
> out of free buffers: time to abort!

This is a known bug in 7.3.2; it's fixed in 7.3.3.

			regards, tom lane
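The repeated-lappend() cost Tom describes can be illustrated with a toy model (this is not the planner code, just an analogy): appending to the tail of a singly-linked list requires walking the whole list each time, so the total number of node visits for N appends grows quadratically even though each individual step is trivial.

```java
// Toy model of naive tail-append on a singly-linked list, the pattern
// behind the O(N^2) planner behavior described above. Counts how many
// next-links must be traversed to build a list of n items.
public class LappendDemo {
    static class Node {
        int v;
        Node next;
        Node(int v) { this.v = v; }
    }

    public static long naiveAppendTraversals(int n) {
        Node head = null;
        long visits = 0;
        for (int i = 0; i < n; i++) {
            Node fresh = new Node(i);
            if (head == null) { head = fresh; continue; }
            Node cur = head;
            while (cur.next != null) { cur = cur.next; visits++; }  // walk to tail
            cur.next = fresh;
        }
        return visits;  // (n-2)(n-1)/2 for n >= 2
    }
}
```

Ten times the items costs roughly a hundred times the traversals, which matches the worse-than-linear timings in the benchmark. (Keeping a tail pointer, or the batched list-building Tom committed, brings this back to linear.)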
Tom Lane wrote:
> Yeah. There are a couple of places in the planner that have O(N^2)
> behavior on sufficiently large WHERE clauses, due to building lists in
> a naive way (repeated lappend() operations). The inner loop is very
> tight, but nonetheless when you do it tens of millions of times it
> adds up :-(

It was also showing up very clearly in the timings I attached for just a couple hundred IN list entries too. Does that mean that the time for the O(1) portions of the logic are perhaps also in need of a tuneup? (I would think O(N^2) for trivial operations on a fast machine wouldn't manifest quite so much for smallish N.)

> I have just committed some fixes into CVS tip for this --- I see about
> a 10x speedup in planning time on test cases involving 10000-OR-item
> WHERE clauses.

So what version might that equate to down the road, so I can be sure to check it out? I'm afraid I'm not up to testing CVS tips.

> Without knowledge that the id field is unique, the planner is likely
> to tilt away from an indexscan with not too many IN items. I don't
> consider this a bug.

There is one very interesting thing in my test case though. It certainly *seemed* as if the parameterized statements were successfully using the index of the freshly-created-but-unanalyzed table, or else the times on those queries would have been terrible too. It was only the IN list form of query that wasn't making correct use of the index. How can the planner recognize uniqueness for one case but not the other? (Since I ran both forms of query on the same table with and without vacuuming.)

> This is a known bug in 7.3.2; it's fixed in 7.3.3.

Thanks, that's good to know.

Dave
Dave Tenny <tenny@attbi.com> writes:
> There is one very interesting thing in my test case though. It
> certainly *seemed* as if the parameterized statements were successfully
> using the index of the freshly-created-but-unanalyzed table, or else
> the times on those queries would have been terrible too. It was only
> the IN list form of query that wasn't making correct use of the index.
> How can the planner recognize uniqueness for one case but not the
> other?

The question is whether a seqscan will be faster than an indexscan; at some point there's less I/O involved to just scan the table once. If the planner doesn't know the index is unique then it's going to estimate a higher cost for the indexscan (due to more rows fetched) and there is some number of rows at which it will flip over to a seqscan. The same will happen even if it *does* know the index is unique, it's just that it will take more IN elements to make it happen.

This is reasonable behavior IMHO, although whether the flip-over point is anywhere near the actual breakeven point on your hardware is anyone's guess. The cost estimates are often far enough off that it's not very close.

			regards, tom lane
The IN list processing has been fixed in 7.4 CVS. It now uses a hash-based lookup rather than a list, so it's vastly faster.
Chris
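As a loose analogy for Chris's point (this is not the server implementation): checking each row's id against a chain of OR comparisons costs time proportional to the list length per row, while hashing the constants once makes each membership probe effectively constant-time.

```java
import java.util.HashSet;
import java.util.Set;

// Analogy for the 7.4 change: build a hash set of the IN-list constants
// once, then probe it per row, instead of scanning the whole list for
// every row. Method and names are illustrative only.
public class HashLookupDemo {
    public static int matchesViaHash(int[] rowIds, int[] inList) {
        Set<Integer> probe = new HashSet<Integer>();
        for (int v : inList) probe.add(v);   // O(list size), done once
        int hits = 0;
        for (int id : rowIds) {              // O(1) expected per row
            if (probe.contains(id)) hits++;
        }
        return hits;
    }
}
```

The work drops from O(rows x list size) to O(rows + list size), which is why the fix makes large IN lists "vastly faster".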
> Also, IN (subquery) is a known performance problem in PGSQL, at least
> if the subquery is going to return many rows.
> It's too bad, since I'm rather fond of subqueries, but I avoid them
> like the plague in PostgreSQL.

You're not really using a subquery - really just a long list of integers. Subqueries are lightning fast, so long as you convert to the EXISTS form:

SELECT * FROM tab WHERE id IN (SELECT id2 FROM tab2);

converts to:

SELECT * FROM tab WHERE EXISTS (SELECT id2 FROM tab2 WHERE id2=id);

Chris
On Wed, May 28, 2003 at 21:19:56 -0400, Dave Tenny <tenny@attbi.com> wrote:
> So what version might that equate to down the road, so I can be sure to
> check it out? I'm afraid I'm not up to testing CVS tips.

7.4. It is supposed to go into beta July 1. If you are used to installing from source tarballs, you can grab a development snapshot tarball. There is probably a day or two delay depending on if you get it from a mirror or the primary site.
Christopher Kings-Lynne wrote:
> You're not really using a subquery - really just a long list of
> integers. Subqueries are lightning fast, so long as you convert to the
> EXISTS form:
>
> SELECT * FROM tab WHERE id IN (SELECT id2 FROM tab2);
>
> converts to:
>
> SELECT * FROM tab WHERE EXISTS (SELECT id2 FROM tab2 WHERE id2=id);

Oops, you got that out of context; it was a different piece of conversation about subqueries in the IN predicate, not the scalar forms that were my overall discussion point. You're right, I'm using lists of integers; someone else was suggesting using subqueries in some context and I was responding to that.

I hadn't thought of that, it's an excellent tip. I'll have to remember it next time I want to use subqueries. (Again, it's a side topic; my primary concern is scalar-form IN lists.)

Thanks,
Dave