Thread: IN list processing performance (yet again)
Having grepped the web, it's clear that this isn't the first or last time this issue will be raised. My application relies heavily on IN lists. The lists are primarily constant integers, so queries look like:

SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)

Performance is critical, and the size of these lists depends a lot on how the larger 3-tier application is used, but it wouldn't be out of the question to retrieve 3000-10000 items. PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

I ran an experiment with queries on a table of two integers (ID, VAL), where ID is a primary key and the subject of the IN list predicates. The test used a table with one million rows. ID is appropriately indexed, and I VACUUM/analyzed the database after the table load. I ran tests on IN lists from about 100 to 100,000 entries. I also ran tests where I picked the rows out one by one using parameterized statements, e.g.:

SELECT val FROM table WHERE id = ?

I'm trying to decide how much I should use parameterized statements and when to work around buffer size limitations in JDBC transport, general query processing, etc. So here are my questions as a prelude to the data noted below:

1) What is the acceptable limit of JDBC Statement buffer sizes for executeQuery()? (Presumably it's not really a JDBC question, but a PostgreSQL query/buffering/transport question.)

2) What is the expected acceptable limit for the number of items in an IN list predicate such as those used here? (Lists of constants, not subselects.)

3) What should I expect for performance capabilities/limitations from PostgreSQL on this type of problem? (Set my expectations; perhaps they're unrealistic.)

4) What are my alternatives for picking specific rows for thousands of elements with sub-second response times, if not IN lists? (This is crucial!)
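For concreteness, here is roughly what the two access patterns under test look like in JDBC. This is a minimal sketch, not the exact test code (the full program is attached below); it uses my test table's names (MILLION, ID, VAL).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class InListQueryShapes
{
    // Pattern 1: one big IN list of integer constants, one round trip.
    static String buildInListQuery(int[] ids)
    {
        StringBuilder sb = new StringBuilder("SELECT VAL FROM MILLION WHERE ID IN (");
        for (int i = 0; i < ids.length; i++)
        {
            if (i > 0) sb.append(',');
            sb.append(ids[i]);
        }
        return sb.append(')').toString();
    }

    // Pattern 2: one parameterized statement, executed once per ID
    // (N round trips, but the plan is reused).
    static void fetchOneByOne(Connection conn, int[] ids) throws Exception
    {
        PreparedStatement ps = conn.prepareStatement("SELECT VAL FROM MILLION WHERE ID = ?");
        for (int id : ids)
        {
            ps.setInt(1, id);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) { /* consume rs.getInt(1) */ }
            rs.close();
        }
        ps.close();
    }
}
```

buildInListQuery(new int[]{43, 49, 1001}) yields "SELECT VAL FROM MILLION WHERE ID IN (43,49,1001)".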
---------------------------------------------
Here is a summary of my observations of query times; I've attached the test program (more on that below).

1) PostgreSQL exhibits worse-than-linear performance behavior with respect to IN list size. This is bad.

2) Parameterized statements exhibit the expected linear performance characteristics.

3) The break-even point for using IN lists vs. parameterized statements in my environment (RedHat Linux 9.0, PostgreSQL 7.3.2, 512MB memory, 7200RPM 100UDMA IDE disks, AMD 1600MHz) is about 700 items in the IN list. Beyond that, the IN list scalability curve makes it impractical.

4) God help you if you haven't vacuum/analyzed the newly loaded table. Without this, IN list processing is even worse! For just 10 elements in the IN list:

*Without* VACUUMDB, IN lists suck beyond measure:
Elapsed time for 10 IN list elements: 2638 ms
Elapsed time for 10 parameterized elements: 9 ms

*With* VACUUMDB, IN lists recover a bit:
Elapsed time for 10 IN list elements: 10 ms
Elapsed time for 10 parameterized elements: 24 ms

However, it's VERY interesting to note that parameterized statements worked well either way, which implies a probable disparity in plan generation. It's worth noting that it didn't *feel* like I was getting the same delay when I ran the query from the 'psql' client, but since it doesn't report times I can't be sure.

5) The rest of my results are with a vacuumed database (and are listed in the attached program in detail). The interesting points are:

For an IN list of 700 elements:
MySQL 3.23.56 (with INNODB tables) takes 19 ms, 73 ms with parameterized statements.
PostgreSQL takes 269 ms, 263 ms with parameterized statements.

For larger lists, MySQL happily processed a 90,000-element IN list in 1449 ms, 9283 ms using parameterized statements. PostgreSQL craps out trying to process 8000 elements with the error:

out of free buffers: time to abort!

PostgreSQL takes 45,566 ms for 7000 elements in an IN list (ouch!), and 2027 ms for a parameterized statement.
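As one possible partial answer to my own question 4: since the curve only goes bad past the break-even point, a workaround I'm considering is to split a large ID set into chunks below that point and issue one IN-list query per chunk. A sketch, untested against the numbers above (the 500-element chunk size is just a guess comfortably under the ~700 break-even):

```java
import java.util.ArrayList;
import java.util.List;

class InListChunker
{
    // Split a large ID set into pieces of at most maxPerQuery elements, so
    // each generated IN list stays on the linear part of the performance
    // curve. The caller issues one IN-list query per returned chunk.
    static List<int[]> chunk(int[] ids, int maxPerQuery)
    {
        List<int[]> chunks = new ArrayList<int[]>();
        for (int start = 0; start < ids.length; start += maxPerQuery)
        {
            int len = Math.min(maxPerQuery, ids.length - start);
            int[] piece = new int[len];
            System.arraycopy(ids, start, piece, 0, len);
            chunks.add(piece);
        }
        return chunks;
    }
}
```

E.g., 10,000 IDs with a 500-element chunk size become 20 queries; whether 20 round trips of 500 beats 10,000 parameterized executions is exactly the kind of thing the attached program could be extended to measure.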
MySQL easily beats those PostgreSQL times with 10 times the data.

6) Using a remote client on the LAN (10/100) to run the client piece on a separate machine from the database server yielded little change in the results. Prepared statements worked pretty well even with actual wire latency, surprise! (Of course there's very little latency on my LAN; it's not like, say, the database server running in a different city, which customers have been known to do.)

The MySQL and PostgreSQL installations are the default RedHat 9.0 distribution packages; I haven't tweaked either's installation parameters. (Though MySQL was updated by RedHat from 3.23.54a to 3.23.56 as part of an automatic upgrade.)

My goal here isn't to say "why aren't you like MySQL". I've been using PostgreSQL for a year as the development database of choice with satisfactory results. But I won't be able to support customers who want to use PostgreSQL on large deployments of my products if I can't selectively retrieve data for several thousand elements in sub-second times. (PostgreSQL devos, if you want a feel-good bullet, note that I don't support MySQL at all, since lack of MVCC transactions is a showstopper from a multi-user performance standpoint.)

So I'm looking for (a) solutions and answers to my questions above, and (b) a statement of "we're on this" or "you're stuck with it" from PostgreSQL devos who know.

----------------------------------------------------------------
On the attached program (a Java JDBC program): sorry, you can't just run it "as is". Search for formfeeds (^L) if you want to skip all the result data I logged. Compilation is straightforward; simply supply the location of your JDBC jar file for compiling and running (mine is noted in the program). First, move the "if (false)" element below the table creation statements and run the program to create the table. Then VACUUM/analyze your database (suuuuure would be nice not to have to vacuum).
Then move the "if (false)" element above the table creation so you won't have to do it every time. Then move past the formfeed and adjust the 'tryAmount' for loop to test the amounts you're interested in; 100 to 1000 by 100's is a good starting point. Ignore the section (part of the "if (false)" logic) that attempts to figure out what the largest query size is, unless you want to reproduce a hung postmaster in a CPU loop, which is what I got when I tried to run that logic. (Though when I ran it I was missing the closing ')' in my IN list, which I've since added to that code.)

Thanks for any help!

Dave

import java.sql.* ;

/*
Query huge in-lists; test for buffer overflow and performance using in-lists
vs. parameterized statements, and perhaps batched statements (not yet).

Results:

MySQL blows PostgreSQL away here, and that's AFTER doing a VACUUMDB in
PostgreSQL. PostgreSQL does OK with VACUUMDB up to 700 entries; beyond that
the prepared statements are better.
MySQL: IN lists are always superior, to any practical limit (tested to
100,000 elements).

Local-to-local on AMD 1600MHz, MySQL 3.23.56 with JDBC driver 3.0.8
[MySQL was much better here...]
Table update time: 109927 ms For 10,000 -> 100,000 by 10,000 Elapsed time for 90000 IN list elements: 1449 ms 62 elements/ms Elapsed time for 90000 parameterized elements: 9283 ms 9 elements/ms For 100 -> 10,000 by 100 Last received via IN list: 99 Elapsed time for 100 IN list elements: 6 ms 16 elements/ms Elapsed time for 100 parameterized elements: 20 ms 5 elements/ms Last received via parameterized elements: 99 Last received via IN list: 199 Elapsed time for 200 IN list elements: 7 ms 28 elements/ms Elapsed time for 200 parameterized elements: 45 ms 4 elements/ms Last received via parameterized elements: 199 Last received via IN list: 299 Elapsed time for 300 IN list elements: 22 ms 13 elements/ms Elapsed time for 300 parameterized elements: 64 ms 4 elements/ms Last received via parameterized elements: 299 Last received via IN list: 399 Elapsed time for 400 IN list elements: 39 ms 10 elements/ms Elapsed time for 400 parameterized elements: 63 ms 6 elements/ms Last received via parameterized elements: 399 Last received via IN list: 499 Elapsed time for 500 IN list elements: 12 ms 41 elements/ms Elapsed time for 500 parameterized elements: 116 ms 4 elements/ms Last received via parameterized elements: 499 Last received via IN list: 599 Elapsed time for 600 IN list elements: 8 ms 75 elements/ms Elapsed time for 600 parameterized elements: 65 ms 9 elements/ms Last received via parameterized elements: 599 Last received via IN list: 699 Elapsed time for 700 IN list elements: 19 ms 36 elements/ms Elapsed time for 700 parameterized elements: 73 ms 9 elements/ms Last received via parameterized elements: 699 Last received via IN list: 799 Elapsed time for 800 IN list elements: 12 ms 66 elements/ms Elapsed time for 800 parameterized elements: 84 ms 9 elements/ms Last received via parameterized elements: 799 Last received via IN list: 899 Elapsed time for 900 IN list elements: 12 ms 75 elements/ms Elapsed time for 900 parameterized elements: 97 ms 9 elements/ms Last 
received via parameterized elements: 899 Last received via IN list: 999 Elapsed time for 1000 IN list elements: 15 ms 66 elements/ms Elapsed time for 1000 parameterized elements: 109 ms 9 elements/ms Last received via parameterized elements: 999 Last received via IN list: 1099 Elapsed time for 1100 IN list elements: 16 ms 68 elements/ms Elapsed time for 1100 parameterized elements: 115 ms 9 elements/ms Last received via parameterized elements: 1099 Last received via IN list: 1199 Elapsed time for 1200 IN list elements: 16 ms 75 elements/ms Elapsed time for 1200 parameterized elements: 138 ms 8 elements/ms Last received via parameterized elements: 1199 Last received via IN list: 1299 Elapsed time for 1300 IN list elements: 19 ms 68 elements/ms Elapsed time for 1300 parameterized elements: 136 ms 9 elements/ms Last received via parameterized elements: 1299 Last received via IN list: 1399 Elapsed time for 1400 IN list elements: 21 ms 66 elements/ms Elapsed time for 1400 parameterized elements: 150 ms 9 elements/ms Last received via parameterized elements: 1399 Last received via IN list: 1499 Elapsed time for 1500 IN list elements: 21 ms 71 elements/ms Elapsed time for 1500 parameterized elements: 158 ms 9 elements/ms Last received via parameterized elements: 1499 Last received via IN list: 1599 Elapsed time for 1600 IN list elements: 22 ms 72 elements/ms Elapsed time for 1600 parameterized elements: 178 ms 8 elements/ms Last received via parameterized elements: 1599 Last received via IN list: 1699 Elapsed time for 1700 IN list elements: 22 ms 77 elements/ms Elapsed time for 1700 parameterized elements: 180 ms 9 elements/ms Last received via parameterized elements: 1699 Last received via IN list: 1799 Elapsed time for 1800 IN list elements: 26 ms 69 elements/ms Elapsed time for 1800 parameterized elements: 205 ms 8 elements/ms Last received via parameterized elements: 1799 Last received via IN list: 1899 Elapsed time for 1900 IN list elements: 57 ms 33 elements/ms 
Elapsed time for 1900 parameterized elements: 198 ms 9 elements/ms Last received via parameterized elements: 1899 Last received via IN list: 1999 Elapsed time for 2000 IN list elements: 29 ms 68 elements/ms Elapsed time for 2000 parameterized elements: 215 ms 9 elements/ms Last received via parameterized elements: 1999 Last received via IN list: 2099 Elapsed time for 2100 IN list elements: 30 ms 70 elements/ms Elapsed time for 2100 parameterized elements: 222 ms 9 elements/ms Last received via parameterized elements: 2099 Last received via IN list: 2199 Elapsed time for 2200 IN list elements: 32 ms 68 elements/ms Elapsed time for 2200 parameterized elements: 243 ms 9 elements/ms Last received via parameterized elements: 2199 Last received via IN list: 2299 Elapsed time for 2300 IN list elements: 32 ms 71 elements/ms Elapsed time for 2300 parameterized elements: 241 ms 9 elements/ms Last received via parameterized elements: 2299 Last received via IN list: 2399 Elapsed time for 2400 IN list elements: 33 ms 72 elements/ms Elapsed time for 2400 parameterized elements: 260 ms 9 elements/ms Last received via parameterized elements: 2399 Last received via IN list: 2499 Elapsed time for 2500 IN list elements: 64 ms 39 elements/ms Elapsed time for 2500 parameterized elements: 275 ms 9 elements/ms Last received via parameterized elements: 2499 Last received via IN list: 2599 Elapsed time for 2600 IN list elements: 35 ms 74 elements/ms Elapsed time for 2600 parameterized elements: 275 ms 9 elements/ms Last received via parameterized elements: 2599 Last received via IN list: 2699 Elapsed time for 2700 IN list elements: 36 ms 75 elements/ms Elapsed time for 2700 parameterized elements: 292 ms 9 elements/ms Last received via parameterized elements: 2699 Last received via IN list: 2799 Elapsed time for 2800 IN list elements: 41 ms 68 elements/ms Elapsed time for 2800 parameterized elements: 308 ms 9 elements/ms Last received via parameterized elements: 2799 Last received via IN 
list: 2899 Elapsed time for 2900 IN list elements: 40 ms 72 elements/ms Elapsed time for 2900 parameterized elements: 341 ms 8 elements/ms Last received via parameterized elements: 2899 Last received via IN list: 2999 Elapsed time for 3000 IN list elements: 42 ms 71 elements/ms Elapsed time for 3000 parameterized elements: 356 ms 8 elements/ms Last received via parameterized elements: 2999 Last received via IN list: 3099 Elapsed time for 3100 IN list elements: 42 ms 73 elements/ms Elapsed time for 3100 parameterized elements: 342 ms 9 elements/ms Last received via parameterized elements: 3099 Last received via IN list: 3199 Elapsed time for 3200 IN list elements: 43 ms 74 elements/ms Elapsed time for 3200 parameterized elements: 345 ms 9 elements/ms Last received via parameterized elements: 3199 Last received via IN list: 3299 Elapsed time for 3300 IN list elements: 43 ms 76 elements/ms Elapsed time for 3300 parameterized elements: 359 ms 9 elements/ms Last received via parameterized elements: 3299 Last received via IN list: 3399 Elapsed time for 3400 IN list elements: 77 ms 44 elements/ms Elapsed time for 3400 parameterized elements: 356 ms 9 elements/ms Last received via parameterized elements: 3399 Last received via IN list: 3499 Elapsed time for 3500 IN list elements: 50 ms 70 elements/ms Elapsed time for 3500 parameterized elements: 370 ms 9 elements/ms Last received via parameterized elements: 3499 Last received via IN list: 3599 Elapsed time for 3600 IN list elements: 84 ms 42 elements/ms Elapsed time for 3600 parameterized elements: 387 ms 9 elements/ms Last received via parameterized elements: 3599 Last received via IN list: 3699 Elapsed time for 3700 IN list elements: 52 ms 71 elements/ms Elapsed time for 3700 parameterized elements: 390 ms 9 elements/ms Last received via parameterized elements: 3699 Last received via IN list: 3799 Elapsed time for 3800 IN list elements: 54 ms 70 elements/ms Elapsed time for 3800 parameterized elements: 407 ms 9 
elements/ms Last received via parameterized elements: 3799 Last received via IN list: 3899 Elapsed time for 3900 IN list elements: 84 ms 46 elements/ms Elapsed time for 3900 parameterized elements: 422 ms 9 elements/ms Last received via parameterized elements: 3899 Last received via IN list: 3999 Elapsed time for 4000 IN list elements: 58 ms 68 elements/ms Elapsed time for 4000 parameterized elements: 434 ms 9 elements/ms Last received via parameterized elements: 3999 Last received via IN list: 4099 Elapsed time for 4100 IN list elements: 58 ms 70 elements/ms Elapsed time for 4100 parameterized elements: 446 ms 9 elements/ms Last received via parameterized elements: 4099 Last received via IN list: 4199 Elapsed time for 4200 IN list elements: 61 ms 68 elements/ms Elapsed time for 4200 parameterized elements: 453 ms 9 elements/ms Last received via parameterized elements: 4199 Last received via IN list: 4299 Elapsed time for 4300 IN list elements: 63 ms 68 elements/ms Elapsed time for 4300 parameterized elements: 457 ms 9 elements/ms Last received via parameterized elements: 4299 Last received via IN list: 4399 Elapsed time for 4400 IN list elements: 94 ms 46 elements/ms Elapsed time for 4400 parameterized elements: 473 ms 9 elements/ms Last received via parameterized elements: 4399 Last received via IN list: 4499 Elapsed time for 4500 IN list elements: 64 ms 70 elements/ms Elapsed time for 4500 parameterized elements: 506 ms 8 elements/ms Last received via parameterized elements: 4499 Last received via IN list: 4599 Elapsed time for 4600 IN list elements: 72 ms 63 elements/ms Elapsed time for 4600 parameterized elements: 527 ms 8 elements/ms Last received via parameterized elements: 4599 Last received via IN list: 4699 Elapsed time for 4700 IN list elements: 68 ms 69 elements/ms Elapsed time for 4700 parameterized elements: 499 ms 9 elements/ms Last received via parameterized elements: 4699 Last received via IN list: 4799 Elapsed time for 4800 IN list elements: 70 ms 
68 elements/ms Elapsed time for 4800 parameterized elements: 539 ms 8 elements/ms Last received via parameterized elements: 4799 Last received via IN list: 4899 Elapsed time for 4900 IN list elements: 100 ms 49 elements/ms Elapsed time for 4900 parameterized elements: 590 ms 8 elements/ms Last received via parameterized elements: 4899 Last received via IN list: 4999 Elapsed time for 5000 IN list elements: 111 ms 45 elements/ms Elapsed time for 5000 parameterized elements: 531 ms 9 elements/ms Last received via parameterized elements: 4999 Last received via IN list: 5099 Elapsed time for 5100 IN list elements: 79 ms 64 elements/ms Elapsed time for 5100 parameterized elements: 542 ms 9 elements/ms Last received via parameterized elements: 5099 Last received via IN list: 5199 Elapsed time for 5200 IN list elements: 108 ms 48 elements/ms Elapsed time for 5200 parameterized elements: 560 ms 9 elements/ms Last received via parameterized elements: 5199 Last received via IN list: 5299 Elapsed time for 5300 IN list elements: 78 ms 67 elements/ms Elapsed time for 5300 parameterized elements: 560 ms 9 elements/ms Last received via parameterized elements: 5299 Last received via IN list: 5399 Elapsed time for 5400 IN list elements: 81 ms 66 elements/ms Elapsed time for 5400 parameterized elements: 572 ms 9 elements/ms Last received via parameterized elements: 5399 Last received via IN list: 5499 Elapsed time for 5500 IN list elements: 80 ms 68 elements/ms Elapsed time for 5500 parameterized elements: 627 ms 8 elements/ms Last received via parameterized elements: 5499 Last received via IN list: 5599 Elapsed time for 5600 IN list elements: 81 ms 69 elements/ms Elapsed time for 5600 parameterized elements: 590 ms 9 elements/ms Last received via parameterized elements: 5599 Last received via IN list: 5699 Elapsed time for 5700 IN list elements: 86 ms 66 elements/ms Elapsed time for 5700 parameterized elements: 606 ms 9 elements/ms Last received via parameterized elements: 5699 Last 
received via IN list: 5799 Elapsed time for 5800 IN list elements: 85 ms 68 elements/ms Elapsed time for 5800 parameterized elements: 663 ms 8 elements/ms Last received via parameterized elements: 5799 Last received via IN list: 5899 Elapsed time for 5900 IN list elements: 83 ms 71 elements/ms Elapsed time for 5900 parameterized elements: 629 ms 9 elements/ms Last received via parameterized elements: 5899 Last received via IN list: 5999 Elapsed time for 6000 IN list elements: 91 ms 65 elements/ms Elapsed time for 6000 parameterized elements: 658 ms 9 elements/ms Last received via parameterized elements: 5999 Last received via IN list: 6099 Elapsed time for 6100 IN list elements: 87 ms 70 elements/ms Elapsed time for 6100 parameterized elements: 702 ms 8 elements/ms Last received via parameterized elements: 6099 Last received via IN list: 6199 Elapsed time for 6200 IN list elements: 88 ms 70 elements/ms Elapsed time for 6200 parameterized elements: 667 ms 9 elements/ms Last received via parameterized elements: 6199 Last received via IN list: 6299 Elapsed time for 6300 IN list elements: 91 ms 69 elements/ms Elapsed time for 6300 parameterized elements: 670 ms 9 elements/ms Last received via parameterized elements: 6299 Last received via IN list: 6399 Elapsed time for 6400 IN list elements: 91 ms 70 elements/ms Elapsed time for 6400 parameterized elements: 708 ms 9 elements/ms Last received via parameterized elements: 6399 Last received via IN list: 6499 Elapsed time for 6500 IN list elements: 96 ms 67 elements/ms Elapsed time for 6500 parameterized elements: 687 ms 9 elements/ms Last received via parameterized elements: 6499 Last received via IN list: 6599 Elapsed time for 6600 IN list elements: 95 ms 69 elements/ms Elapsed time for 6600 parameterized elements: 701 ms 9 elements/ms Last received via parameterized elements: 6599 Last received via IN list: 6699 Elapsed time for 6700 IN list elements: 134 ms 50 elements/ms Elapsed time for 6700 parameterized elements: 
711 ms 9 elements/ms Last received via parameterized elements: 6699 Last received via IN list: 6799 Elapsed time for 6800 IN list elements: 98 ms 69 elements/ms Elapsed time for 6800 parameterized elements: 723 ms 9 elements/ms Last received via parameterized elements: 6799 Last received via IN list: 6899 Elapsed time for 6900 IN list elements: 99 ms 69 elements/ms Elapsed time for 6900 parameterized elements: 728 ms 9 elements/ms Last received via parameterized elements: 6899 Last received via IN list: 6999 Elapsed time for 7000 IN list elements: 141 ms 49 elements/ms Elapsed time for 7000 parameterized elements: 733 ms 9 elements/ms Last received via parameterized elements: 6999 Last received via IN list: 7099 Elapsed time for 7100 IN list elements: 108 ms 65 elements/ms Elapsed time for 7100 parameterized elements: 752 ms 9 elements/ms Last received via parameterized elements: 7099 Last received via IN list: 7199 Elapsed time for 7200 IN list elements: 104 ms 69 elements/ms Elapsed time for 7200 parameterized elements: 769 ms 9 elements/ms Last received via parameterized elements: 7199 Last received via IN list: 7299 Elapsed time for 7300 IN list elements: 142 ms 51 elements/ms Elapsed time for 7300 parameterized elements: 778 ms 9 elements/ms Last received via parameterized elements: 7299 Last received via IN list: 7399 Elapsed time for 7400 IN list elements: 127 ms 58 elements/ms Elapsed time for 7400 parameterized elements: 796 ms 9 elements/ms Last received via parameterized elements: 7399 Last received via IN list: 7499 Elapsed time for 7500 IN list elements: 110 ms 68 elements/ms Elapsed time for 7500 parameterized elements: 796 ms 9 elements/ms Last received via parameterized elements: 7499 Last received via IN list: 7599 Elapsed time for 7600 IN list elements: 144 ms 52 elements/ms Elapsed time for 7600 parameterized elements: 994 ms 7 elements/ms Last received via parameterized elements: 7599 Last received via IN list: 7699 Elapsed time for 7700 IN list 
elements: 122 ms 63 elements/ms Elapsed time for 7700 parameterized elements: 819 ms 9 elements/ms Last received via parameterized elements: 7699 Last received via IN list: 7799 Elapsed time for 7800 IN list elements: 114 ms 68 elements/ms Elapsed time for 7800 parameterized elements: 893 ms 8 elements/ms Last received via parameterized elements: 7799 Last received via IN list: 7899 Elapsed time for 7900 IN list elements: 152 ms 51 elements/ms Elapsed time for 7900 parameterized elements: 841 ms 9 elements/ms Last received via parameterized elements: 7899 Last received via IN list: 7999 Elapsed time for 8000 IN list elements: 119 ms 67 elements/ms Elapsed time for 8000 parameterized elements: 853 ms 9 elements/ms Last received via parameterized elements: 7999 Last received via IN list: 8099 Elapsed time for 8100 IN list elements: 121 ms 66 elements/ms Elapsed time for 8100 parameterized elements: 858 ms 9 elements/ms Last received via parameterized elements: 8099 Last received via IN list: 8199 Elapsed time for 8200 IN list elements: 157 ms 52 elements/ms Elapsed time for 8200 parameterized elements: 871 ms 9 elements/ms Last received via parameterized elements: 8199 Last received via IN list: 8299 Elapsed time for 8300 IN list elements: 131 ms 63 elements/ms Elapsed time for 8300 parameterized elements: 879 ms 9 elements/ms Last received via parameterized elements: 8299 Last received via IN list: 8399 Elapsed time for 8400 IN list elements: 124 ms 67 elements/ms Elapsed time for 8400 parameterized elements: 890 ms 9 elements/ms Last received via parameterized elements: 8399 Last received via IN list: 8499 Elapsed time for 8500 IN list elements: 163 ms 52 elements/ms Elapsed time for 8500 parameterized elements: 904 ms 9 elements/ms Last received via parameterized elements: 8499 Last received via IN list: 8599 Elapsed time for 8600 IN list elements: 131 ms 65 elements/ms Elapsed time for 8600 parameterized elements: 907 ms 9 elements/ms Last received via 
parameterized elements: 8599 Last received via IN list: 8699 Elapsed time for 8700 IN list elements: 127 ms 68 elements/ms Elapsed time for 8700 parameterized elements: 947 ms 9 elements/ms Last received via parameterized elements: 8699 Last received via IN list: 8799 Elapsed time for 8800 IN list elements: 166 ms 53 elements/ms Elapsed time for 8800 parameterized elements: 933 ms 9 elements/ms Last received via parameterized elements: 8799 Last received via IN list: 8899 Elapsed time for 8900 IN list elements: 133 ms 66 elements/ms Elapsed time for 8900 parameterized elements: 947 ms 9 elements/ms Last received via parameterized elements: 8899 Last received via IN list: 8999 Elapsed time for 9000 IN list elements: 142 ms 63 elements/ms Elapsed time for 9000 parameterized elements: 951 ms 9 elements/ms Last received via parameterized elements: 8999 Last received via IN list: 9099 Elapsed time for 9100 IN list elements: 173 ms 52 elements/ms Elapsed time for 9100 parameterized elements: 961 ms 9 elements/ms Last received via parameterized elements: 9099 Last received via IN list: 9199 Elapsed time for 9200 IN list elements: 138 ms 66 elements/ms Elapsed time for 9200 parameterized elements: 979 ms 9 elements/ms Last received via parameterized elements: 9199 Last received via IN list: 9299 Elapsed time for 9300 IN list elements: 135 ms 68 elements/ms Elapsed time for 9300 parameterized elements: 985 ms 9 elements/ms Last received via parameterized elements: 9299 Last received via IN list: 9399 Elapsed time for 9400 IN list elements: 175 ms 53 elements/ms Elapsed time for 9400 parameterized elements: 1000 ms 9 elements/ms Last received via parameterized elements: 9399 Last received via IN list: 9499 Elapsed time for 9500 IN list elements: 136 ms 69 elements/ms Elapsed time for 9500 parameterized elements: 1010 ms 9 elements/ms Last received via parameterized elements: 9499 Last received via IN list: 9599 Elapsed time for 9600 IN list elements: 144 ms 66 elements/ms 
Elapsed time for 9600 parameterized elements: 1017 ms 9 elements/ms
Last received via parameterized elements: 9599
Last received via IN list: 9699
Elapsed time for 9700 IN list elements: 186 ms 52 elements/ms
Elapsed time for 9700 parameterized elements: 1021 ms 9 elements/ms
Last received via parameterized elements: 9699
Last received via IN list: 9799
Elapsed time for 9800 IN list elements: 142 ms 69 elements/ms
Elapsed time for 9800 parameterized elements: 1048 ms 9 elements/ms
Last received via parameterized elements: 9799
Last received via IN list: 9899
Elapsed time for 9900 IN list elements: 145 ms 68 elements/ms
Elapsed time for 9900 parameterized elements: 1054 ms 9 elements/ms
Last received via parameterized elements: 9899

Local-to-local on AMD 1600MHz, PostgreSQL 7.3.2, RH9.0

Table update time: 169xxx ms

*Without* VACUUMDB, IN lists suck beyond measure. For just 10 elements in the IN list:
Elapsed time for 10 IN list elements: 2638 ms 0 elements/ms
Elapsed time for 10 parameterized elements: 9 ms 1 elements/ms

*With* VACUUMDB, IN lists recover a bit:
Elapsed time for 10 IN list elements: 10 ms 1 elements/ms
Elapsed time for 10 parameterized elements: 24 ms 0 elements/ms
[...]
Elapsed time for 99 IN list elements: 16 ms 6 elements/ms
Elapsed time for 99 parameterized elements: 36 ms 2 elements/ms
Last received via parameterized elements: 98

Testing for some larger numbers of elements, however, reveals another story.

Elapsed time for 1000 IN list elements: 676 ms 1 elements/ms
Elapsed time for 1000 parameterized elements: 351 ms 2 elements/ms
Elapsed time for 2000 IN list elements: 3197 ms 0 elements/ms
Elapsed time for 2000 parameterized elements: 713 ms 2 elements/ms
Elapsed time for 3000 IN list elements: 7266 ms 0 elements/ms
Elapsed time for 3000 parameterized elements: 1080 ms 2 elements/ms
Elapsed time for 4000 IN list elements: 13341 ms 0 elements/ms
Elapsed time for 4000 parameterized elements: 1449 ms 2 elements/ms

Based on the by-the-1000's approach, IN lists are on a seriously bad curve, while parameterized statements are linear. The cutoff on my AMD 1600MHz machine is at about 700 elements:

Elapsed time for 600 IN list elements: 157 ms 3 elements/ms
Elapsed time for 600 parameterized elements: 246 ms 2 elements/ms
Elapsed time for 700 IN list elements: 269 ms 2 elements/ms
Elapsed time for 700 parameterized elements: 263 ms 2 elements/ms

Local (Intel PIII-M 1GHz) to REMOTE (AMD 1600MHz), PostgreSQL 7.3.2, 100 -> 1000 by 100's.
IN List Query string length is 637 Last received via IN list: 99 Elapsed time for 100 IN list elements: 22 ms 4 elements/ms Elapsed time for 100 parameterized elements: 117 ms 0 elements/ms Last received via parameterized elements: 99 IN List Query string length is 1237 Last received via IN list: 199 Elapsed time for 200 IN list elements: 39 ms 5 elements/ms Elapsed time for 200 parameterized elements: 169 ms 1 elements/ms Last received via parameterized elements: 199 IN List Query string length is 1837 Last received via IN list: 299 Elapsed time for 300 IN list elements: 81 ms 3 elements/ms Elapsed time for 300 parameterized elements: 136 ms 2 elements/ms Last received via parameterized elements: 299 IN List Query string length is 2437 Last received via IN list: 399 Elapsed time for 400 IN list elements: 108 ms 3 elements/ms Elapsed time for 400 parameterized elements: 157 ms 2 elements/ms Last received via parameterized elements: 399 IN List Query string length is 3037 Last received via IN list: 499 Elapsed time for 500 IN list elements: 99 ms 5 elements/ms Elapsed time for 500 parameterized elements: 232 ms 2 elements/ms Last received via parameterized elements: 499 IN List Query string length is 3637 Last received via IN list: 599 Elapsed time for 600 IN list elements: 150 ms 4 elements/ms Elapsed time for 600 parameterized elements: 182 ms 3 elements/ms Last received via parameterized elements: 599 IN List Query string length is 4237 Last received via IN list: 699 Elapsed time for 700 IN list elements: 256 ms 2 elements/ms Elapsed time for 700 parameterized elements: 219 ms 3 elements/ms Last received via parameterized elements: 699 IN List Query string length is 4837 Last received via IN list: 799 Elapsed time for 800 IN list elements: 370 ms 2 elements/ms Elapsed time for 800 parameterized elements: 245 ms 3 elements/ms Last received via parameterized elements: 799 IN List Query string length is 5437 Last received via IN list: 899 Elapsed time for 900 IN 
list elements: 511 ms 1 elements/ms Elapsed time for 900 parameterized elements: 265 ms 3 elements/ms Last received via parameterized elements: 899 Curiously, the difference between local->local and local->remote here is small. The breakeven point was about 650 elements, versus slightly closer to 700 in local->local. Local(Intel PIII-M 1Ghz)-to-REMOTE(AMD1600Mhz), PostgreSQL 7.3.2 1000 -> 10000 by 1000's. IN List Query string length is 6037 Last received via IN list: 999 Elapsed time for 1000 IN list elements: 772 ms 1 elements/ms Elapsed time for 1000 parameterized elements: 495 ms 2 elements/ms Last received via parameterized elements: 999 IN List Query string length is 12037 Last received via IN list: 1999 Elapsed time for 2000 IN list elements: 3055 ms 0 elements/ms Elapsed time for 2000 parameterized elements: 681 ms 2 elements/ms Last received via parameterized elements: 1999 IN List Query string length is 18037 Last received via IN list: 2999 Elapsed time for 3000 IN list elements: 7205 ms 0 elements/ms Elapsed time for 3000 parameterized elements: 888 ms 3 elements/ms Last received via parameterized elements: 2999 IN List Query string length is 24037 Last received via IN list: 3999 Elapsed time for 4000 IN list elements: 12912 ms 0 elements/ms Elapsed time for 4000 parameterized elements: 1162 ms 3 elements/ms Last received via parameterized elements: 3999 IN List Query string length is 30037 Last received via IN list: 4999 Elapsed time for 5000 IN list elements: 21844 ms 0 elements/ms Elapsed time for 5000 parameterized elements: 1461 ms 3 elements/ms Last received via parameterized elements: 4999 IN List Query string length is 36037 Last received via IN list: 5999 Elapsed time for 6000 IN list elements: 32595 ms 0 elements/ms Elapsed time for 6000 parameterized elements: 1780 ms 3 elements/ms Last received via parameterized elements: 5999 IN List Query string length is 42037 Last received via IN list: 6999 Elapsed time for 7000 IN list elements: 46566 ms 
0 elements/ms
Elapsed time for 7000 parameterized elements: 2027 ms  3 elements/ms
Last received via parameterized elements: 6999

IN List Query string length is 48037
java.sql.SQLException: ERROR:  out of free buffers: time to abort!
        at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:126)
        at org.postgresql.jdbc1.AbstractJdbc1Connection.ExecSQL(AbstractJdbc1Connection.java:451)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.execute(AbstractJdbc1Statement.java:281)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:48)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.executeQuery(AbstractJdbc1Statement.java:144)
        at org.postgresql.jdbc1.AbstractJdbc1Statement.executeQuery(AbstractJdbc1Statement.java:132)
        at INList.main(INList.java:775)
*/

import java.sql.*;

class INList {
    public static void main(String[] args) {
        try {
            final int NROWS = 1000000;   // one million
            final int IDSTART = 10000;   // first ID value

            // ---------------- MySql driver -----------------
            // -classpath .:/usr/java/mysql-connector-java-2.0.14/mysql-connector-java-2.0.14-bin.jar
            // RH9: -classpath .:/home/dave/mysql-connector-java-3.0.8-stable/mysql-connector-java-3.0.8-stable-bin.jar
            //Class.forName("com.mysql.jdbc.Driver");
            //Connection conn = DriverManager.getConnection("jdbc:mysql:///test", "dave", "");
            //String innodb = " TYPE=INNODB";

            // --------------- postgres 7.3.2 distribution JDBC2 driver linked as /usr/java/postgresql-jdbc.jar
            // -classpath .:/usr/java/postgresql-jdbc.jar
            Class.forName("org.postgresql.Driver");
            Connection conn = DriverManager.getConnection("jdbc:postgresql:test", "dave", "");
            String innodb = "";

            conn.setAutoCommit(false);

            long oldmillis, newmillis, elapsed;
            Statement statement = conn.createStatement();
            PreparedStatement pstatement;
            ResultSet rs;
            int lastReceived;

            if (false) {
                statement.executeUpdate("CREATE TABLE MILLION (ID INTEGER PRIMARY KEY, VAL INTEGER)" + innodb);
                conn.commit(); // Read that "all statements of a connection commit/rollback as a group", API certainly implies this
                // conn.commit() ; pstatement should potentially fail because table isn't there without this!

                // Create the rows
                pstatement = conn.prepareStatement("INSERT INTO MILLION (ID, VAL) VALUES(?, ?)");
                oldmillis = System.currentTimeMillis();
                for (int i = 0; i < NROWS; i++) {
                    pstatement.setInt(1, IDSTART + i);
                    pstatement.setInt(2, i);
                    pstatement.executeUpdate();
                }
                conn.commit();
                newmillis = System.currentTimeMillis();
                pstatement.close();
                System.out.println("Table update time: " + (newmillis - oldmillis) + " ms");

                rs = statement.executeQuery("SELECT COUNT(*) FROM MILLION");
                rs.first();
                if (rs.getInt(1) != NROWS)
                    System.out.println("**** Expected " + (NROWS) + " rows, got " + rs.getInt(1));
                conn.commit();

                // Determine largest IN LIST
                StringBuffer sb = new StringBuffer(8000 * 1024);
                boolean successful = false;
                int tryAmount = NROWS;
                while (!successful) {
                    sb.setLength(0);
                    sb.append("SELECT MAX(VAL) FROM MILLION WHERE ID IN (");
                    for (int i = 0; i < tryAmount; i++) {
                        sb.append(i + IDSTART);
                        if (i < tryAmount - 1)
                            sb.append(',');
                    }
                    sb.append(')');
                    try {
                        rs = statement.executeQuery(sb.toString());
                        successful = true;
                    } catch (SQLException e) {
                        System.out.println("IN list failed for " + tryAmount +
                                           " entries, buffer size was " + sb.length());
                        tryAmount -= 10000;
                        if (tryAmount <= 0)
                            break; // be done with the while loop
                    }
                } // while (!successful)
                conn.commit(); // more for the next txn than this one, eliminate txn caching
                System.out.println("Largest IN list: " + tryAmount + " elements.");
            } // if (false)
            else {
                for (int tryAmount = 100; tryAmount < 1000; tryAmount += 100) {
                    StringBuffer sb = new StringBuffer(tryAmount * 20);
                    sb.append("SELECT VAL FROM MILLION WHERE ID IN (");
                    for (int i = 0; i < tryAmount; i++) {
                        sb.append(i + IDSTART);
                        if (i + 1 < tryAmount)
                            sb.append(',');
                    }
                    sb.append(')');

                    // Try query with largest inlist, getting resultset with all values
                    String query = sb.toString();
                    System.out.println("IN List Query string length is " + query.length());
                    //query = query.replaceFirst("MAX(VAL)", "VAL");
                    oldmillis = System.currentTimeMillis();
                    rs = statement.executeQuery(query);
                    lastReceived = -1;
                    while (rs.next())
                        lastReceived = rs.getInt(1);
                    System.out.println("Last received via IN list: " + lastReceived);
                    conn.commit();
                    newmillis = System.currentTimeMillis();
                    elapsed = newmillis - oldmillis;
                    System.out.println("Elapsed time for " + tryAmount + " IN list elements: " + elapsed + " ms");
                    System.out.println((tryAmount / elapsed) + " elements/ms");

                    // Try the query with prepared statements
                    lastReceived = -1;
                    pstatement = conn.prepareStatement("SELECT VAL FROM MILLION WHERE ID = ?");
                    oldmillis = System.currentTimeMillis();
                    for (int i = 0; i < tryAmount; i++) {
                        pstatement.setInt(1, i + IDSTART);
                        rs = pstatement.executeQuery();
                        if (!rs.first())
                            System.out.println("Pstatement query failed");
                        else
                            lastReceived = rs.getInt(1);
                    }
                    conn.commit();
                    newmillis = System.currentTimeMillis();
                    pstatement.close();
                    elapsed = newmillis - oldmillis;
                    System.out.println("Elapsed time for " + tryAmount + " parameterized elements: " + elapsed + " ms");
                    System.out.println((tryAmount / elapsed) + " elements/ms");
                    System.out.println("Last received via parameterized elements: " + lastReceived);
                } // for (tryAmount ...)
            } // else (if (false))

            // Done, delete our table
            // statement.executeUpdate("DROP TABLE MILLION");
            conn.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    } // main()
} // class INList
On Wednesday 28 May 2003 18:21, Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)

How do you derive this list of numbers? If it is from the same database, can you rewrite the query using a join statement?

HTH
Shridhar
Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists. I ran
> an experiment that ran queries on a table of two integers (ID, VAL),
> where ID is a primary key and the subject of IN list predicates. The
> test used a table with one million rows. ID is appropriately indexed,
> and I have VACUUMED/analyzed the database after table load.
>
> I ran tests on in-lists from about 100 to 100,000 entries.

Hi Dave,

it sounds as if that IN-list is created by the application. I wonder if there are really so many variances and combinations of it or whether you could invent an additional column, which groups all those individual values. If possible, you could reduce your IN list to much fewer values, and probably would get better performance (using an index on that col, of course).

Regards,
Andreas
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

You should rewrite your query if the query is created from an application:

SELECT val FROM table
WHERE id BETWEEN 43 AND 100002
  AND id IN (43, 49, 1001, 100002, ...)

where 43 is the min and 100002 the max of all values. I had this case with postgresql 7.2 and the planner made much smarter choices in my case.

Regards,
Mario Weilguni
On Wed, 28 May 2003, Dave Tenny wrote:
> Having grepped the web, it's clear that this isn't the first or last
> time this issue will be raised.
>
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
>
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
>
> Performance is critical, and the size of these lists depends a lot on
> how the larger 3-tier application is used, but it wouldn't be out of
> the question to retrieve 3000-10000 items.
>
> PostgreSQL 7.3.2 seems to have a lot of trouble with large lists.

It gets converted into a sequence like col=list[0] or col=list[1], and it seems the planner/optimizer is taking at least a large amount of time for me, given that explain takes just over 80 seconds for a 9900-item list on my machine (I don't have a data-filled table to run the actual query against). The best plan may be generated right now from making a temporary table, copying the values into it, and joining.

> 2) What is the expected acceptable limit for the number of items in an
> IN list predicate such as those used here. (List of constants, not
> subselects).

As a note, 7.4 by default seems to limit it to 10000 unless you up max_expr_depth afaics.
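The temporary-table idea above can be sketched concretely. The following is a hypothetical client-side helper, not code from the thread: the `temp_ids` table name is invented, the `million` table follows the attached test program, and on 7.3-era servers each row needs its own INSERT (COPY would be faster for bulk loads). It only builds the SQL a client would send:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the temp-table workaround: load the constants
// into a scratch table and join against it instead of using an IN list.
// temp_ids is an invented name; million follows the test program.
public class TempTableSketch {
    // Returns the sequence of statements a client would issue.
    public static List<String> buildStatements(int[] ids) {
        List<String> stmts = new ArrayList<String>();
        stmts.add("CREATE TEMP TABLE temp_ids (id INTEGER)");
        for (int id : ids) {
            // One INSERT per value; multi-row VALUES lists arrived much later.
            stmts.add("INSERT INTO temp_ids VALUES (" + id + ")");
        }
        stmts.add("SELECT m.val FROM million m, temp_ids t WHERE m.id = t.id");
        stmts.add("DROP TABLE temp_ids");
        return stmts;
    }
}
```

Whether the round trips for the inserts outweigh the planner savings is exactly the cost question Dave raises later in the thread.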
A join isn't an option; these elements come as a selection of entity IDs that are specific to some client context.
Some other people suggested joins too.
Consider it something like this: say there's a database that represents the entire file system content
of a set of machines, hundreds of thousands of files. A single user wants to do something
related to the IDs of 3000 files. The requests for those 3000 files can be built up in a number of ways,
not all of which rely on data in the database. So I need to retrieve data on those 3000 files using IN lists or some alternative.
Dave
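Given the breakeven of roughly 700 elements measured above, one hypothetical workaround (not proposed in the thread) would be to split the 3000 IDs into fixed-size chunks, issue one bounded IN-list query per chunk, and merge the results client-side. Table and column names follow the attached test program; the chunk size would need tuning per installation. A sketch of the query generation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep each IN list under the measured breakeven by
// chunking the ID set into several smaller queries. Names follow the
// test program (million / id / val); chunkSize is a tuning assumption.
public class ChunkedInList {
    public static List<String> buildQueries(int[] ids, int chunkSize) {
        List<String> queries = new ArrayList<String>();
        for (int start = 0; start < ids.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, ids.length);
            StringBuilder sb = new StringBuilder("SELECT val FROM million WHERE id IN (");
            for (int i = start; i < end; i++) {
                if (i > start) sb.append(',');
                sb.append(ids[i]);
            }
            sb.append(')');
            queries.add(sb.toString());
        }
        return queries;
    }
}
```

This trades a few extra round trips for staying on the linear part of the planner's cost curve, and also sidesteps any query-buffer size limits.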
Shridhar Daithankar wrote:
> How do you derive this list of numbers? If it is from the same
> database, can you rewrite the query using a join statement?

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Andreas Pflug wrote:
> it sounds as if that IN-list is created by the application. I wonder
> if there are really so many variances and combinations of it or
> whether you could invent an additional column, which groups all those
> individual values. If possible, you could reduce your IN list to much
> fewer values, and probably would get better performance (using an
> index on that col, of course).

There are over 50 tables in the schema, and dozens of client commands that manipulate the schema in a persistent-checkout kind of way over time, as well as spurious reporting requests that require incredibly complex filtering and combination of data from many tables. I'm pretty much up to my keister in data (and am resisting impulses for denormalization), so this approach probably isn't viable for me.

Now I *could* create a temporary table with the group of values, but I suspect the cost of that substantially outweighs the negative performance of current IN lists or parameterized statements.
I'm reminded to relay to the PostgreSQL devs that I might be able to do more in the join or subquery department if PostgreSQL had better performing MAX functions and a FIRST function for selecting rows from groups. ("Performing" being the operative word here, since the extensible architecture of PostgreSQL currently makes for poorly performing MAX capabilities and presumably similar user-defined aggregate functions.)
Mario Weilguni wrote:
> You should rewrite your query if the query is created from an application:
>
> SELECT val FROM table
> WHERE id BETWEEN 43 AND 100002
>   AND id IN (43, 49, 1001, 100002, ...)
>
> where 43 is the min and 100002 the max of all values.

Very interesting! I tried it out, but it didn't appreciably change the thresholds in my results for IN list
sizes 100 - 1000. It's also likely to be of use only if the range for the between is fairly restricted,
which isn't necessarily characteristic of my data.

Dave
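Mario's rewrite can be generated mechanically from the constant list. A minimal sketch (table and column names follow the attached test program; this only builds the query text):

```java
// Sketch of Mario's suggestion: prepend a BETWEEN on the min and max of
// the constants so the planner can range-restrict the index before the
// IN list is checked. Assumes a non-empty id array.
public class BetweenInList {
    public static String buildQuery(int[] ids) {
        int min = ids[0], max = ids[0];
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < ids.length; i++) {
            min = Math.min(min, ids[i]);
            max = Math.max(max, ids[i]);
            if (i > 0) in.append(',');
            in.append(ids[i]);
        }
        return "SELECT val FROM million WHERE id BETWEEN " + min + " AND " + max
             + " AND id IN (" + in + ")";
    }
}
```

As Dave notes above, this only helps when the constants cluster into a fairly narrow range.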
> I'm reminded to relay to the PostgreSQL devs that I might be able to do
> more in the join or subquery department if PostgreSQL had better
> performing MAX functions and a FIRST function for selecting rows from
> groups.

MIN/MAX is almost in every case replaceable:

select bar from foo order by bar limit 1;

instead of

select max(bar) from foo;
On Wed, May 28, 2003 at 14:08:02 -0400, Dave Tenny <tenny@attbi.com> wrote: > Andreas Pflug wrote: > > I'm reminded to relay to the PostgreSQL devos that I might be able to do > more in the join or subquery department if > PostgreSQL had better performing MAX functions and a FIRST function for > selecting rows from groups. > ("Performing" being the operative word here, since the extensible > architecture of PostgreSQL currently makes for poorly > performing MAX capabilities and presumably similar user defined > aggregate functions). Have you tried replacing max with a subselect that uses order by and limit?
On Wed, May 28, 2003 at 13:58:14 -0400, Dave Tenny <tenny@attbi.com> wrote:
> A join isn't an option; these elements come as a selection of entity IDs
> that are specific to some client context.
> Some other people suggested joins too.

You can union the values together and then join (or use where exists) with the result. This may not be faster and you may not be able to union several thousand selects together in a single statement. But it shouldn't be too much work to test it out.
Mario Weilguni wrote:
> MIN/MAX is almost in every case replaceable:
>
> select bar from foo order by bar limit 1;
>
> instead of
>
> select max(bar) from foo;

Yup, been there, done that, but thanks, it's a good tidbit for the PostgreSQL unaware. There are some places however where it doesn't work well in query logic, though I don't have an example off the top of my head since I've worked around it in all my queries.
Bruno Wolff III wrote:
> Have you tried replacing max with a subselect that uses order by and
> limit?

I'm uncertain how that would work, since somewhere in there I still need to elaborate on the 1000 items I want, and they're not necessarily in any particular range, nor do they bear any contiguous group nature.

Also, IN (subquery) is a known performance problem in PGSQL, at least if the subquery is going to return many rows. It's too bad, since I'm rather fond of subqueries, but I avoid them like the plague in PostgreSQL.

Perhaps I don't understand what you had in mind.
Bruno Wolff III wrote:
> You can union the values together and then join (or use where exists)
> with the result. This may not be faster and you may not be able to
> union several thousand selects together in a single statement. But it
> shouldn't be too much work to test it out.

I assume you mean something like:

test=# select million.id, million.val from million, (select 10000 as a union select 20000 as a) t2 where million.id = t2.a;
  id   |  val
-------+-------
 10000 |     0
 20000 | 10000
(2 rows)

Ouch! That's deviant. Haven't tried it yet and I cringe at the thought of it, but I might take a run at it. However that's going to run up the buffer space quickly. That was one of my as yet unanswered questions: what is the pragmatic buffer size limit for queries?

I'm really hoping we'll come up with something better, like an understanding of why IN lists are non-linear in the first place when the column is indexed, and whether it's fixable through some other means or whether it's a bug that should be fixed. After all, I'm trying to support multiple databases, and other databases kick butt on this. It's just PostgreSQL that's having difficulty.

(Btw, I've also tried statement batching, but that's a loss for now, at least with the current JDBC drivers and 7.3.2.)
On Wed, May 28, 2003 at 16:01:34 -0400, Dave Tenny <tenny@attbi.com> wrote:
> Perhaps I don't understand what you had in mind.

I was referring to your comment about max causing problems. But it seems you are aware of the standard workaround.
On Wed, May 28, 2003 at 16:13:17 -0400, Dave Tenny <tenny@attbi.com> wrote:
> Ouch! That's deviant. Haven't tried it yet and I cringe at the thought
> of it, but I might take a run at it. However that's going to run up the
> buffer space quickly. That was one of my as yet unanswered questions:
> what is the pragmatic buffer size limit for queries?

That is what I was referring to. I have used this in some cases where I knew the list was small and I wanted to do a set difference without loading a temporary table, or to do an insert of multiple rows with one insert statement.

> I'm /really/ hoping we'll come up with something better, like an
> understanding of why IN lists are non-linear in the first place when
> the column is indexed, and whether it's fixable through some other
> means or whether it's a bug that should be fixed.

It also might be worth seeing if the development version is going to speed things up for you. Beta is one month away. My guess is that the production release will be in September.
Dave Tenny <tenny@attbi.com> writes:
> My application relies heavily on IN lists. The lists are primarily
> constant integers, so queries look like:
> SELECT val FROM table WHERE id IN (43, 49, 1001, 100002, ...)
> 1) PostgreSQL exhibits worse-than-linear performance behavior with
> respect to IN list size.

Yeah. There are a couple of places in the planner that have O(N^2) behavior on sufficiently large WHERE clauses, due to building lists in a naive way (repeated lappend() operations). The inner loop is very tight, but nonetheless when you do it tens of millions of times it adds up :-(

I have just committed some fixes into CVS tip for this --- I see about a 10x speedup in planning time on test cases involving 10000-OR-item WHERE clauses. We looked at this once before; the test cases I'm using actually date back to Jan 2000. But it seems some slowness has crept in due to subsequent planning improvements.

> 4) God help you if you haven't vacuum/analyzed the newly loaded table.

Without knowledge that the id field is unique, the planner is likely to tilt away from an indexscan with not too many IN items. I don't consider this a bug.

> PostgreSQL craps out trying to process 8000 elements with the error:
> out of free buffers: time to abort!

This is a known bug in 7.3.2; it's fixed in 7.3.3.

			regards, tom lane
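The repeated-lappend() cost Tom describes can be illustrated with a toy model (this is not the planner code, just an analogy): appending to the tail of a singly-linked list requires walking the whole list each time, so the total number of node visits for N appends grows quadratically even though each individual step is trivial.

```java
// Toy model of naive tail-append on a singly-linked list, the pattern
// behind the O(N^2) planner behavior described above. Counts how many
// next-links must be traversed to build a list of n items.
public class LappendDemo {
    static class Node {
        int v;
        Node next;
        Node(int v) { this.v = v; }
    }

    public static long naiveAppendTraversals(int n) {
        Node head = null;
        long visits = 0;
        for (int i = 0; i < n; i++) {
            Node fresh = new Node(i);
            if (head == null) { head = fresh; continue; }
            Node cur = head;
            while (cur.next != null) { cur = cur.next; visits++; }  // walk to tail
            cur.next = fresh;
        }
        return visits;  // (n-2)(n-1)/2 for n >= 2
    }
}
```

Ten times the items costs roughly a hundred times the traversals, which matches the worse-than-linear timings in the benchmark. (Keeping a tail pointer, or the batched list-building Tom committed, brings this back to linear.)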
Tom Lane wrote:
> Yeah. There are a couple of places in the planner that have O(N^2)
> behavior on sufficiently large WHERE clauses, due to building lists in
> a naive way (repeated lappend() operations). The inner loop is very
> tight, but nonetheless when you do it tens of millions of times it
> adds up :-(

It was also showing up very clearly in the timings I attached for just a couple hundred IN list entries too. Does that mean that the time for the O(1) portions of the logic are perhaps also in need of a tuneup? (I would think O(N^2) for trivial operations on a fast machine wouldn't manifest quite so much for smallish N.)

> I have just committed some fixes into CVS tip for this --- I see about
> a 10x speedup in planning time on test cases involving 10000-OR-item
> WHERE clauses.

So what version might that equate to down the road, so I can be sure to check it out? I'm afraid I'm not up to testing CVS tips.

> Without knowledge that the id field is unique, the planner is likely
> to tilt away from an indexscan with not too many IN items. I don't
> consider this a bug.

There is one very interesting thing in my test case though. It certainly *seemed* as if the parameterized statements were successfully using the index of the freshly-created-but-unanalyzed table, or else the times on those queries would have been terrible too. It was only the IN list form of query that wasn't making correct use of the index. How can the planner recognize uniqueness for one case but not the other? (Since I ran both forms of query on the same table with and without vacuuming.)

> This is a known bug in 7.3.2; it's fixed in 7.3.3.

Thanks, that's good to know.

Dave
Dave Tenny <tenny@attbi.com> writes:
> There is one very interesting thing in my test case though. It
> certainly *seemed* as if the parameterized statements were successfully
> using the index of the freshly-created-but-unanalyzed table, or else
> the times on those queries would have been terrible too. It was only
> the IN list form of query that wasn't making correct use of the index.
> How can the planner recognize uniqueness for one case but not the
> other?

The question is whether a seqscan will be faster than an indexscan; at some point there's less I/O involved to just scan the table once. If the planner doesn't know the index is unique then it's going to estimate a higher cost for the indexscan (due to more rows fetched) and there is some number of rows at which it will flip over to a seqscan. The same will happen even if it *does* know the index is unique, it's just that it will take more IN elements to make it happen.

This is reasonable behavior IMHO, although whether the flip-over point is anywhere near the actual breakeven point on your hardware is anyone's guess. The cost estimates are often far enough off that it's not very close.

			regards, tom lane
The IN list processing has been fixed in 7.4 CVS. It now uses a hash-based lookup rather than a list, so it's vastly faster.
Chris
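As a loose analogy for Chris's point (this is not the server implementation): checking each row's id against a chain of OR comparisons costs time proportional to the list length per row, while hashing the constants once makes each membership probe effectively constant-time.

```java
import java.util.HashSet;
import java.util.Set;

// Analogy for the 7.4 change: build a hash set of the IN-list constants
// once, then probe it per row, instead of scanning the whole list for
// every row. Method and names are illustrative only.
public class HashLookupDemo {
    public static int matchesViaHash(int[] rowIds, int[] inList) {
        Set<Integer> probe = new HashSet<Integer>();
        for (int v : inList) probe.add(v);   // O(list size), done once
        int hits = 0;
        for (int id : rowIds) {              // O(1) expected per row
            if (probe.contains(id)) hits++;
        }
        return hits;
    }
}
```

The work drops from O(rows x list size) to O(rows + list size), which is why the fix makes large IN lists "vastly faster".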
> Also, IN (subquery) is a known performance problem in PGSQL, at least
> if the subquery is going to return many rows.
> It's too bad, since I'm rather fond of subqueries, but I avoid them
> like the plague in PostgreSQL.

You're not really using a subquery - really just a long list of integers. Subqueries are lightning fast, so long as you convert to the EXISTS form:

SELECT * FROM tab WHERE id IN (SELECT id2 FROM tab2);

converts to:

SELECT * FROM tab WHERE EXISTS (SELECT id2 FROM tab2 WHERE id2=id);

Chris
On Wed, May 28, 2003 at 21:19:56 -0400, Dave Tenny <tenny@attbi.com> wrote:
> So what version might that equate to down the road, so I can be sure to
> check it out? I'm afraid I'm not up to testing CVS tips.

7.4. It is supposed to go into beta July 1. If you are used to installing from source tarballs, you can grab a development snapshot tarball. There is probably a day or two delay depending on if you get it from a mirror or the primary site.
Christopher Kings-Lynne wrote:
> You're not really using a subquery - really just a long list of
> integers. Subqueries are lightning fast, so long as you convert to the
> EXISTS form:
>
> SELECT * FROM tab WHERE id IN (SELECT id2 FROM tab2);
>
> converts to:
>
> SELECT * FROM tab WHERE EXISTS (SELECT id2 FROM tab2 WHERE id2=id);

Oops, you got that out of context; it was a different piece of conversation about subqueries in the IN predicate, not the scalar forms that were my overall discussion point. You're right, I'm using lists of integers; someone else was suggesting using subqueries in some context and I was responding to that.

I hadn't thought of that, it's an excellent tip. I'll have to remember it next time I want to use subqueries. (Again, it's a side topic; my primary concern is scalar-form IN lists.)

Thanks,
Dave