Thread: Join with lower/upper limits doesn't scale well
I have the same schema in two different databases. In "smalldb", the two tables
of interest have about 430,000 rows; in "bigdb", the two tables each contain
about 5.5 million rows. I'm processing the data, and for various reasons it
works out well to process it in 100,000-row chunks. However, it turns out that
for the big schema, selecting 100,000 rows is the longest single step of the
processing.

Below is the explain/analyze output of the query from each database. Since both
tables are indexed on the joined columns, I don't understand why the big table
should be so much slower -- I hoped this would scale well, or at least
O(log(N)), not O(N). What's going on here?

I don't know if I'm reading this right, but it looks like the sort is taking
all the time, but that doesn't make sense, because in both cases it's sorting
100,000 rows.

Thanks,
Craig


bigdb=> explain analyze
bigdb-> select r.row_num, m.molkeys from my_rownum r
bigdb-> join my_molkeys m on (r.version_id = m.version_id)
bigdb-> where r.row_num >= 100000 AND r.row_num < 200000
bigdb-> order by r.row_num;

 Sort  (cost=431000.85..431248.23 rows=98951 width=363) (actual time=46306.748..46417.448 rows=100000 loops=1)
   Sort Key: r.row_num
   ->  Hash Join  (cost=2583.59..422790.68 rows=98951 width=363) (actual time=469.010..45752.131 rows=100000 loops=1)
         Hash Cond: ("outer".version_id = "inner".version_id)
         ->  Seq Scan on my_molkeys m  (cost=0.00..323448.30 rows=5472530 width=363) (actual time=11.243..33299.933 rows=5472532 loops=1)
         ->  Hash  (cost=2336.21..2336.21 rows=98951 width=8) (actual time=442.260..442.260 rows=100000 loops=1)
               ->  Index Scan using i_chm_rownum_row_num on my_rownum r  (cost=0.00..2336.21 rows=98951 width=8) (actual time=47.551..278.736 rows=100000 loops=1)
                     Index Cond: ((row_num >= 100000) AND (row_num < 200000))
 Total runtime: 46543.163 ms


smalldb=> explain analyze
smalldb-> select r.row_num, m.molkeys from my_rownum r
smalldb-> join my_molkeys m on (r.version_id = m.version_id)
smalldb-> where r.row_num >= 100000 AND r.row_num < 200000
smalldb-> order by r.row_num;

 Sort  (cost=43598.23..43853.38 rows=102059 width=295) (actual time=4097.180..4207.733 rows=100000 loops=1)
   Sort Key: r.row_num
   ->  Hash Join  (cost=2665.09..35107.41 rows=102059 width=295) (actual time=411.635..3629.756 rows=100000 loops=1)
         Hash Cond: ("outer".version_id = "inner".version_id)
         ->  Seq Scan on my_molkeys m  (cost=0.00..23378.90 rows=459590 width=295) (actual time=8.563..2011.455 rows=459590 loops=1)
         ->  Hash  (cost=2409.95..2409.95 rows=102059 width=8) (actual time=402.867..402.867 rows=100000 loops=1)
               ->  Index Scan using i_chm_rownum_row_num_8525 on my_rownum r  (cost=0.00..2409.95 rows=102059 width=8) (actual time=37.122..242.528 rows=100000 loops=1)
                     Index Cond: ((row_num >= 100000) AND (row_num < 200000))
 Total runtime: 4333.501 ms


    Table "bigdb.my_rownum"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer |
 parent_id  | integer |
 row_num    | integer |
Indexes:
    "i_chm_rownum_row_num" UNIQUE, btree (row_num)
    "i_chm_rownum_version_id" UNIQUE, btree (version_id)
    "i_chm_rownum_parent_id" btree (parent_id)


   Table "bigdb.my_molkeys"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer |
 molkeys    | text    |
Indexes:
    "i_chm_molkeys_version_id" UNIQUE, btree (version_id)
"Craig James" <craig_james@emolecules.com> writes: > Below is the explain/analyze output of the query from each database. Since > both tables are indexed on the joined columns, I don't understand why the > big table should be so much slower -- I hoped this would scale well, or at > least O(log(N)), not O(N). ... > Sort (cost=431000.85..431248.23 rows=98951 width=363) (actual time=46306.748..46417.448 rows=100000 loops=1) > Sort Key: r.row_num > -> Hash Join (cost=2583.59..422790.68 rows=98951 width=363) (actual time=469.010..45752.131 rows=100000 loops=1) > Hash Cond: ("outer".version_id = "inner".version_id) > -> Seq Scan on my_molkeys m (cost=0.00..323448.30 rows=5472530 width=363) (actual time=11.243..33299.933 rows=5472532loops=1) > -> Hash (cost=2336.21..2336.21 rows=98951 width=8) (actual time=442.260..442.260 rows=100000 loops=1) > -> Index Scan using i_chm_rownum_row_num on my_rownum r (cost=0.00..2336.21 rows=98951 width=8) (actualtime=47.551..278.736 rows=100000 loops=1) > Index Cond: ((row_num >= 100000) AND (row_num < 200000)) > Total runtime: 46543.163 ms It looks like most of the time is being spent in the sequential scan of the my_molkeys at least 33 seconds out of 46 seconds is. The actual sort is taking under a second (the hash join finishes after 45.7s and the sort finishes after 46.4s). The rest of the time (about 13s) is actually being spent in the hash join which makes me think it's overflowing work_mem and having to process the join in two batches. You might be able to speed it up by raising work_mem for this query (you can set work_mem locally using SET) The row_num where clause only narrows down the set of rows coming from the my_rownum table. If you want sublinear performance you would have to provide some way for Postgres to narrow down the rows from my_molkeys without actually having to read them all in. With the query as is the only way I can see that happening would be if you had an index on "my_molkey(version_id)" and "my_rownum(version_id) WHERE row_num between 100000 and 200000". Then it could do a merge join between two index scans. Note than even then I'm surprised the optimizer is bothering with the index for these queries, at least for the 400k case. Do you have enable_seqscan=off or random_page_cost dialled way down? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
The two queries below produce different plans.

select r.version_id, r.row_num, m.molkeys from my_rownum r
join my_molkeys m on (r.version_id = m.version_id)
where r.version_id >= 3200000
and r.version_id < 3300000
order by r.version_id;

select r.version_id, r.row_num, m.molkeys from my_rownum r
join my_molkeys m on (r.version_id = m.version_id)
where r.version_id >= 3200000
and r.version_id < 3300000
and m.version_id >= 3200000
and m.version_id < 3300000
order by r.version_id;

I discovered this while looking at the plans for the first query. It seemed to
be ignoring the fact that it could push the "between" condition along to the
second table, since the condition and the join are on the same indexed columns.
So, I added a redundant condition, and bingo, it was a lot faster. In the
analysis shown below, the timings (about 1.0 and 1.5 seconds respectively) are
for a "hot" database that's been queried a couple of times. In real life on a
"cold" database, the times are more like 10 seconds and 21 seconds, so it's
quite significant.

Thanks,
Craig


db=> explain analyze
db-> select r.version_id, r.row_num, m.molkeys from my_rownum r
db-> join my_molkeys m on (r.version_id = m.version_id)
db-> where r.version_id >= 3200000
db-> and r.version_id < 3300000
db-> order by r.version_id;

 Sort  (cost=264979.51..265091.06 rows=44620 width=366) (actual time=1424.126..1476.048 rows=46947 loops=1)
   Sort Key: r.version_id
   ->  Nested Loop  (cost=366.72..261533.64 rows=44620 width=366) (actual time=41.649..1186.331 rows=46947 loops=1)
         ->  Bitmap Heap Scan on my_rownum r  (cost=366.72..41168.37 rows=44620 width=8) (actual time=41.616..431.783 rows=46947 loops=1)
               Recheck Cond: ((version_id >= 3200000) AND (version_id < 3300000))
               ->  Bitmap Index Scan on i_chm_rownum_version_id_4998  (cost=0.00..366.72 rows=44620 width=0) (actual time=21.244..21.244 rows=46947 loops=1)
                     Index Cond: ((version_id >= 3200000) AND (version_id < 3300000))
         ->  Index Scan using i_chm_molkeys_version_id on my_molkeys m  (cost=0.00..4.93 rows=1 width=362) (actual time=0.009..0.010 rows=1 loops=46947)
               Index Cond: ("outer".version_id = m.version_id)
 Total runtime: 1534.638 ms
(10 rows)


db=> explain analyze
db-> select r.version_id, r.row_num, m.molkeys from my_rownum r
db-> join my_molkeys m on (r.version_id = m.version_id)
db-> where r.version_id >= 3200000
db-> and r.version_id < 3300000
db-> and m.version_id >= 3200000
db-> and m.version_id < 3300000
db-> order by r.version_id;

 Sort  (cost=157732.20..157732.95 rows=298 width=366) (actual time=985.383..1037.423 rows=46947 loops=1)
   Sort Key: r.version_id
   ->  Hash Join  (cost=41279.92..157719.95 rows=298 width=366) (actual time=502.875..805.402 rows=46947 loops=1)
         Hash Cond: ("outer".version_id = "inner".version_id)
         ->  Index Scan using i_chm_molkeys_version_id on my_molkeys m  (cost=0.00..115717.85 rows=47947 width=362) (actual time=0.023..117.270 rows=46947 loops=1)
               Index Cond: ((version_id >= 3200000) AND (version_id < 3300000))
         ->  Hash  (cost=41168.37..41168.37 rows=44620 width=8) (actual time=502.813..502.813 rows=46947 loops=1)
               ->  Bitmap Heap Scan on my_rownum r  (cost=366.72..41168.37 rows=44620 width=8) (actual time=41.621..417.508 rows=46947 loops=1)
                     Recheck Cond: ((version_id >= 3200000) AND (version_id < 3300000))
                     ->  Bitmap Index Scan on i_chm_rownum_version_id_4998  (cost=0.00..366.72 rows=44620 width=0) (actual time=21.174..21.174 rows=46947 loops=1)
                           Index Cond: ((version_id >= 3200000) AND (version_id < 3300000))
 Total runtime: 1096.031 ms
(12 rows)
Sorry, I forgot to mention: This is 8.1.4, with a fairly ordinary configuration
on a 4 GB system.

Craig

Craig James wrote:
> The two queries below produce different plans.
>
> select r.version_id, r.row_num, m.molkeys from my_rownum r
> join my_molkeys m on (r.version_id = m.version_id)
> where r.version_id >= 3200000
> and r.version_id < 3300000
> order by r.version_id;
>
> select r.version_id, r.row_num, m.molkeys from my_rownum r
> join my_molkeys m on (r.version_id = m.version_id)
> where r.version_id >= 3200000
> and r.version_id < 3300000
> and m.version_id >= 3200000
> and m.version_id < 3300000
> order by r.version_id;
>
> I discovered this while looking at the plans for the first query. It
> seemed to be ignoring the fact that it could push the "between"
> condition along to the second table, since the condition and the join
> are on the same indexed columns. So, I added a redundant condition, and
> bingo, it was a lot faster. In the analysis shown below, the timings
> (about 1.0 and 1.5 seconds respectively) are for a "hot" database that's
> been queried a couple of times. In real life on a "cold" database, the
> times are more like 10 seconds and 21 seconds, so it's quite significant.
>
> Thanks,
> Craig
Craig James <craig_james@emolecules.com> writes:

> The two queries below produce different plans.

> select r.version_id, r.row_num, m.molkeys from my_rownum r
> join my_molkeys m on (r.version_id = m.version_id)
> where r.version_id >= 3200000
> and r.version_id < 3300000
> order by r.version_id;

> select r.version_id, r.row_num, m.molkeys from my_rownum r
> join my_molkeys m on (r.version_id = m.version_id)
> where r.version_id >= 3200000
> and r.version_id < 3300000
> and m.version_id >= 3200000
> and m.version_id < 3300000
> order by r.version_id;

Yeah, the planner does not make any attempt to infer implied inequalities,
so it will not generate the last two clauses for you. There is machinery
in there to infer implied *equalities*, which is cheaper (fewer operators
to consider) and much more useful across typical queries such as multiway
joins on the same keys. I'm pretty dubious that it'd be worth the cycles
to search for implied inequalities.

			regards, tom lane
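To make the distinction concrete, here is a small illustration using the
thread's tables (the queries themselves are assumptions added for
illustration, not taken from the original posts):

    -- Implied equality: from the join clause plus r.version_id = 3200000,
    -- the planner can deduce m.version_id = 3200000 on its own.
    SELECT r.version_id, r.row_num, m.molkeys
      FROM my_rownum r
      JOIN my_molkeys m ON (r.version_id = m.version_id)
     WHERE r.version_id = 3200000;

    -- Implied inequality: a range condition is NOT propagated across the
    -- join, so it has to be spelled out on both tables by hand, as in
    -- Craig's faster query above.
    SELECT r.version_id, r.row_num, m.molkeys
      FROM my_rownum r
      JOIN my_molkeys m ON (r.version_id = m.version_id)
     WHERE r.version_id >= 3200000 AND r.version_id < 3300000
       AND m.version_id >= 3200000 AND m.version_id < 3300000
     ORDER BY r.version_id;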
Here's an oddity. I have 10 databases, each with about a dozen connections to
Postgres (about 120 connections total), and at midnight they're all idle.
These are mod_perl programs (like a FastCGI -- they stay connected so they're
ready for instant service). So using "ps -ef" and grep, we find one of the
databases looks like this:

postgres 22708  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(46915) idle
postgres 22709  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(46916) idle
postgres 22710  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(46917) idle
postgres 22711  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(46918) idle
postgres 22712  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(46919) idle
postgres 22724  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42440) idle
postgres 22725  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42441) idle
postgres 22726  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42442) idle
postgres 22727  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42443) idle
postgres 22728  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42444) idle
postgres 22731  7619  0 Jul11 ?  00:00:06 postgres: chemmega chemmega 192.168.10.192(42447) idle

Now here's the weird thing. I'm running a pg_restore of a database (on the
order of 4GB compressed, maybe 34M rows of ordinary data, and 15M rows in one
BLOB table that's typically 2K per blob). When I do this, ALL of the postgres
backends start working at about 1% CPU apiece. This means that the 120 "idle"
postgres backends are together using almost 100% of one CPU on top of the 100%
CPU being used by pg_restore. See the output of top(1) below.

Is this normal? All I can guess at is that something's going on in shared
memory that every Postgres backend has to respond to.
Thanks,
Craig


Tasks: 305 total,   1 running, 304 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.5% us,  1.5% sy,  0.0% ni, 57.8% id,  6.6% wa,  0.2% hi,  0.4% si
Mem:   4151456k total,  4011020k used,   140436k free,    10096k buffers
Swap:  2104504k total,    94136k used,  2010368k free,  3168596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6681 postgres  16   0  217m 188m 161m D 50.4  4.6   4:29.30 postmaster
 1577 root      10  -5     0    0    0 S  1.0  0.0 108:01.97 md0_raid1
 8487 postgres  15   0  187m 8704 4996 S  1.0  0.2   0:06.56 postmaster
 8506 postgres  15   0  187m 8604 4892 S  1.0  0.2   0:06.37 postmaster
 8507 postgres  15   0  187m 8708 5004 S  1.0  0.2   0:06.42 postmaster
 8512 postgres  15   0  187m 8612 4904 S  1.0  0.2   0:06.65 postmaster
 8751 postgres  15   0  187m  10m 7520 S  1.0  0.3   0:07.95 postmaster
 8752 postgres  15   0  187m  10m 7492 S  1.0  0.3   0:07.84 postmaster
14053 postgres  15   0  187m 8752 5044 S  1.0  0.2   0:06.53 postmaster
16515 postgres  15   0  187m 8156 4452 S  1.0  0.2   0:06.33 postmaster
25351 postgres  15   0  187m 9772 6064 S  1.0  0.2   0:06.75 postmaster
25387 postgres  15   0  187m 8444 4752 S  1.0  0.2   0:06.45 postmaster
25425 postgres  15   0  187m 9.8m 6340 S  1.0  0.2   0:06.75 postmaster
30626 postgres  15   0  187m 8472 4792 S  1.0  0.2   0:06.52 postmaster
30628 postgres  15   0  187m 8536 4840 S  1.0  0.2   0:06.50 postmaster
30630 postgres  15   0  187m 8524 4844 S  1.0  0.2   0:06.49 postmaster
30637 postgres  15   0  187m 8692 4880 S  1.0  0.2   0:06.25 postmaster
31679 postgres  15   0  187m 8544 4860 S  1.0  0.2   0:06.39 postmaster
31681 postgres  15   0  187m 8528 4848 S  1.0  0.2   0:06.25 postmaster
 1751 postgres  15   0  187m 8432 4748 S  1.0  0.2   0:06.26 postmaster
11620 postgres  15   0  187m 8344 4644 S  1.0  0.2   0:06.23 postmaster
11654 postgres  15   0  187m 8316 4624 S  1.0  0.2   0:06.36 postmaster
19173 postgres  15   0  187m 9372 5668 S  1.0  0.2   0:06.49 postmaster
19670 postgres  15   0  187m 9236 5528 S  1.0  0.2   0:06.29 postmaster
20380 postgres  15   0  187m 8656 4956 S  1.0  0.2   0:06.20 postmaster
20649 postgres  15   0  187m 8280 4584 S  1.0  0.2   0:06.16 postmaster
22731 postgres  15   0  187m 8408 4700 S  1.0  0.2   0:06.03 postmaster
11045 postgres  15   0  185m  71m  68m S  0.7  1.8   0:19.35 postmaster
 6408 postgres  15   0  187m  11m 7520 S  0.7  0.3   0:07.89 postmaster
 6410 postgres  15   0  187m  10m 7348 S  0.7  0.3   0:07.53 postmaster
 6411 postgres  15   0  187m  10m 7380 S  0.7  0.3   0:07.83 postmaster
 6904 postgres  15   0  187m 8644 4788 S  0.7  0.2   0:06.15 postmaster
 6905 postgres  15   0  187m 8288 4596 S  0.7  0.2   0:06.15 postmaster
 6906 postgres  15   0  187m 8488 4764 S  0.7  0.2   0:06.18 postmaster
 6907 postgres  15   0  187m 8580 4856 S  0.7  0.2   0:06.37 postmaster
 7049 postgres  15   0  187m 8488 4800 S  0.7  0.2   0:06.07 postmaster
 7054 postgres  15   0  187m 8376 4672 S  0.7  0.2   0:06.28 postmaster
 7188 postgres  15   0  187m 8588 4868 S  0.7  0.2   0:06.39 postmaster
 7190 postgres  15   0  187m 8832 5120 S  0.7  0.2   0:06.52 postmaster
 7191 postgres  15   0  187m 8632 4916 S  0.7  0.2   0:06.48 postmaster
 7192 postgres  15   0  187m 8884 5176 S  0.7  0.2   0:06.55 postmaster
 8511 postgres  15   0  187m 8612 4904 S  0.7  0.2   0:06.39 postmaster
 8513 postgres  15   0  187m 8776 5064 S  0.7  0.2   0:06.60 postmaster
 8750 postgres  15   0  187m  10m 7220 S  0.7  0.3   0:07.72 postmaster
 8768 postgres  15   0  187m  10m 7508 S  0.7  0.3   0:07.77 postmaster
 8769 postgres  15   0  187m  10m 7448 S  0.7  0.3   0:07.81 postmaster
 8775 postgres  15   0  187m  10m 7064 S  0.7  0.3   0:07.72 postmaster
 8782 postgres  15   0  187m  10m 7316 S  0.7  0.3   0:07.84 postmaster
13947 postgres  15   0  187m 8500 4780 S  0.7  0.2   0:06.36 postmaster
13949 postgres  15   0  187m 8536 4824 S  0.7  0.2   0:06.36 postmaster
13951 postgres  15   0  187m 8504 4804 S  0.7  0.2   0:06.35 postmaster
14041 postgres  15   0  187m 8548 4828 S  0.7  0.2   0:06.45 postmaster
14046 postgres  15   0  187m 8560 4812 S  0.7  0.2   0:06.39 postmaster
14052 postgres  15   0  187m 8744 5024 S  0.7  0.2   0:06.54 postmaster
14055 postgres  15   0  187m 8580 4868 S  0.7  0.2   0:06.52 postmaster
14061 postgres  15   0  187m 8464 4760 S  0.7  0.2   0:06.45 postmaster
14092 postgres  15   0  187m 8624 4920 S  0.7  0.2   0:06.52 postmaster
16358 postgres  15   0  187m 8284 4596 S  0.7  0.2   0:06.54 postmaster
16367 postgres  15   0  187m 8392 4568 S  0.7  0.2   0:06.24 postmaster
Craig James <craig_james@emolecules.com> writes:

> Now here's the weird thing. I'm running a pg_restore of a database
> (on the order of 4GB compressed, maybe 34M rows of ordinary data, and
> 15M rows in one BLOB table that's typically 2K per blob). When I do
> this, ALL of the postgres backends start working at about 1% CPU
> apiece.

It's not surprising that they'd all start eating some CPU, if that's a schema
restore and not just bulk data loading. Any catalog change is going to
broadcast "shared cache inval" messages that all the backends have to process
to make sure they get rid of any now-obsolete cached catalog information.

> This means that the 120 "idle" postgres backends are together
> using almost 100% of one CPU on top of the 100% CPU being used by
> pg_restore. See the output of top(1) below.

Perhaps you need to try to cut down the number of idle processes ...

I don't think anyone's ever spent any time trying to micro-optimize the
shared cache inval code paths. It could be we could cut your 1% figure some,
if we were willing to put effort into that. But it's not going to go to zero.

			regards, tom lane
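As a starting point for trimming those idle connections, here is an
illustrative query (not from the original posts) for counting idle backends
per database; it assumes stats_command_string is enabled, since otherwise
current_query is not reported:

    -- Count idle backends by database (8.1-era pg_stat_activity columns).
    SELECT datname, count(*) AS idle_backends
      FROM pg_stat_activity
     WHERE current_query = '<IDLE>'
     GROUP BY datname
     ORDER BY idle_backends DESC;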