Thread: 7.3.1 New install, large queries are slow
I am trying to get a PostgreSQL server into production (moving from MSSQL2K) but am having some serious performance issues. PostgreSQL is new to me, and I'm only just now comfortable with Linux. So far I've successfully compiled postgres from source and migrated all the data from MSSQL. Postgres is primarily accessed using JDBC.

I really want to use Postgres for production, but if I can't get better results out of it by the end of the week we are dropping it forever and going back to MSSQL despite the $$$. I'm basically at a point where I've got to find help from the list. Please help me make this server fly!

I have a query that does many joins (including two very big tables) which is slow on Postgres. On PGSQL the query takes 19 minutes, but only 3 seconds on MSSQL. The two servers have the same indexes created (including primary key indexes). I finally gave up on creating all the foreign keys in Postgres - after 12 hours of 100% CPU. It's hard for me to believe that the hardware is the bottleneck - the $20k Postgres server far outclasses the MSSQL server (see below for stats). When I ran EXPLAIN ANALYZE for this query the CPU averaged 5%, sar -b showed about 6,000 block reads/sec, and vmstat showed zero swapping. EXPLAIN results are below; I'm not sure how to interpret them.

The disks are not reading at max speed during the query - when I ran a VACUUM ANALYZE (after data migration), sar -b was consistently 100,000 block reads/sec. It does not seem like the hardware is holding things back here. I read something about 'fsync' recently; would changing that setting apply in this case?

DATABASE:
'tranheader' table has 2000 tuples, PK index
'batchheader' table has 2.6 million tuples, 5 indexes, FK constraint to tranheader PK
'batchdetail' table has 23 million tuples, 6 indexes, FK constraint to batchheader PK
18 tables with <1000 tuples, most are child tables of batchdetail
All tables have a PK and are normalized
Large nightly INSERTs (~200,000 tuples)
Complex SELECTs all day long
No UPDATEs or DELETEs ever, at least until I run low on storage!

I've learned as much as I can absorb from the online docs and archives about performance tuning. Based on my limited understanding, I've changed the following settings. I am totally open to any suggestions, including starting over with RAID, filesystems, or PGSQL. I would almost consider FreeBSD if it helped a lot, but that would be a stretch given my time investment in Linux. This is a brand new machine, so bad hardware is a possibility - but I'm not sure how to go about determining that.

*** /etc/sysctl.conf
kernel.shmmni = 4096
kernel.shmall = 32000000
kernel.shmmax = 512000000

*** /usr/local/pgsql/data/postgresql.conf
tcpip_socket = true
shared_buffers = 32768
max_fsm_relations = 10000
max_fsm_pages = 2000000
sort_mem = 8192

POSTGRESQL SYSTEM:
Red Hat Linux 8.0, PostgreSQL 7.3.1 (dedicated, besides SSH daemon)
Dell PE6600 Dual Xeon MP 2.0GHz, 2MB L3 cache, HyperThreading enabled
4.0 GB Physical RAM
/dev/sda1: ext3 101MB /boot
/dev/sda2: ext3 34GB / (sda is 2 disk RAID-1)
none     : swap 1.8GB
/dev/sdb1: ext3 104GB /usr/local/pgsql/data (sdb is 6 disk RAID-10)
All 8 drives are 36GB, 15k RPM, Ultra160 SCSI
PERC3/DC 128MB RAID controller

MSSQL SYSTEM:
Dell PE1650, Dual P3 1.1GHz, 1.5GB RAM
Single 18GB, 15k RPM SCSI drive (no RAID)
Windows 2000 Server SP3, SQL Server 2000 SP2

TIA,
Roman Fail
Sr. Web Application Developer
POS Portal, Inc.
EXPLAIN ANALYZE RESULTS:

Limit  (cost=370518.31..370518.31 rows=1 width=540) (actual time=1168722.18..1168722.20 rows=5 loops=1)
  ->  Sort  (cost=370518.31..370518.31 rows=1 width=540) (actual time=1168722.18..1168722.18 rows=5 loops=1)
        Sort Key: b.batchdate
    ->  Nested Loop  (cost=314181.17..370518.30 rows=1 width=540) (actual time=1148191.12..1168722.09 rows=5 loops=1)
          Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
      ->  Nested Loop  (cost=314181.17..370461.79 rows=1 width=502) (actual time=1148167.55..1168671.80 rows=5 loops=1)
            Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
        ->  Nested Loop  (cost=314181.17..370429.29 rows=1 width=485) (actual time=1148167.48..1168671.45 rows=5 loops=1)
              Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
          ->  Nested Loop  (cost=314181.17..370396.79 rows=1 width=476) (actual time=1148167.41..1168671.08 rows=5 loops=1)
                Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
            ->  Nested Loop  (cost=314181.17..314402.47 rows=1 width=457) (actual time=1139099.39..1139320.79 rows=5 loops=1)
                  Join Filter: ("outer".cardtypeid = "inner".cardtypeid)
              ->  Merge Join  (cost=314181.17..314401.24 rows=1 width=443) (actual time=1138912.13..1139133.00 rows=5 loops=1)
                    Merge Cond: ("outer".batchid = "inner".batchid)
                ->  Sort  (cost=127418.59..127418.59 rows=3 width=150) (actual time=9681.91..9681.93 rows=17 loops=1)
                      Sort Key: b.batchid
                  ->  Hash Join  (cost=120787.32..127418.56 rows=3 width=150) (actual time=7708.04..9681.83 rows=17 loops=1)
                        Hash Cond: ("outer".merchantid = "inner".merchantid)
                    ->  Merge Join  (cost=120781.58..125994.80 rows=283597 width=72) (actual time=7655.57..9320.49 rows=213387 loops=1)
                          Merge Cond: ("outer".tranheaderid = "inner".tranheaderid)
                      ->  Index Scan using tranheader_ix_tranheaderid_idx on tranheader t  (cost=0.00..121.15 rows=1923 width=16) (actual time=0.15..10.86 rows=1923 loops=1)
                            Filter: (clientid = 6)
                      ->  Sort  (cost=120781.58..121552.88 rows=308520 width=56) (actual time=7611.75..8162.81 rows=329431 loops=1)
                            Sort Key: b.tranheaderid
                        ->  Seq Scan on batchheader b  (cost=0.00..79587.23 rows=308520 width=56) (actual time=0.90..4186.30 rows=329431 loops=1)
                              Filter: (batchdate > '2002-12-15 00:00:00'::timestamp without time zone)
                    ->  Hash  (cost=5.74..5.74 rows=1 width=78) (actual time=31.39..31.39 rows=0 loops=1)
                      ->  Index Scan using merchants_ix_merchid_idx on merchants m  (cost=0.00..5.74 rows=1 width=78) (actual time=31.38..31.38 rows=1 loops=1)
                            Index Cond: (merchid = '701252267'::character varying)
                ->  Sort  (cost=186762.59..186872.62 rows=44010 width=293) (actual time=1127828.96..1128725.39 rows=368681 loops=1)
                      Sort Key: d.batchid
                  ->  Index Scan using batchdetail_ix_tranamount_idx on batchdetail d  (cost=0.00..176768.18 rows=44010 width=293) (actual time=35.48..1104625.54 rows=370307 loops=1)
                        Index Cond: ((tranamount >= 500.0) AND (tranamount <= 700.0))
              ->  Seq Scan on cardtype c  (cost=0.00..1.10 rows=10 width=14) (actual time=37.44..37.47 rows=10 loops=5)
            ->  Seq Scan on purc1 p1  (cost=0.00..44259.70 rows=938770 width=19) (actual time=98.09..4187.32 rows=938770 loops=5)
          ->  Seq Scan on direct dr  (cost=0.00..20.00 rows=1000 width=9) (actual time=0.00..0.00 rows=0 loops=5)
        ->  Seq Scan on carrental cr  (cost=0.00..20.00 rows=1000 width=17) (actual time=0.00..0.00 rows=0 loops=5)
      ->  Seq Scan on checks ck  (cost=0.00..40.67 rows=1267 width=38) (actual time=1.03..7.63 rows=1267 loops=5)
Total runtime: 1168881.12 msec
Roman Fail wrote:
<cut>
EXPLAIN ANALYZE RESULTS:
[full plan snipped - see the previous message]
<cut>

It looks like your execution time is not a hardware problem but a query problem. The query hardly uses indexes at all.
You said that you have a normalized database, so you should have a lot of explicit joins, which work pretty well on PostgreSQL. Can you post some examples of your queries? If that is difficult, at least post one example where you get a "Join Filter" in the EXPLAIN ANALYZE output.

From your analyze result:
Seq Scan on batchheader b  (cost=0.00..79587.23 rows=308520 width=56)

Can you tell us what conditions and indexes batchheader has?

Regards,
Tomasz Myrta
Roman,

First, if this is a dedicated PostgreSQL server, you should try increasing your shared_buffers to at least 512MB (65536) if not 1GB (double that), and adjust your shmmax and shmall to match.

Second, you will probably want to increase your sort_mem as well. How much depends on the number of concurrent queries you expect to be running and their relative complexity. Give me that information, and I'll offer you some suggestions.

Part of your slow query:

Your query problem is hopefully relatively easy. The following clause is 95% of your query time:

> ->  Index Scan using batchdetail_ix_tranamount_idx on batchdetail d  (cost=0.00..176768.18 rows=44010 width=293) (actual time=35.48..1104625.54 rows=370307 loops=1)

See the actual time figures? This one clause is taking 1,104,590 msec! Now, why? Well, look at the cost estimate figures in contrast to the actual row count:

estimate rows = 44,010     real rows = 370,307

That's off by a factor of about 9. This index scan is obviously very cumbersome and is slowing the query down. Probably it should be using a seq scan instead ... my guess is, you haven't run ANALYZE in a while and the incorrect row estimate is causing the planner to choose a very slow index scan. Try running ANALYZE on your database and re-running the query. Also try using REINDEX on batchdetail_ix_tranamount_idx.

Second, this clause near the bottom:

->  Seq Scan on purc1 p1  (cost=0.00..44259.70 rows=938770 width=19) (actual time=98.09..4187.32 rows=938770 loops=5)

... suggests that you could save an additional 4 seconds by figuring out a way for the criteria on purc1 to use a relevant index -- but only after you've solved the problem with batchdetail_ix_tranamount_idx.

Finally, if you really want help, post the query.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
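For concreteness, a rough sketch of the settings and maintenance commands suggested above, assuming PostgreSQL 7.3 with 8 KB pages and the file locations from the original post; the exact numbers are illustrative starting points, not tuned recommendations:

*** /etc/sysctl.conf  (kernel shared memory must be large enough for the new shared_buffers)
kernel.shmmax = 1073741824     # bytes; must exceed shared_buffers * 8 KB
kernel.shmall = 262144         # on Linux this is counted in 4 KB pages

*** /usr/local/pgsql/data/postgresql.conf
shared_buffers = 65536         # 65536 * 8 KB = 512 MB
sort_mem = 16384               # KB per sort, per backend - raise cautiously

-- in psql, after restarting the postmaster:
ANALYZE;                                      -- refresh planner statistics
REINDEX INDEX batchdetail_ix_tranamount_idx;  -- rebuild the suspect index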
On Wed, 15 Jan 2003, Roman Fail wrote:

> The disks are not reading at max speed during the query - when I ran a
> VACUUM ANALYZE (after data migration), sar -b was consistently 100,000
> block reads/sec. It does not seem like the hardware is holding back
> things here. I read something about 'fsync' recently, would changing
> that setting apply in this case?

You ran vacuum analyze, but some of the explain still looks suspiciously like it's using default statistics (dr and cr for example, unless they really do have 1000 rows). What are the actual query and table definitions for the query?
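A quick sketch of how to check whether those small tables really have statistics, rather than the 1000-row default the planner assumes for never-analyzed tables (reltuples/relpages in pg_class are only refreshed by VACUUM or ANALYZE):

-- gather fresh statistics on the suspect tables
ANALYZE direct;
ANALYZE carrental;

-- then see what the planner now believes about them
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('direct', 'carrental', 'checks');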
I didn't see the query itself in the message, but it looks to me like it's poorly formed. Could you send it?

At a quick glance, either you're using a bunch of explicit joins that are poorly formed (you've made a bad choice of join order) or those particular IDs are really popular. There are a number of sequential scans that possibly should be index scans.

> EXPLAIN ANALYZE RESULTS:
> Limit  (cost=370518.31..370518.31 rows=1 width=540) (actual time=1168722.18..1168722.20 rows=5 loops=1)
>   ->  Sort  (cost=370518.31..370518.31 rows=1 width=540) (actual time=1168722.18..1168722.18 rows=5 loops=1)
>         Sort Key: b.batchdate

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
On Wed, Jan 15, 2003 at 10:00:04AM -0800, Roman Fail wrote:

> I have a query that does many joins (including two very big tables)
> which is slow on Postgres. On PGSQL the query takes 19 minutes,

There are three things I can think of right off the bat.

First, the performance of foreign keys is flat-out awful in Postgres. I suggest avoiding them if you can.

Second, ordering joins explicitly (with the JOIN keyword) constrains the planner, and may select a bad plan. The explain analyse output was nice, but I didn't see the query, so I can't tell what the plan maybe ought to be.

Third, I didn't see any suggestion that you'd moved the WAL onto its own disk. That will mostly help when you are under write load; I guess it's not a problem here, but it's worth keeping in mind.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
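A sketch of what moving the WAL to its own spindle would look like on this box, assuming the data directory from the original post and a hypothetical dedicated filesystem mounted at /wal; the postmaster must be stopped first:

# stop the postmaster
pg_ctl -D /usr/local/pgsql/data stop

# relocate pg_xlog and leave a symlink in its place
mv /usr/local/pgsql/data/pg_xlog /wal/pg_xlog
ln -s /wal/pg_xlog /usr/local/pgsql/data/pg_xlog

pg_ctl -D /usr/local/pgsql/data start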
On Wed, 15.01.2003 at 23:00, Roman Fail wrote:

> I am trying to get a PostgreSQL server into production (moving from MSSQL2K) but having some serious performance
> issues. PostgreSQL is new to me, and I'm only just now comfortable with Linux. So far I've successfully compiled
> postgres from source and migrated all the data from MSSQL. Postgres is primarily accessed using JDBC.
>
> I really want to use Postgres for production, but if I can't get better results out of it by the end of the week we
> are dropping it forever and going back to MSSQL despite the $$$. I'm basically at a point where I've got to find
> help from the list. Please help me make this server fly!
>
> I have a query that does many joins (including two very big tables) which is slow on Postgres. On PGSQL the query
> takes 19 minutes, but only 3 seconds on MSSQL. The two servers have the same indexes created (including primary key
> indexes). I finally gave up on creating all the foreign keys in Postgres - after 12 hours of 100% CPU. It's hard
> for me to believe that the hardware is the bottleneck - the $20k Postgres server far outclasses the MSSQL server
> (see below for stats). When I ran EXPLAIN ANALYZE for this query the CPU averaged 5%, sar -b shows about 6,000
> block reads/sec, and vmstat had zero swapping. EXPLAIN results are below, I'm not sure how to interpret them.

Two questions:

1) Have you run analyze on this database (after loading the data)?

2) Could you also post the actual query - it would make interpreting the EXPLAIN ANALYZE results easier.

--
Hannu Krosing <hannu@tm.ee>
Thanks to everyone for the quick replies! I'm sure that my lack of skill with SQL queries is the main problem. What's strange to me is how MSSQL takes my bad queries and makes them look good anyway. It must have a real smart planner.

Several changes: shared_buffers = 131072, sort_mem = 32768, shmmax = 2097152000, shmall = 131072000. I couldn't find any info out there on the relationship between shmmax and shmall, so I just preserved the ratio from the RedHat defaults (1:16). As far as sort_mem goes, I expect to be running no more than 3 concurrent queries and they will all be just as complex as this one. Do you think sort_mem = 32768 is a reasonable size? None of these changes seemed to help speed things up, however.

REINDEX INDEX batchdetail_ix_tranamount_idx; was executed successfully, although it took 15 minutes.

ANALYZE executed in 2 minutes, even though I increased default_statistics_target = 30. Should I increase it even more? I don't mind the extra overhead each night if it will make my queries faster. (Idiot check: I did actually stop and start the postmaster after changing all these settings.)

Andrew Sullivan wrote:
> First, the performance of foreign keys is flat-out awful in Postgres.
> I suggest avoiding them if you can.

I don't have any problem getting rid of FKs, especially if it might actually help performance. The nightly data import is well-defined and should always observe referential integrity, so I guess the db doesn't really need to enforce it. In MSSQL, adding FKs was supposed to actually benefit SELECT performance. Is it pretty much universally accepted that I should drop all my foreign keys?

> Second, ordering joins explicitly (with the JOIN keyword) constrains
> the planner, and may select bad plan. The explain analyse output
> was nice, but I didn't see the query, so I can't tell what the plan
> maybe ought to be.

I think this is the most likely problem. I've read through Chapter 10 of the 7.3 docs, but I still don't feel like I know what would be a good order. How do you learn this stuff anyway? Trial and error?

> Third, I didn't see any suggestion that you'd moved the WAL onto its
> own disk. That will mostly help when you are under write load;

I don't think I'm going to bother with moving the WAL....the write load during the day is very, very light (when queries are run). Disk I/O is clearly not the limiting factor (yet!).

So here's the query, and another EXPLAIN ANALYZE to go with it (executed after all setting changes). The same result columns and JOINs are performed all day with variations on the WHERE clause; other possible search columns are the ones that are indexed (see below). The 4 tables that use LEFT JOIN only sometimes have matching records, hence the OUTER join.
EXPLAIN ANALYZE
SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
       d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
       m.name AS merchantname, c.cardtype, m.merchid,
       p1.localtaxamount, p1.productidentifier, dr.avsresponse,
       cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
       ck.abaroutingno, ck.checkno
FROM tranheader t
INNER JOIN batchheader b ON t.tranheaderid = b.tranheaderid
INNER JOIN merchants m ON m.merchantid = b.merchantid
INNER JOIN batchdetail d ON d.batchid = b.batchid
INNER JOIN cardtype c ON d.cardtypeid = c.cardtypeid
LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
WHERE t.clientid = 6
  AND d.tranamount BETWEEN 500.0 AND 700.0
  AND b.batchdate > '2002-12-15'
  AND m.merchid = '701252267'
ORDER BY b.batchdate DESC
LIMIT 50

Limit  (cost=1829972.39..1829972.39 rows=1 width=285) (actual time=1556497.79..1556497.80 rows=5 loops=1)
  ->  Sort  (cost=1829972.39..1829972.39 rows=1 width=285) (actual time=1556497.78..1556497.79 rows=5 loops=1)
        Sort Key: b.batchdate
    ->  Nested Loop  (cost=1771874.32..1829972.38 rows=1 width=285) (actual time=1538783.03..1556486.64 rows=5 loops=1)
          Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
      ->  Nested Loop  (cost=1771874.32..1829915.87 rows=1 width=247) (actual time=1538760.60..1556439.67 rows=5 loops=1)
            Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
        ->  Nested Loop  (cost=1771874.32..1829915.86 rows=1 width=230) (actual time=1538760.55..1556439.50 rows=5 loops=1)
              Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
          ->  Nested Loop  (cost=1771874.32..1829915.85 rows=1 width=221) (actual time=1538760.51..1556439.31 rows=5 loops=1)
                Join Filter: ("inner".batchdetailid = "outer".batchdetailid)
            ->  Nested Loop  (cost=1771874.32..1773863.81 rows=1 width=202) (actual time=1529153.84..1529329.65 rows=5 loops=1)
                  Join Filter: ("outer".cardtypeid = "inner".cardtypeid)
              ->  Merge Join  (cost=1771874.32..1773862.58 rows=1 width=188) (actual time=1529142.55..1529317.99 rows=5 loops=1)
                    Merge Cond: ("outer".batchid = "inner".batchid)
                ->  Sort  (cost=116058.42..116058.43 rows=3 width=118) (actual time=14184.11..14184.14 rows=17 loops=1)
                      Sort Key: b.batchid
                  ->  Hash Join  (cost=109143.44..116058.39 rows=3 width=118) (actual time=12398.29..14184.03 rows=17 loops=1)
                        Hash Cond: ("outer".merchantid = "inner".merchantid)
                    ->  Merge Join  (cost=109137.81..114572.94 rows=295957 width=40) (actual time=12359.75..13848.67 rows=213387 loops=1)
                          Merge Cond: ("outer".tranheaderid = "inner".tranheaderid)
                      ->  Index Scan using tranheader_ix_tranheaderid_idx on tranheader t  (cost=0.00..121.15 rows=1923 width=16) (actual time=0.17..10.91 rows=1923 loops=1)
                            Filter: (clientid = 6)
                      ->  Sort  (cost=109137.81..109942.73 rows=321966 width=24) (actual time=12317.83..12848.43 rows=329431 loops=1)
                            Sort Key: b.tranheaderid
                        ->  Seq Scan on batchheader b  (cost=0.00..79683.44 rows=321966 width=24) (actual time=29.93..10422.75 rows=329431 loops=1)
                              Filter: (batchdate > '2002-12-15 00:00:00'::timestamp without time zone)
                    ->  Hash  (cost=5.63..5.63 rows=1 width=78) (actual time=21.06..21.06 rows=0 loops=1)
                      ->  Index Scan using merchants_ix_merchid_idx on merchants m  (cost=0.00..5.63 rows=1 width=78) (actual time=21.05..21.05 rows=1 loops=1)
                            Index Cond: (merchid = '701252267'::character varying)
                ->  Sort  (cost=1655815.90..1656810.15 rows=397698 width=70) (actual time=1513860.73..1514497.92 rows=368681 loops=1)
                      Sort Key: d.batchid
                  ->  Index Scan using batchdetail_ix_tranamount_idx on batchdetail d  (cost=0.00..1597522.38 rows=397698 width=70) (actual time=14.05..1505397.17 rows=370307 loops=1)
                        Index Cond: ((tranamount >= 500.0) AND (tranamount <= 700.0))
              ->  Seq Scan on cardtype c  (cost=0.00..1.10 rows=10 width=14) (actual time=2.25..2.28 rows=10 loops=5)
            ->  Seq Scan on purc1 p1  (cost=0.00..44285.35 rows=941335 width=19) (actual time=2.40..3812.43 rows=938770 loops=5)
          ->  Seq Scan on direct dr  (cost=0.00..0.00 rows=1 width=9) (actual time=0.00..0.00 rows=0 loops=5)
        ->  Seq Scan on carrental cr  (cost=0.00..0.00 rows=1 width=17) (actual time=0.00..0.00 rows=0 loops=5)
      ->  Seq Scan on checks ck  (cost=0.00..40.67 rows=1267 width=38) (actual time=0.50..7.05 rows=1267 loops=5)
Total runtime: 1556553.76 msec

Tomasz Myrta wrote:
> Seq Scan on batchheader b  (cost=0.00..79587.23 rows=308520 width=56)
> Can you write what condition and indexes does batchheader have?

batchheader has 2.6 million records:

CREATE TABLE public.batchheader (
  batchid int8 DEFAULT nextval('"batchheader_batchid_key"'::text) NOT NULL,
  line int4,
  tranheaderid int4,
  merchantid int4,
  batchdate timestamp,
  merchref char(16),
  carryindicator char(1),
  assocno varchar(6),
  merchbankno char(4),
  debitcredit char(1),
  achpostdate timestamp,
  trancode char(4),
  netdeposit numeric(18, 4),
  CONSTRAINT batchheader_ix_batchid_idx UNIQUE (batchid),
  CONSTRAINT batchheader_pkey PRIMARY KEY (batchid),
  CONSTRAINT fk_bh_th FOREIGN KEY (tranheaderid) REFERENCES tranheader (tranheaderid)
    ON DELETE RESTRICT ON UPDATE NO ACTION NOT DEFERRABLE INITIALLY IMMEDIATE
) WITH OIDS;

CREATE UNIQUE INDEX batchheader_ix_batchid_idx ON batchheader USING btree (batchid);
CREATE INDEX batchheader_ix_batchdate_idx ON batchheader USING btree (batchdate);
CREATE INDEX batchheader_ix_merchantid_idx ON batchheader USING btree (merchantid);
CREATE INDEX batchheader_ix_merchref_idx ON batchheader USING btree (merchref);
CREATE INDEX batchheader_ix_netdeposit_idx ON batchheader USING btree (netdeposit);

And here's batchdetail too, just for kicks. 23 million records.
CREATE TABLE public.batchdetail (
  batchdetailid int8 DEFAULT nextval('public.batchdetail_batchdetailid_seq'::text) NOT NULL,
  line int4,
  batchid int4,
  merchno varchar(16),
  assocno varchar(6),
  refnumber char(23),
  trandate timestamp,
  tranamount numeric(18, 4),
  netdeposit numeric(18, 4),
  cardnocfb bytea,
  bestinterchange char(2),
  submitinterchange char(2),
  downgrader1 char(4),
  downgrader2 char(4),
  downgrader3_1 char(1),
  downgrader3_2 char(1),
  downgrader3_3 char(1),
  downgrader3_4 char(1),
  downgrader3_5 char(1),
  downgrader3_6 char(1),
  downgrader3_7 char(1),
  onlineentry char(1),
  achflag char(1),
  authsource char(1),
  cardholderidmeth char(1),
  catindicator char(1),
  reimbattribute char(1),
  motoindicator char(1),
  authcharind char(1),
  banknetrefno char(9),
  banknetauthdate char(6),
  draftaflag char(1),
  authcurrencycode char(3),
  authamount numeric(18, 4),
  validcode char(4),
  authresponsecode char(2),
  debitnetworkid char(3),
  switchsetindicator char(1),
  posentrymode char(2),
  debitcredit char(1),
  reversalflag char(1),
  merchantname varchar(25),
  authno char(6),
  rejectreason char(4),
  cardtypeid int4,
  currencycode char(3),
  origtranamount numeric(18, 4),
  foreigncard char(1),
  carryover char(1),
  extensionrecord char(2),
  mcccode char(4),
  terminalid char(8),
  submitinterchange3b char(3),
  purchaseid varchar(25),
  trancode char(4),
  CONSTRAINT batchdetail_pkey PRIMARY KEY (batchdetailid)
) WITH OIDS;

CREATE INDEX batchdetail_ix_authno_idx ON batchdetail USING btree (authno);
CREATE INDEX batchdetail_ix_batchdetailid_idx ON batchdetail USING btree (batchdetailid);
CREATE INDEX batchdetail_ix_cardnocfb_idx ON batchdetail USING btree (cardnocfb);
CREATE INDEX batchdetail_ix_posentrymode_idx ON batchdetail USING btree (posentrymode);
CREATE INDEX batchdetail_ix_submitinterchange3b_idx ON batchdetail USING btree (submitinterchange3b);
CREATE INDEX batchdetail_ix_tranamount_idx ON batchdetail USING btree (tranamount);

Roman Fail
Sr. Web Application Developer
POS Portal, Inc.
Sacramento, CA
Roman Fail wrote:
> I am trying to get a PostgreSQL server into production (moving from
> MSSQL2K) but having some serious performance issues. PostgreSQL is
> new to me, and I'm only just now comfortable with Linux. So far
> I've successfully compiled postgres from source and migrated all the
> data from MSSQL. Postgres is primarily accessed using JDBC.

[...]

> POSTGRESQL SYSTEM:
> Red Hat Linux 8.0, PostgreSQL 7.3.1 (dedicated, besides SSH daemon)
> Dell PE6600 Dual Xeon MP 2.0GHz, 2MB L3 cache, HyperThreading enabled
> 4.0 GB Physical RAM
> /dev/sda1: ext3 101MB /boot
> /dev/sda2: ext3 34GB / (sda is 2 disk RAID-1)
> none     : swap 1.8GB
> /dev/sdb1: ext3 104GB /usr/local/pgsql/data (sdb is 6 disk RAID-10)
> All 8 drives are 36GB, 15k RPM, Ultra160 SCSI
> PERC3/DC 128MB RAID controller

Ext3, huh? Ext3 is a journalling filesystem that is capable of journalling data as well as metadata. But if you mount it such that it journals data, writes will be significantly slower. The default for ext3 is to do ordered writes: data is written before the associated metadata transaction commits, but the data itself isn't journalled.

But because PostgreSQL synchronously writes the transaction log (using fsync() by default, if I'm not mistaken) and uses sync() during a savepoint, I would think that ordered writes at the filesystem level would probably buy you very little in the way of additional data integrity in the event of a crash. So if I'm right about that, then you might consider using the "data=writeback" option to ext3 on the /usr/local/pgsql/data filesystem. I'd recommend the default ("data=ordered") for everything else.

That said, I doubt the above change will make the orders of magnitude difference you're looking for. But every little bit helps... You might also consider experimenting with different filesystems, but others here may be able to chime in with better information on that.

People, please correct me if I'm wrong in my analysis of PostgreSQL on ext3 above. If the database on an ext3 filesystem mounted in writeback mode is subject to corruption upon a crash despite the efforts PostgreSQL makes to keep things sane, then writeback mode shouldn't be used! And clearly it shouldn't be used if it doesn't make a significant performance difference.

--
Kevin Brown                          kevin@sysexperts.com
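For reference, a sketch of the mount option being discussed, assuming /dev/sdb1 is the data filesystem as in the original post; it only applies if you accept the weaker data-journalling guarantees described above, and it requires a remount:

# /etc/fstab - journal metadata only on the PostgreSQL data filesystem
/dev/sdb1   /usr/local/pgsql/data   ext3   defaults,data=writeback   1 2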
> So here's the query, and another EXPLAIN ANALYZE to go with it
> (executed after all setting changes). The same result columns and
> JOINS are performed all day with variations on the WHERE clause; other
> possible search columns are the ones that are indexed (see below).
> The 4 tables that use LEFT JOIN only sometimes have matching records,
> hence the OUTER join.
>
> EXPLAIN ANALYZE
> SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
> d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
> m.name AS merchantname, c.cardtype, m.merchid,
> p1.localtaxamount, p1.productidentifier, dr.avsresponse,
> cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
> ck.abaroutingno, ck.checkno
> FROM tranheader t
> INNER JOIN batchheader b ON t.tranheaderid = b.tranheaderid
> INNER JOIN merchants m ON m.merchantid = b.merchantid
> INNER JOIN batchdetail d ON d.batchid = b.batchid
> INNER JOIN cardtype c ON d.cardtypeid = c.cardtypeid
> LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
> LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
> LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
> LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
> WHERE t.clientid = 6
> AND d.tranamount BETWEEN 500.0 AND 700.0
> AND b.batchdate > '2002-12-15'
> AND m.merchid = '701252267'
> ORDER BY b.batchdate DESC
> LIMIT 50

Well, you might get a little help by replacing the FROM with something like:

FROM tranheader t, batchheader b, merchants m, cardtype c,
 batchdetail d
 LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
 LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
 LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
 LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid

and adding

 AND t.tranheaderid = b.tranheaderid
 AND m.merchantid = b.merchantid
 AND d.batchid = b.batchid
 AND c.cardtypeid = d.cardtypeid

to the WHERE conditions. That should at least allow it to do some small reordering of the joins. I don't think that alone is going to do much, since most of the time seems to be spent on the scan of d.

What does vacuum verbose batchdetail give you (it'll give an idea of pages anyway)?
On Wed, 15 Jan 2003, Roman Fail wrote:

> Thanks to everyone for the quick replies! I'm sure that my lack of
> skill with SQL queries is the main problem. What's strange to me is
> how MSSQL takes my bad queries and makes them look good anyway. It
> must have a real smart planner.

As a followup, if you do
set enable_indexscan=off;
before running the explain analyze, what does that give you?
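A sketch of that experiment in a psql session; the setting affects only the current session, so it is a safe test (shown here against the expensive tranamount range scan from the query):

SET enable_indexscan = off;
EXPLAIN ANALYZE
  SELECT * FROM batchdetail WHERE tranamount BETWEEN 500.0 AND 700.0;
SET enable_indexscan = on;   -- restore the default for this session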
"Roman Fail" <rfail@posportal.com> writes: > Thanks to everyone for the quick replies! I'm sure that my lack of > skill with SQL queries is the main problem. What's strange to me is > how MSSQL takes my bad queries and makes them look good anyway. It > must have a real smart planner. I think more likely the issue is that your use of JOIN syntax is forcing Postgres into a bad plan. MSSQL probably doesn't assign any semantic significance to the use of "a JOIN b" syntax as opposed to "FROM a, b" syntax. Postgres does. Whether this is a bug or a feature depends on your point of view --- but there are folks out there who find it to be a life-saver. You can find some explanations at http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/explicit-joins.html > Is it pretty much universally accepted that I should drop all my > foreign keys? No. They don't have any effect on SELECT performance in Postgres. They will impact update speed, but that's not your complaint (at the moment). Don't throw away data integrity protection until you know you need to. regards, tom lane
Tom, Roman,

> I think more likely the issue is that your use of JOIN syntax is
> forcing Postgres into a bad plan. MSSQL probably doesn't assign any
> semantic significance to the use of "a JOIN b" syntax as opposed to
> "FROM a, b" syntax.

That's correct. MSSQL will reorder equijoins, even when explicitly declared.

Hey, Roman, how many records in BatchDetail, anyway?

Josh Berkus
Tom Lane wrote:
> "Roman Fail" <rfail@posportal.com> writes:
> > Thanks to everyone for the quick replies! I'm sure that my lack of
> > skill with SQL queries is the main problem. What's strange to me is
> > how MSSQL takes my bad queries and makes them look good anyway. It
> > must have a real smart planner.
>
> I think more likely the issue is that your use of JOIN syntax is forcing
> Postgres into a bad plan. MSSQL probably doesn't assign any semantic
> significance to the use of "a JOIN b" syntax as opposed to "FROM a, b"
> syntax. Postgres does. Whether this is a bug or a feature depends on
> your point of view --- but there are folks out there who find it to be
> a life-saver.

Since it *does* depend on one's point of view, would it be possible to have control over this implemented in a session-defined variable (with the default in the GUC, of course)? I wouldn't be surprised if a lot of people get bitten by this.

--
Kevin Brown                          kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> ... Whether this is a bug or a feature depends on
>> your point of view --- but there are folks out there who find it to be
>> a life-saver.
>
> Since it *does* depend on one's point of view, would it be possible to
> have control over this implemented in a session-defined variable (with
> the default in the GUC, of course)?

I have no objection to doing that --- anyone care to contribute code to make it happen? (I think the trick would be to fold plain-JOIN jointree entries into FROM-list items in planner.c, somewhere near the code that hoists sub-SELECTs into the main join tree. But I haven't tried it, and have no time to in the near future.)

			regards, tom lane
***********************
Josh Berkus wrote:
> Hey, Roman, how many records in BatchDetail, anyway?

23 million.

***********************
Stephan Szabo wrote:
> What does vacuum verbose batchdetail give you (it'll give an idea of pages anyway)

trans=# VACUUM VERBOSE batchdetail;
INFO:  --Relation public.batchdetail--
INFO:  Pages 1669047: Changed 0, Empty 0; Tup 23316674: Vac 0, Keep 0, UnUsed 0.
       Total CPU 85.36s/9.38u sec elapsed 253.38 sec.
INFO:  --Relation pg_toast.pg_toast_8604247--
INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
       Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
trans=#

***********************
At Stephan Szabo and Tom Lane's suggestion, I reorganized the query so the JOIN syntax was only used in the outer joins. This did not seem to help at all. Of note: during this query 'sar -b' showed a consistent 6,000 blocks read/sec, and CPU was about 2%.

EXPLAIN ANALYZE
SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
       d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
       m.name AS merchantname, c.cardtype, m.merchid,
       p1.localtaxamount, p1.productidentifier, dr.avsresponse,
       cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
       ck.abaroutingno, ck.checkno
FROM tranheader t, batchheader b, merchants m, cardtype c, batchdetail d
LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
WHERE t.tranheaderid = b.tranheaderid
  AND m.merchantid = b.merchantid
  AND d.batchid = b.batchid
  AND c.cardtypeid = d.cardtypeid
  AND t.clientid = 6
  AND d.tranamount BETWEEN 500.0 AND 700.0
  AND b.batchdate > '2002-12-15'
  AND m.merchid = '701252267'
ORDER BY b.batchdate DESC
LIMIT 50

Limit  (cost=1789105.21..1789105.22 rows=1 width=285) (actual time=1222029.59..1222029.61 rows=5 loops=1)
  ->  Sort  (cost=1789105.21..1789105.22 rows=1 width=285) (actual time=1222029.58..1222029.59 rows=5 loops=1)
        Sort Key: b.batchdate
    ->  Nested Loop  (cost=1787171.22..1789105.20 rows=1 width=285) (actual time=1221815.14..1222019.46 rows=5 loops=1)
          Join Filter: ("inner".tranheaderid = "outer".tranheaderid)
      ->  Nested Loop  (cost=1787171.22..1789026.02 rows=1 width=269) (actual time=1221809.33..1221978.62 rows=5 loops=1)
            Join Filter: ("inner".cardtypeid = "outer".cardtypeid)
        ->  Merge Join  (cost=1787171.22..1789024.79 rows=1 width=255) (actual time=1221802.47..1221971.48 rows=5 loops=1)
              Merge Cond: ("outer".batchid = "inner".batchid)
          ->  Sort  (cost=476.17..476.18 rows=4 width=102) (actual time=678.05..678.07 rows=17 loops=1)
                Sort Key: b.batchid
            ->  Nested Loop  (cost=0.00..476.14 rows=4 width=102) (actual time=161.62..677.95 rows=17 loops=1)
              ->  Index Scan using merchants_ix_merchid_idx on merchants m  (cost=0.00..5.65 rows=1 width=78) (actual time=13.87..13.88 rows=1 loops=1)
                    Index Cond: (merchid = '701252267'::character varying)
              ->  Index Scan using batchheader_ix_merchantid_idx on batchheader b  (cost=0.00..470.30 rows=15 width=24) (actual time=147.72..663.94 rows=17 loops=1)
                    Index Cond: ("outer".merchantid = b.merchantid)
                    Filter: (batchdate > '2002-12-15 00:00:00'::timestamp without time zone)
          ->  Sort  (cost=1786695.05..1787621.82 rows=370710 width=153) (actual time=1220080.34..1220722.19 rows=368681 loops=1)
                Sort Key: d.batchid
            ->  Merge Join  (cost=1704191.25..1713674.49 rows=370710 width=153) (actual time=1200184.91..1213352.77 rows=370307 loops=1)
                  Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
              ->  Merge Join  (cost=1704085.28..1712678.33 rows=370710 width=115) (actual time=1199705.71..1210336.37 rows=370307 loops=1)
                    Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                ->  Merge Join  (cost=1704085.27..1711751.54 rows=370710 width=98) (actual time=1199705.65..1208122.73 rows=370307 loops=1)
                      Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                  ->  Merge Join  (cost=1704085.26..1710824.75 rows=370710 width=89) (actual time=1199705.55..1205977.76 rows=370307 loops=1)
                        Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                    ->  Sort  (cost=1543119.01..1544045.79 rows=370710 width=70) (actual time=1181172.79..1181902.77 rows=370307 loops=1)
                          Sort Key: d.batchdetailid
                      ->  Index Scan using batchdetail_ix_tranamount_idx on batchdetail d  (cost=0.00..1489103.46 rows=370710 width=70) (actual time=14.45..1176074.90 rows=370307 loops=1)
                            Index Cond: ((tranamount >= 500.0) AND (tranamount <= 700.0))
                    ->  Sort  (cost=160966.25..163319.59 rows=941335 width=19) (actual time=18532.70..20074.09 rows=938770 loops=1)
                          Sort Key: p1.batchdetailid
                      ->  Seq Scan on purc1 p1  (cost=0.00..44285.35 rows=941335 width=19) (actual time=9.44..9119.83 rows=938770 loops=1)
                  ->  Sort  (cost=0.01..0.02 rows=1 width=9) (actual time=0.08..0.08 rows=0 loops=1)
                        Sort Key: dr.batchdetailid
                    ->  Seq Scan on direct dr  (cost=0.00..0.00 rows=1 width=9) (actual time=0.01..0.01 rows=0 loops=1)
                ->  Sort  (cost=0.01..0.02 rows=1 width=17) (actual time=0.04..0.04 rows=0 loops=1)
                      Sort Key: cr.batchdetailid
                  ->  Seq Scan on carrental cr  (cost=0.00..0.00 rows=1 width=17) (actual time=0.00..0.00 rows=0 loops=1)
              ->  Sort  (cost=105.97..109.13 rows=1267 width=38) (actual time=479.17..480.74 rows=1267 loops=1)
                    Sort Key: ck.batchdetailid
                ->  Seq Scan on checks ck  (cost=0.00..40.67 rows=1267 width=38) (actual time=447.88..475.60 rows=1267 loops=1)
        ->  Seq Scan on cardtype c  (cost=0.00..1.10 rows=10 width=14) (actual time=1.37..1.39 rows=10 loops=5)
      ->  Seq Scan on tranheader t  (cost=0.00..55.15 rows=1923 width=16) (actual time=0.01..5.14 rows=1923 loops=5)
            Filter: (clientid = 6)
Total runtime: 1222157.28 msec

***********************
Just to see what would happen, I executed:
ALTER TABLE batchdetail ALTER COLUMN tranamount SET STATISTICS 1000;
ANALYZE;
It seemed to hurt performance if anything. But the EXPLAIN estimate for rows was much closer to the real value than it was previously.

***********************
It seems to me that the big, big isolated problem is the index scan on batchdetail.tranamount. During this small query, 'sar -b' showed a consistent 90,000 block reads/sec (contrast with only 6,000 during the larger query's index scan). 'top' shows the CPU at 20% user, 30% system the whole time (contrast with 2% total in the larger query above). The results here still seem pretty bad (although not as bad as above), but I still don't know what the bottleneck is. And the strange sar stats are confusing me.

EXPLAIN ANALYZE SELECT * FROM batchdetail WHERE tranamount BETWEEN 300 AND 499;
Seq Scan on batchdetail  (cost=0.00..2018797.11 rows=783291 width=440) (actual time=45.66..283926.58 rows=783687 loops=1)
  Filter: ((tranamount >= 300::numeric) AND (tranamount <= 499::numeric))
Total runtime: 285032.47 msec

***********************
Stephan Szabo wrote:
> As a followup, if you do set enable_indexscan=off;
> before running the explain analyze, what does that give you?

Now this is very interesting: 'sar -b' shows about 95,000 block reads/sec; CPU is at 20% user, 30% system; vmstat shows no swapping; and the query takes only 5 minutes to execute (which is one-quarter of the time WITH the index scan!!!!).
Obviously the execution plan is pretty different on this one (the query is identical to the larger one above).

EXPLAIN ANALYZE
SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
       d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
       m.name AS merchantname, c.cardtype, m.merchid,
       p1.localtaxamount, p1.productidentifier, dr.avsresponse,
       cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
       ck.abaroutingno, ck.checkno
FROM tranheader t, batchheader b, merchants m, cardtype c, batchdetail d
LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
WHERE t.tranheaderid = b.tranheaderid
  AND m.merchantid = b.merchantid
  AND d.batchid = b.batchid
  AND c.cardtypeid = d.cardtypeid
  AND t.clientid = 6
  AND d.tranamount BETWEEN 500.0 AND 700.0
  AND b.batchdate > '2002-12-15'
  AND m.merchid = '701252267'
ORDER BY b.batchdate DESC
LIMIT 50

Limit  (cost=2321460.56..2321460.57 rows=1 width=285) (actual time=308194.57..308194.59 rows=5 loops=1)
  ->  Sort  (cost=2321460.56..2321460.57 rows=1 width=285) (actual time=308194.57..308194.58 rows=5 loops=1)
        Sort Key: b.batchdate
    ->  Nested Loop  (cost=2319526.57..2321460.55 rows=1 width=285) (actual time=307988.56..308194.46 rows=5 loops=1)
          Join Filter: ("inner".tranheaderid = "outer".tranheaderid)
      ->  Nested Loop  (cost=2319526.57..2321381.37 rows=1 width=269) (actual time=307982.80..308153.22 rows=5 loops=1)
            Join Filter: ("inner".cardtypeid = "outer".cardtypeid)
        ->  Merge Join  (cost=2319526.57..2321380.14 rows=1 width=255) (actual time=307982.69..308152.82 rows=5 loops=1)
              Merge Cond: ("outer".batchid = "inner".batchid)
          ->  Sort  (cost=2316388.70..2317315.47 rows=370710 width=153) (actual time=305976.74..306622.88 rows=368681 loops=1)
                Sort Key: d.batchid
            ->  Merge Join  (cost=2233884.90..2243368.15 rows=370710 width=153) (actual time=286452.12..299485.43 rows=370307 loops=1)
                  Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
              ->  Merge Join  (cost=2233778.93..2242371.98 rows=370710 width=115) (actual time=286428.77..296939.66 rows=370307 loops=1)
                    Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                ->  Merge Join  (cost=2233778.92..2241445.19 rows=370710 width=98) (actual time=286428.72..294750.01 rows=370307 loops=1)
                      Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                  ->  Merge Join  (cost=2233778.91..2240518.40 rows=370710 width=89) (actual time=286428.60..292606.56 rows=370307 loops=1)
                        Merge Cond: ("outer".batchdetailid = "inner".batchdetailid)
                    ->  Sort  (cost=2072812.66..2073739.44 rows=370710 width=70) (actual time=269738.34..270470.83 rows=370307 loops=1)
                          Sort Key: d.batchdetailid
                      ->  Seq Scan on batchdetail d  (cost=0.00..2018797.11 rows=370710 width=70) (actual time=41.66..266568.83 rows=370307 loops=1)
                            Filter: ((tranamount >= 500.0) AND (tranamount <= 700.0))
                    ->  Sort  (cost=160966.25..163319.59 rows=941335 width=19) (actual time=16690.20..18202.65 rows=938770 loops=1)
                          Sort Key: p1.batchdetailid
                      ->  Seq Scan on purc1 p1  (cost=0.00..44285.35 rows=941335 width=19) (actual time=6.88..7779.31 rows=938770 loops=1)
                  ->  Sort  (cost=0.01..0.02 rows=1 width=9) (actual time=0.10..0.10 rows=0 loops=1)
                        Sort Key: dr.batchdetailid
                    ->  Seq Scan on direct dr  (cost=0.00..0.00 rows=1 width=9) (actual time=0.00..0.00 rows=0 loops=1)
                ->  Sort  (cost=0.01..0.02 rows=1 width=17) (actual time=0.03..0.03 rows=0 loops=1)
                      Sort Key: cr.batchdetailid
                  ->  Seq Scan on carrental cr  (cost=0.00..0.00 rows=1 width=17) (actual time=0.00..0.00 rows=0 loops=1)
              ->  Sort  (cost=105.97..109.13 rows=1267 width=38) (actual time=23.32..24.89 rows=1267 loops=1)
                    Sort Key: ck.batchdetailid
                ->  Seq Scan on checks ck  (cost=0.00..40.67 rows=1267 width=38) (actual time=6.51..19.59 rows=1267 loops=1)
          ->  Sort  (cost=3137.87..3137.88 rows=4 width=102) (actual time=954.18..954.20 rows=19 loops=1)
                Sort Key: b.batchid
            ->  Nested Loop  (cost=0.00..3137.84 rows=4 width=102) (actual time=236.26..954.04 rows=17 loops=1)
              ->  Seq Scan on merchants m  (cost=0.00..2667.35 rows=1 width=78) (actual time=2.48..227.71 rows=1 loops=1)
                    Filter: (merchid = '701252267'::character varying)
              ->  Index Scan using batchheader_ix_merchantid_idx on batchheader b  (cost=0.00..470.30 rows=15 width=24) (actual time=233.75..726.22 rows=17 loops=1)
                    Index Cond: ("outer".merchantid = b.merchantid)
                    Filter: (batchdate > '2002-12-15 00:00:00'::timestamp without time zone)
        ->  Seq Scan on cardtype c  (cost=0.00..1.10 rows=10 width=14) (actual time=0.02..0.04 rows=10 loops=5)
      ->  Seq Scan on tranheader t  (cost=0.00..55.15 rows=1923 width=16) (actual time=0.01..5.21 rows=1923 loops=5)
            Filter: (clientid = 6)
Total runtime: 308323.60 msec

***********************
I hope we can come up with something soon.....it seems this index scan is a big part of the problem. I'm still really curious why the disk reads are so few with the index scan. Let's hope I can get it near the 3 second time for MSSQL by Friday!

Roman Fail
Roman Fail wrote:
> The same result columns and JOINS are performed all day with variations on the WHERE clause;

Are there any WHERE clauses that all of these variations have in common? If so, the query can be reordered to contain explicit joins for those clauses and to let Postgres find the best plan for the other joins.

I know it is not the best solution, but sometimes I prefer finding the best join order myself. I then create several views returning the same values, but manually ordered for specific WHERE clauses.

Tomasz Myrta
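A sketch of the view approach described above; the view name and column list here are made up for illustration, and the explicit JOIN inside the view is what pins the join order in 7.3:

-- hypothetical view that fixes a join order known to work well when
-- callers filter on merchid and batchdate
CREATE VIEW batch_by_merchant AS
SELECT m.merchid, m.name AS merchantname, b.batchid, b.batchdate
FROM merchants m
JOIN batchheader b ON b.merchantid = m.merchantid;

-- callers supply only their WHERE clause
SELECT * FROM batch_by_merchant
WHERE merchid = '701252267' AND batchdate > '2002-12-15';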
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Since it *does* depend on one's point of view, would it be possible to
> > have control over this implemented in a session-defined variable (with
> > the default in the GUC, of course)?
>
> I have no objection to doing that --- anyone care to contribute code to
> make it happen? (I think the trick would be to fold plain-JOIN jointree
> entries into FROM-list items in planner.c, somewhere near the code that
> hoists sub-SELECTs into the main join tree. But I haven't tried it, and
> have no time to in the near future.)

I'm looking at the code now (the 7.2.3 code in particular, but I suspect for this purpose the code is likely to be very similar to the CVS tip), but it's all completely new to me and the developer documentation isn't very revealing of the internals.

The optimizer code (I've been looking especially at make_jointree_rel() and make_fromexpr_rel()) looks a bit tricky...it'll take me some time to completely wrap my brain around it. Any pointers to revealing documentation would be quite helpful!

--
Kevin Brown                          kevin@sysexperts.com
On Thu, 2003-01-16 at 03:40, Stephan Szabo wrote:
> > So here's the query, and another EXPLAIN ANALYZE to go with it
> > (executed after all setting changes). The same result columns and
> > JOINS are performed all day with variations on the WHERE clause; other
> > possible search columns are the ones that are indexed (see below).
> > The 4 tables that use LEFT JOIN only sometimes have matching records,
> > hence the OUTER join.
> >
> > EXPLAIN ANALYZE
> > SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
> > d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
> > m.name AS merchantname, c.cardtype, m.merchid,
> > p1.localtaxamount, p1.productidentifier, dr.avsresponse,
> > cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
> > ck.abaroutingno, ck.checkno
> > FROM tranheader t
> > INNER JOIN batchheader b ON t.tranheaderid = b.tranheaderid
> > INNER JOIN merchants m ON m.merchantid = b.merchantid
> > INNER JOIN batchdetail d ON d.batchid = b.batchid
> > INNER JOIN cardtype c ON d.cardtypeid = c.cardtypeid
> > LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
> > LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
> > LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
> > LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
> > WHERE t.clientid = 6
> > AND d.tranamount BETWEEN 500.0 AND 700.0

How much of the data in d has tranamount BETWEEN 500.0 AND 700.0? Do you have an index on d.tranamount?

> > AND b.batchdate > '2002-12-15'

Again - how much of the data satisfies batchdate > '2002-12-15'? Is there an index on it?

> > AND m.merchid = '701252267'

Ditto.

> > ORDER BY b.batchdate DESC
> > LIMIT 50

These two together make me think that perhaps b.batchdate BETWEEN '2002-12-15' AND '2003-12-12' could be better at making the optimiser see that a reverse index scan on b.batchdate would be the way to go.

> Well, you might get a little help by replacing the FROM with

--
Hannu Krosing <hannu@tm.ee>
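A sketch of the checks being asked about here, plus the bounded-range variant suggested above; the upper bound date is only an illustrative placeholder:

-- how selective are the two range conditions?
SELECT count(*) FROM batchdetail WHERE tranamount BETWEEN 500.0 AND 700.0;
SELECT count(*) FROM batchheader WHERE batchdate > '2002-12-15';

-- a bounded date range plus ORDER BY ... DESC LIMIT can favor a
-- (backward) index scan on batchheader_ix_batchdate_idx
SELECT batchid, batchdate
FROM batchheader
WHERE batchdate BETWEEN '2002-12-15' AND '2003-01-16'
ORDER BY batchdate DESC
LIMIT 50;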
On Thu, 2003-01-16 at 03:03, Roman Fail wrote:
> ***********************
> > Josh Berkus wrote:
> > Hey, Roman, how many records in BatchDetail, anyway?
>
> 23 million.

What are the indexes on batchdetail? Is there one on batchid and a separate one on tranamount? If so, what about dropping them and creating a single multi-segment index on (batchid, tranamount)? (A constraint can then enforce uniqueness on batchid.)

[rest of quoted message and EXPLAIN ANALYZE output snipped]
--
Ron Johnson, Jr.      mailto:ron.l.johnson@cox.net
Jefferson, LA USA     http://members.cox.net/ron.l.johnson

"Basically, I got on the plane with a bomb. Basically, I
 tried to ignite it. Basically, yeah, I intended to damage
 the plane."
   RICHARD REID, who tried to blow up American Airlines Flight 63
On Wed, Jan 15, 2003 at 03:30:55PM -0800, Roman Fail wrote:
> Thanks to everyone for the quick replies! I'm sure that my lack of skill
> with SQL queries is the main problem. What's strange to me is how MSSQL
> takes my bad queries and makes them look good anyway. It must have a
> real smart planner.

> Andrew Sullivan wrote:
> > First, the performance of foreign keys is flat-out awful in Postgres.
> > I suggest avoiding them if you can.
>
> I don't have any problem getting rid of FKs, especially if it might
> actually help performance. The nightly data import is well-defined

Sorry, I think I sent this too quickly. FKs make no difference to SELECT
performance, so if you're not doing updates and the like at the same time
as the SELECTs, there's no advantage. So you should leave the FKs in place.

> I think this is the most likely problem. I've read through Chapter
> 10 of the 7.3 docs, but I still don't feel like I know what would
> be a good order. How do you learn this stuff anyway? Trial and
> error?

Sorry, but yes.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                        +1 416 646 3304 x110
On Thu, 2003-01-16 at 07:29, Ron Johnson wrote:
> On Thu, 2003-01-16 at 03:03, Roman Fail wrote:
> > ***********************
> > > Josh Berkus wrote:
> > > Hey, Roman, how many records in BatchDetail, anyway?
> >
> > 23 million.
>
> What are the indexes on batchdetail? There's one on batchid and a
> separate one on tranamount? If so, what about dropping them and creating
> a single multi-segment index on "batchid, tranamount"? (A constraint can
> then enforce uniqueness on batchid.)

That's a good step. Once done, CLUSTER by that index -- might buy 10 to 20%
extra.

--
Rod Taylor <rbt@rbt.ca>
PGP Key: http://www.rbt.ca/rbtpub.asc
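Spelled out, Rod's suggestion comes to roughly the following (a sketch only:
the index name is made up, and on a 23-million-row table both the index
build and the CLUSTER -- which rewrites the whole table under an exclusive
lock -- will take a while):

    CREATE INDEX batchdetail_batchid_tranamount_idx
        ON batchdetail (batchid, tranamount);
    -- 7.3 spelling: CLUSTER indexname ON tablename
    CLUSTER batchdetail_batchid_tranamount_idx ON batchdetail;
    ANALYZE batchdetail;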
"Roman Fail" <rfail@posportal.com> writes: > SELECT ... > FROM tranheader t, batchheader b, merchants m, cardtype c, (batchdetail d > LEFT JOIN purc1 p1 on p1.batchdetailid=d.batchdetailid > LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid > LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid > LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid) > WHERE t.tranheaderid=b.tranheaderid > AND m.merchantid=b.merchantid > AND d.batchid=b.batchid > AND c.cardtypeid=d.cardtypeid > AND t.clientid = 6 > AND d.tranamount BETWEEN 500.0 AND 700.0 > AND b.batchdate > '2002-12-15' > AND m.merchid = '701252267' No no no ... this is even worse than before. Your big tables are batchdetail (d) and purc1 (p1). What you've got to do is arrange the computation so that those are trimmed to just the interesting records as soon as possible. The constraint on d.tranamount helps, but after that you proceed to join d to p1 *first*, before any of the other constraints can be applied. That's a huge join that you then proceed to throw away most of, as shown by the row counts in the EXPLAIN output. Note the parentheses I added above to show how the system interprets your FROM clause. Since dr,cr,ck are contributing nothing to elimination of records, you really want them joined last, not first. What would probably work better is SELECT ... FROM (SELECT ... FROM tranheader t, batchheader b, merchants m, cardtype c, batchdetail d WHERE t.tranheaderid=b.tranheaderid AND m.merchantid=b.merchantid AND d.batchid=b.batchid AND c.cardtypeid=d.cardtypeid AND t.clientid = 6 AND d.tranamount BETWEEN 500.0 AND 700.0 AND b.batchdate > '2002-12-15' AND m.merchid = '701252267') ss LEFT JOIN purc1 p1 on p1.batchdetailid=ss.batchdetailid LEFT JOIN direct dr ON dr.batchdetailid = ss.batchdetailid LEFT JOIN carrental cr ON cr.batchdetailid = ss.batchdetailid LEFT JOIN checks ck ON ck.batchdetailid = ss.batchdetailid which lets the system get the useful restrictions applied before it has to finish expanding out the star query. Since cardtype isn't contributing any restrictions, you might think about moving it into the LEFT JOIN series too (although I think the planner will choose to join it last in the subselect, anyway). regards, tom lane
Kevin Brown <kevin@sysexperts.com> writes: > I'm looking at the code now (the 7.2.3 code in particular, but I > suspect for this purpose the code is likely to be very similar to the > CVS tip), but it's all completely new to me and the developer > documentation isn't very revealing of the internals. The optimizer > code (I've been looking especially at make_jointree_rel() and > make_fromexpr_rel()) looks a bit tricky...it'll take me some time to > completely wrap my brain around it. Any pointers to revealing > documentation would be quite helpful! src/backend/optimizer/README is a good place to start. I'd recommend working with CVS tip; there is little point in doing any nontrivial development in the 7.2 branch. You'd have to port it forward anyway. regards, tom lane
I was surprised to hear that JOIN syntax constrained the planner. We have a
policy of using JOIN syntax to describe the table relationships and WHERE
clauses to describe the selection process for our queries. It was our
understanding that the JOIN syntax was introduced to support this approach,
but not to constrain the planner.

Is there any way to tell the planner to consider JOIN syntax as equivalent
to WHERE clauses and to not use them to force the planner down a specific
path? Can we get that added as an option (and then made available to us
JDBC folks as a URL parameter)? It would make my team very happy :-).

I think that making this an option will help all those migrating to
Postgres who did not expect that JOINs forced the planner down specific
plans. Is it possible/reasonable to add?

Charlie

Tom Lane wrote:
> "Roman Fail" <rfail@posportal.com> writes:
> > Thanks to everyone for the quick replies! I'm sure that my lack of
> > skill with SQL queries is the main problem. What's strange to me is
> > how MSSQL takes my bad queries and makes them look good anyway. It
> > must have a real smart planner.
>
> I think more likely the issue is that your use of JOIN syntax is forcing
> Postgres into a bad plan. MSSQL probably doesn't assign any semantic
> significance to the use of "a JOIN b" syntax as opposed to "FROM a, b"
> syntax. Postgres does. Whether this is a bug or a feature depends on
> your point of view --- but there are folks out there who find it to be
> a life-saver. You can find some explanations at
> http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/explicit-joins.html
>
> > Is it pretty much universally accepted that I should drop all my
> > foreign keys?
>
> No. They don't have any effect on SELECT performance in Postgres.
> They will impact update speed, but that's not your complaint (at the
> moment). Don't throw away data integrity protection until you know
> you need to.
>
> regards, tom lane

--
Charles H. Woloszynski
ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015
tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com
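The difference Charlie is asking about is easiest to see side by side: in
7.3 the two queries below return the same rows, but only the second leaves
the planner free to choose the join order (generic table names, purely for
illustration):

    -- join order is dictated by the query: a to b first, then c
    SELECT * FROM a JOIN b ON a.id = b.id JOIN c ON c.bid = b.id;

    -- join order is chosen by the planner
    SELECT * FROM a, b, c WHERE a.id = b.id AND c.bid = b.id;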
Is this a TODO item? --------------------------------------------------------------------------- Charles H. Woloszynski wrote: > I was surprised to hear that JOIN syntax constrained the planner. We > have a policy of using JOIN syntax to describe the table relationships > and where clauses to describe the selection process for our queries. It > was our understanding that the JOIN syntax was introduced to support > this approach, but not to contrain the planner. > > Is there any way to sell the planner to consider JOIN syntax as > equivalent to WHERE clauses and to not use them to force the planner > down a specific path? Can we get that added as an option (and then made > available to use JDBC folks as a URL parameter). It would make my team > very happy :-). > > > I think that making this an option will help all those migrating to > Postgres who did not expect that JOINs forced the planner down specific > plans. Is it possible/reasonable to add? > > Charlie > > > Tom Lane wrote: > > >"Roman Fail" <rfail@posportal.com> writes: > > > > > >>Thanks to everyone for the quick replies! I'm sure that my lack of > >>skill with SQL queries is the main problem. What's strange to me is > >>how MSSQL takes my bad queries and makes them look good anyway. It > >>must have a real smart planner. > >> > >> > > > >I think more likely the issue is that your use of JOIN syntax is forcing > >Postgres into a bad plan. MSSQL probably doesn't assign any semantic > >significance to the use of "a JOIN b" syntax as opposed to "FROM a, b" > >syntax. Postgres does. Whether this is a bug or a feature depends on > >your point of view --- but there are folks out there who find it to be > >a life-saver. You can find some explanations at > >http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/explicit-joins.html > > > > > > > >>Is it pretty much universally accepted that I should drop all my > >>foreign keys? > >> > >> > > > >No. They don't have any effect on SELECT performance in Postgres. > >They will impact update speed, but that's not your complaint (at the > >moment). Don't throw away data integrity protection until you know > >you need to. > > > > regards, tom lane > > > >---------------------------(end of broadcast)--------------------------- > >TIP 4: Don't 'kill -9' the postmaster > > > > > > -- > > > Charles H. Woloszynski > > ClearMetrix, Inc. > 115 Research Drive > Bethlehem, PA 18015 > > tel: 610-419-2210 x400 > fax: 240-371-3256 > web: www.clearmetrix.com > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, 16 Jan 2003, Roman Fail wrote: > *********************** Hmm, I wonder if maybe we're going about things backwards in this case. Does the original database have something like EXPLAIN that'll show what it's doing? Perhaps that'll give an idea. > > What does vacuum verbose batchdetail give you (it'll give an idea of pages anyway) > > trans=# VACUUM VERBOSE batchdetail; > INFO: --Relation public.batchdetail-- > INFO: Pages 1669047: Changed 0, Empty 0; Tup 23316674: Vac 0, Keep 0, UnUsed 0. So about 12 gigabytes of data, then? > It seems to me that the big, big isolated problem is the index scan on > batchdetail.tranamount. During this small query, 'sar -b' showed > consistent 90,000 block reads/sec. (contrast with only 6,000 with > larger query index scan). 'top' shows the CPU is at 20% user, 30% > system the whole time (contrast with 2% total in larger query above). Note that in this case below, you've gotten a sequence scan not an index scan. (similar to setting enable_indexscan=off performance) > This results here still seem pretty bad (although not as bad as > above), but I still don't know what is the bottleneck. And the > strange sar stats are confusing me. > > EXPLAIN ANALYZE SELECT * FROM batchdetail WHERE tranamount BETWEEN 300 AND 499; > Seq Scan on batchdetail (cost=0.00..2018797.11 rows=783291 width=440) (actual time=45.66..283926.58 rows=783687 loops=1) > Filter: ((tranamount >= 300::numeric) AND (tranamount <= 499::numeric)) > Total runtime: 285032.47 msec I'd assume that tranamount values are fairly randomly distributed throughout the table, right? It takes about 5 minutes for the system to read the entire table and more for the index scan, so you're probably reading most of the table randomly and the index as well. What values on batchdetail do you use in query where clauses regularly? It's possible that occasional clusters would help if this was the main field you filtered on. The cluster itself is time consuming, but it might help make the index scans actually read fewer pages.
Roman, Tom:

> No no no ... this is even worse than before. Your big tables are
> batchdetail (d) and purc1 (p1). What you've got to do is arrange the
> computation so that those are trimmed to just the interesting records
> as soon as possible.

When joining disproportionally large tables, I've also had some success
with the following method:

SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
    d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
    m.name AS merchantname, c.cardtype, m.merchid,
    p1.localtaxamount, p1.productidentifier, dr.avsresponse,
    cr.checkoutdate, cr.noshowindicator, ck.checkingacctno,
    ck.abaroutingno, ck.checkno
FROM tranheader t
JOIN batchheader b ON (t.tranheaderid = b.tranheaderid
    AND b.batchdate > '2002-12-15')
JOIN merchants m ON (m.merchantid = b.merchantid
    AND m.merchid = '701252267')
JOIN batchdetail d ON (d.batchid = b.batchid
    AND d.tranamount BETWEEN 500 and 700)
JOIN cardtype c ON d.cardtypeid = c.cardtypeid
LEFT JOIN purc1 p1 ON p1.batchdetailid = d.batchdetailid
LEFT JOIN direct dr ON dr.batchdetailid = d.batchdetailid
LEFT JOIN carrental cr ON cr.batchdetailid = d.batchdetailid
LEFT JOIN checks ck ON ck.batchdetailid = d.batchdetailid
WHERE t.clientid = 6
    AND d.tranamount BETWEEN 500.0 AND 700.0
    AND b.batchdate > '2002-12-15'
    AND m.merchid = '701252267'
ORDER BY b.batchdate DESC
LIMIT 50

This could be re-arranged some, but I think you get the idea ... I've been
able, in some queries, to get the planner to use a better and faster join
strategy by repeating my WHERE conditions in the JOIN criteria.

-Josh
Roman, > > Hey, Roman, how many records in BatchDetail, anyway? > > 23 million. And MSSQL is returning results in 3 seconds? I find that a bit hard to believe, unless this query is called repeatedly and that's the figure for the last call, where the records are being cached. I'll have to look at your hardware descriptions again. > It seems to me that the big, big isolated problem is the index scan > on batchdetail.tranamount. Nope. This was a misimpression caused by batchdetail waiting for a bunch of other processes to complete. Sometimes the parallelizing gives me a wrong impression of what's holding up the query. Sorry if I confused you. > I hope we can come up with something soon.....it seems this index > scan is a big part of the problem. I'm still really curious why the > disk reads are so few with the index scan. Let's hope I can get it > near the 3 second time for MSSQL by Friday! Um, Roman, keep in mind this is a mailing list. I'm sure that everyone here is happy to give you the tools to figure out how to fix things, but only in a DIY fashion, and not on your schedule. If you have a deadline, you'd better hire some paid query/database tuning help. DB Tuning experts .... whether on MSSQL or Postgres ... run about $250/hour last I checked. -Josh Berkus
"Josh Berkus" <josh@agliodbs.com> writes: > This could be re-arranged some, but I think you get the idea ... I've > been able, in some queries, to get the planner to use a better and > faster join strategy by repeating my WHERE conditions in the JOIN > criteria. Hm. It shouldn't be necessary to do that --- the planner should be able to push down the WHERE conditions to the right place without that help. The list of explicit JOINs as you have here is a good way to proceed *if* you write the JOINs in an appropriate order for implementation. I believe the problem with Roman's original query was that he listed the JOINs in a bad order. Unfortunately I didn't keep a copy of that message, and the list archives seem to be a day or more behind... but at least for these WHERE conditions, it looks like the best bet would to join m to b (I'm assuming m.merchid is unique), then to t, then to d, then add on the others. regards, tom lane
On Thu, 16 Jan 2003, Josh Berkus wrote: > Roman, > > > > Hey, Roman, how many records in BatchDetail, anyway? > > > > 23 million. > > And MSSQL is returning results in 3 seconds? I find that a bit hard > to believe, unless this query is called repeatedly and that's the > figure for the last call, where the records are being cached. I'll > have to look at your hardware descriptions again. > > > It seems to me that the big, big isolated problem is the index scan > > on batchdetail.tranamount. > > Nope. This was a misimpression caused by batchdetail waiting for a > bunch of other processes to complete. Sometimes the parallelizing > gives me a wrong impression of what's holding up the query. Sorry if I > confused you. I'm still not sure that it isn't a big part given that the time went down by a factor of about 4 when index scans were disabled and a sequence scan was done and that a sequence scan over the table with no other tables joined looked to take about 5 minutes itself and the difference between that seqscan and the big query was only about 20 seconds when enable_indexscan was off unless I'm misreading those results.
Tom,

> The list of explicit JOINs as you have here is a good way to proceed
> *if* you write the JOINs in an appropriate order for implementation.
> I believe the problem with Roman's original query was that he listed
> the JOINs in a bad order. Unfortunately I didn't keep a copy of that
> message, and the list archives seem to be a day or more behind...
> but at least for these WHERE conditions, it looks like the best bet
> would be to join m to b (I'm assuming m.merchid is unique), then to t,
> then to d, then add on the others.

I realize that I've contributed nothing other than bug reports to the
parser design. But shouldn't Postgres, given a free hand, figure out the
above automatically? I'd be embarrassed if MS could one-up us in parser
planning anywhere, theirs sucks on sub-selects ....

-Josh Berkus
"Josh Berkus" <josh@agliodbs.com> writes: >> but at least for these WHERE conditions, it looks like the best bet >> would to join m to b (I'm assuming m.merchid is unique), then to t, >> then to d, then add on the others. > I realize that I've contributed nothing other than bug reports to the > parser design. But shouldn't Postgres, given a free hand, figure out > the above automatically? I believe it will. So far I've not seen an EXPLAIN from a query that was structured to give it a free hand. As noted elsewhere, the fact that we allow JOIN syntax to constrain the planner is a real pain if you are accustomed to databases that don't do that. On the other hand, it's a real lifesaver for people who need to pare the planning time for dozen-way joins; it was only a day or two back in this same mailing list that we last had a discussion about that end of the problem. So even though it started out as an implementation shortcut rather than an intended feature, I'm loathe to just disable the behavior entirely. regards, tom lane
On Wed, 15 Jan 2003, Roman Fail wrote: I just had new thoughts. If you make an index on batchdetail(batchid) does that help? I realized that it was doing a merge join to join d and the (t,b,m) combination when it was expecting 3 rows out of the latter, and batchid is presumably fairly selective on the batchdetail table, right? I'd have expected a nested loop over the id column, but it doesn't appear you have an index on it in batchdetail. Then I realized that batchheader.batchid and batchdetail.batchid don't even have the same type, and that's probably something else you'd need to fix. > batchheader has 2.6 million records: > CREATE TABLE public.batchheader ( > batchid int8 DEFAULT nextval('"batchheader_batchid_key"'::text) NOT NULL, > And here's batchdetail too, just for kicks. 23 million records. > CREATE TABLE public.batchdetail ( > batchid int4,
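Stephan's first suggestion is a one-liner (the index name is illustrative);
with matching types on both sides it gives the planner the option of a
nested loop with an inner index scan on d.batchid instead of sorting
roughly 370,000 batchdetail rows for a merge join:

    CREATE INDEX batchdetail_ix_batchid_idx ON batchdetail (batchid);
    ANALYZE batchdetail;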
> Josh Berkus wrote:
> And MSSQL is returning results in 3 seconds? I find that a bit hard
> to believe, unless this query is called repeatedly and that's the
> figure for the last call, where the records are being cached. I'll
> have to look at your hardware descriptions again.

Hardware-wise, the Postgres server is a hot rod and MSSQL is a basic
vanilla server. I changed all the WHERE clauses to radically different
values and couldn't get it to take more than 5 seconds on MSSQL. Most of
its cost savings seems to come from some kind of "Table Spool/Lazy Spool"
in its execution plan, which looks to me like it only exists for the life
of the query. You can read more about this at:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_1m7g.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_7rjg.asp
Maybe there are some good ideas here for Postgres. Unfortunately, the
MSSQL Execution Plan is displayed graphically, and I can't figure out a way
to get it to text without typing it all. I could do some screen shots if
you really want to see it.

> Stephan Szabo wrote:
> I'd assume that tranamount values are fairly randomly distributed
> throughout the table, right? It takes about 5 minutes for the
> system to read the entire table and more for the index scan, so
> you're probably reading most of the table randomly and the index
> as well.
> What values on batchdetail do you use in query where clauses regularly?

Yes, tranamount values are randomly distributed. I don't understand why an
index scan would be "random"; isn't the whole point of an index to have an
ordered reference into the data? batchdetail has 5 columns that can be in
the WHERE clause, all of which are indexed. None is more likely than the
other to be searched, so a clustered index doesn't make much sense to me.
The whole thing needs to be fast.

>> Nope. This was a misimpression caused by batchdetail waiting for a
>> bunch of other processes to complete. Sometimes the parallelizing
>> gives me a wrong impression of what's holding up the query. Sorry if I
>> confused you.
>
> I'm still not sure that it isn't a big part given that the time went down
> by a factor of about 4 when index scans were disabled and a sequence scan
> was done and that a sequence scan over the table with no other tables
> joined looked to take about 5 minutes itself and the difference between
> that seqscan and the big query was only about 20 seconds when
> enable_indexscan was off unless I'm misreading those results.

You are not misreading the results. There was a huge difference. Nobody
has ever made note of it, but this still seems very odd to me:
*** 'sar -b' during the query
with index scan: 6,000 block reads/sec
with seq scan: 95,000 block reads/sec

Tom, here is the EXPLAIN for your suggested version of the query with
enable_indexscan=on. I performed the same query with enable_indexscan=off
and total runtime was *much* better: 296174.60 msec. By the way, thank you
for your detailed description of how the JOIN order affects the outcome -
I understand much better now.
EXPLAIN ANALYZE SELECT ss.batchdate, ss.batchdetailid, ss.bankno, ss.trandate, ss.tranamount, ss.submitinterchange, ss.authamount, ss.authno, ss.cardtypeid, ss.mcccode, ss.name AS merchantname, ss.cardtype, ss.merchid, p1.localtaxamount, p1.productidentifier, dr.avsresponse, cr.checkoutdate, cr.noshowindicator, ck.checkingacctno, ck.abaroutingno, ck.checkno FROM (SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount, d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode, m.name, c.cardtype, m.merchid FROM tranheader t, batchheader b, merchants m, cardtype c, batchdetail d WHERE t.tranheaderid=b.tranheaderid AND m.merchantid=b.merchantid AND d.batchid=b.batchid AND c.cardtypeid=d.cardtypeid AND t.clientid = 6 AND d.tranamount BETWEEN 500.0 AND 700.0 AND b.batchdate > '2002-12-15' AND m.merchid = '701252267') ss LEFT JOIN purc1 p1 on p1.batchdetailid=ss.batchdetailid LEFT JOIN direct dr ON dr.batchdetailid = ss.batchdetailid LEFT JOIN carrental cr ON cr.batchdetailid = ss.batchdetailid LEFT JOIN checks ck ON ck.batchdetailid = ss.batchdetailid ORDER BY ss.batchdate DESC LIMIT 50 Limit (cost=1601637.75..1601637.75 rows=1 width=285) (actual time=1221606.41..1221606.42 rows=5 loops=1) -> Sort (cost=1601637.75..1601637.75 rows=1 width=285) (actual time=1221606.40..1221606.41 rows=5 loops=1) Sort Key: b.batchdate -> Nested Loop (cost=1543595.18..1601637.74 rows=1 width=285) (actual time=1204815.02..1221606.27 rows=5 loops=1) Join Filter: ("inner".batchdetailid = "outer".batchdetailid) -> Nested Loop (cost=1543595.18..1601581.23 rows=1 width=247) (actual time=1204792.38..1221560.42 rows=5loops=1) Join Filter: ("inner".batchdetailid = "outer".batchdetailid) -> Nested Loop (cost=1543595.18..1601581.22 rows=1 width=230) (actual time=1204792.35..1221560.27 rows=5loops=1) Join Filter: ("inner".batchdetailid = "outer".batchdetailid) -> Nested Loop (cost=1543595.18..1601581.21 rows=1 width=221) (actual time=1204792.31..1221560.09rows=5 loops=1) Join Filter: ("inner".batchdetailid = "outer".batchdetailid) -> Nested Loop (cost=1543595.18..1545529.17 rows=1 width=202) (actual time=1195376.48..1195578.86rows=5 loops=1) Join Filter: ("inner".tranheaderid = "outer".tranheaderid) -> Nested Loop (cost=1543595.18..1545449.98 rows=1 width=186) (actual time=1195370.72..1195536.53rows=5 loops=1) Join Filter: ("inner".cardtypeid = "outer".cardtypeid) -> Merge Join (cost=1543595.18..1545448.76 rows=1 width=172) (actual time=1195311.88..1195477.32rows=5 loops=1) Merge Cond: ("outer".batchid = "inner".batchid) -> Sort (cost=476.17..476.18 rows=4 width=102) (actual time=30.57..30.59rows=17 loops=1) Sort Key: b.batchid -> Nested Loop (cost=0.00..476.14 rows=4 width=102) (actual time=25.21..30.47rows=17 loops=1) -> Index Scan using merchants_ix_merchid_idx on merchantsm (cost=0.00..5.65 rows=1 width=78) (actual time=23.81..23.82 rows=1 loops=1) Index Cond: (merchid = '701252267'::character varying) -> Index Scan using batchheader_ix_merchantid_idx on batchheaderb (cost=0.00..470.30 rows=15 width=24) (actual time=1.38..6.55 rows=17 loops=1) Index Cond: ("outer".merchantid = b.merchantid) Filter: (batchdate > '2002-12-15 00:00:00'::timestampwithout time zone) -> Sort (cost=1543119.01..1544045.79 rows=370710 width=70) (actual time=1194260.51..1194892.79rows=368681 loops=1) Sort Key: d.batchid -> Index Scan using batchdetail_ix_tranamount_idx on batchdetaild (cost=0.00..1489103.46 rows=370710 width=70) (actual time=5.26..1186051.44 rows=370307 loops=1) Index Cond: ((tranamount >= 500.0) AND 
(tranamount <= 700.0)) -> Seq Scan on cardtype c (cost=0.00..1.10 rows=10 width=14) (actual time=11.77..11.79rows=10 loops=5) -> Seq Scan on tranheader t (cost=0.00..55.15 rows=1923 width=16) (actual time=0.02..5.46rows=1923 loops=5) Filter: (clientid = 6) -> Seq Scan on purc1 p1 (cost=0.00..44285.35 rows=941335 width=19) (actual time=10.79..3763.56rows=938770 loops=5) -> Seq Scan on direct dr (cost=0.00..0.00 rows=1 width=9) (actual time=0.00..0.00 rows=0 loops=5) -> Seq Scan on carrental cr (cost=0.00..0.00 rows=1 width=17) (actual time=0.00..0.00 rows=0 loops=5) -> Seq Scan on checks ck (cost=0.00..40.67 rows=1267 width=38) (actual time=0.77..7.15 rows=1267 loops=5) Total runtime: 1221645.52 msec > Tomasz Myrta wrote: > Are there any where clauses which all of theses variation have? Yes.....WHERE clientid = ? will appear in every query. The others are present based on user input. > Ron Johnson wrote: > What are the indexes on batchdetail? > There's one on batchid and a seperate one on tranamount? > If so, what about dropping them and create a single multi-segment > index on "batchid, tranamount". (A constraint can then enforce > uniqueness on batchid. There is no index on batchid, I think it is a good idea to create one. Stephan also suggested this. After I try the singlebatchid index, I might try to multi-segment index idea as well. I'll post results later today. > Stephan Szabo wrote: > Then I realized that batchheader.batchid and > batchdetail.batchid don't even have the same > type, and that's probably something else you'd > need to fix. Yes, that's a mistake on my part....batchdetail(batchid) should be an int8. It looks to me like converting this datatypecan't be done with a single ALTER TABLE ALTER COLUMN statement.....so I guess I'll work around it with an ADD, UPDATE,DROP, and RENAME. > Josh Berkus wrote: > Um, Roman, keep in mind this is a mailing list. I'm sure that > everyone here is happy to give you the tools to figure out how to fix > things, but only in a DIY fashion, and not on your schedule. I hate being defensive, but I don't remember saying that I expect anyone to fix my problems for me on my schedule. *I* hopethat *I* can get this done by Friday, because otherwise my boss is going to tell me to dump Postgres and install MSSQLon the server. I only mention this fact because it's a blow against PostgreSQL's reputation if I have to give up. There is no pressure on you, and I apologize if something I said sounded like whining. I am VERY grateful for the time that all of you have given to this problem. Roman Fail Sr. Web Application Programmer POS Portal, Inc.
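The column-type swap Roman describes at the end (ADD, UPDATE, DROP, RENAME)
would look roughly like this in 7.3 (a sketch only: the temporary column
name is made up, the UPDATE rewrites all 23 million rows, and a VACUUM FULL
afterwards, as Stephan notes below, is a good idea):

    ALTER TABLE batchdetail ADD COLUMN batchid_new int8;
    UPDATE batchdetail SET batchid_new = batchid;
    ALTER TABLE batchdetail DROP COLUMN batchid;
    ALTER TABLE batchdetail RENAME COLUMN batchid_new TO batchid;
    VACUUM FULL batchdetail;
    -- then build the index on batchid discussed above
    CREATE INDEX batchdetail_ix_batchid_idx ON batchdetail (batchid);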
On Thu, 16 Jan 2003, Roman Fail wrote: > > Stephan Szabo wrote: > > I'd assume that tranamount values are fairly randomly distributed > > throughout the table, right? It takes about 5 minutes for the > > system to read the entire table and more for the index scan, so > > you're probably reading most of the table randomly and the index > > as well. > > What values on batchdetail do you use in query where clauses regularly? > Yes, tranamount values are randomly distributed. I don't understand > why an index scan would be "random", isn't the whole point of an index > to have an ordered reference into the data? batchdetail has 5 columns > that can be in the WHERE clause, all of which are indexed. None is > more likely than the other to be searched, so a clustered index > doesn't make much sense to me. The whole thing needs to be fast. Yeah, in that case a clustered index doesn't help. Indexes give you an ordered way to find the rows that meet a condition, but say you had three rows in your table in this order (note that this is an amazing oversimplification): (1,'a') (2,'b') (0,'c') And you want to scan the index from values with the first number between 0 and 2. It reads the third row, then the first, then the second (to get the letter associated). Between those reads, it's got to seek back and forth through the heap file and the order in which it hits them is pretty random seeming (to the kernel). > > Ron Johnson wrote: > > What are the indexes on batchdetail? > > There's one on batchid and a seperate one on tranamount? > > If so, what about dropping them and create a single multi-segment > > index on "batchid, tranamount". (A constraint can then enforce > > uniqueness on batchid. > There is no index on batchid, I think it is a good idea to create > one. Stephan also suggested this. After I try the single batchid > index, I might try to multi-segment index idea as well. I'll post > results later today. I think we may all have misread the index list to include an index on batchid. Also you have two indexes on batchdetailid right now (primary key also creates one) which added to the confusion. > > Stephan Szabo wrote: > > Then I realized that batchheader.batchid and > > batchdetail.batchid don't even have the same > > type, and that's probably something else you'd > > need to fix. > > Yes, that's a mistake on my part....batchdetail(batchid) should be an > int8. It looks to me like converting this datatype can't be done with > a single ALTER TABLE ALTER COLUMN statement.....so I guess I'll work > around it with an ADD, UPDATE, DROP, and RENAME. Don't forget to do a vacuum full in there as well.
"Roman Fail" <rfail@posportal.com> writes: > -> Merge Join (cost=1543595.18..1545448.76 rows=1 width=172) (actual time=1195311.88..1195477.32rows=5 loops=1) > Merge Cond: ("outer".batchid = "inner".batchid) > -> Sort (cost=476.17..476.18 rows=4 width=102) (actual time=30.57..30.59rows=17 loops=1) > Sort Key: b.batchid > -> Nested Loop (cost=0.00..476.14 rows=4 width=102) (actual time=25.21..30.47rows=17 loops=1) > -> Index Scan using merchants_ix_merchid_idx on merchantsm (cost=0.00..5.65 rows=1 width=78) (actual time=23.81..23.82 rows=1 loops=1) > Index Cond: (merchid = '701252267'::character varying) > -> Index Scan using batchheader_ix_merchantid_idx on batchheaderb (cost=0.00..470.30 rows=15 width=24) (actual time=1.38..6.55 rows=17 loops=1) > Index Cond: ("outer".merchantid = b.merchantid) > Filter: (batchdate > '2002-12-15 00:00:00'::timestampwithout time zone) > -> Sort (cost=1543119.01..1544045.79 rows=370710 width=70) (actualtime=1194260.51..1194892.79 rows=368681 loops=1) > Sort Key: d.batchid > -> Index Scan using batchdetail_ix_tranamount_idx on batchdetaild (cost=0.00..1489103.46 rows=370710 width=70) (actual time=5.26..1186051.44 rows=370307 loops=1) > Index Cond: ((tranamount >= 500.0) AND (tranamount <= 700.0)) The expensive part of this is clearly the sort and merge of the rows extracted from batchdetail. The index on tranamount is not helping you at all, because the condition (between 500 and 700) isn't very selective --- it picks up 370000 rows --- and since those rows are totally randomly scattered in the table, you do a ton of random seeking. It's actually faster to scan the table linearly --- that's why enable_indexscan=off was faster. However, I'm wondering why the thing picked this plan, when it knew it would get only a few rows out of the m/b join (estimate 4, actual 17, not too bad). I would have expected it to use an inner indexscan on d.batchid. Either you've not got an index on d.batchid, or there's a datatype mismatch that prevents the index from being used. What are the datatypes of d.batchid and b.batchid, exactly? If they're not the same, either make them the same or add an explicit coercion to the query, like WHERE d.batchid = b.batchid::typeof_d_batchid regards, tom lane
Tom said:
> datatype mismatch that prevents the index from being used. What are the
> datatypes of d.batchid and b.batchid, exactly? If they're not the same,
> either make them the same or add an explicit coercion to the query, like
> WHERE d.batchid = b.batchid::typeof_d_batchid

It can be a source of the problem. I found in one of Roman's mails that
batchid is declared as int8 in the master table and as int4 in the detail
table.

Regards,
Tomasz Myrta
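Given those two declarations, Tom's explicit coercion would read as below
(an illustration only; it assumes every batchheader.batchid value actually
fits in an int4, and it only pays off once there is an index on d.batchid
for the planner to use):

    WHERE ...
      AND d.batchid = b.batchid::int4   -- cast the int8 side down to match d.batchid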
Tom and Tomasz:
I have to change the datatype of batchdetail.batchid from int4 to int8.
After over 4 hours, the UPDATE transfer from the old column to new has not
yet completed. After that I still have to build a new index and run VACUUM
FULL. When that is all done I'll re-run the various queries, including a
specific small one that Josh requested.

Chad Thompson suggested that I add single quotes around the literals in the
WHERE clause, which sounded like a great idea based on his experience.
Unfortunately, it did not make the query any faster. But read on!

For kicks, I tried this simple query, which should happen in an instant.
It is the first row in the table.

EXPLAIN ANALYZE select batchdetailid from batchdetail where batchdetailid = 27321;
Seq Scan on batchdetail (cost=0.00..1960485.43 rows=1 width=8) (actual time=17.58..264303.76 rows=1 loops=1)
  Filter: (batchdetailid = 27321)
Total runtime: 264303.87 msec

Does it make sense to do a sequence scan when the primary key index is
available? Even so, it's still a pretty horrible time given the hardware.

HOWEVER.....look at this:

EXPLAIN ANALYZE select batchdetailid from batchdetail where batchdetailid = 27321::bigint;
Index Scan using batchdetail_pkey on batchdetail (cost=0.00..4.13 rows=1 width=8) (actual time=0.03..0.03 rows=1 loops=1)
  Index Cond: (batchdetailid = 27321::bigint)
Total runtime: 0.07 msec

It sort of feels like a magic moment. I went back and looked through a lot
of the JOIN columns and found that I was mixing int4 with int8 in a lot of
them. All of these tables (except batchdetail) were migrated using
pgAdminII's migration wizard, so I didn't really give a hard look at all
the data types matching up since it has a nice data map (I used the
defaults except for the money type). Now I think I'm just going to drop
the entire database and reload the data from scratch, making sure that the
data types are mapped exactly right.

Correct me if I'm wrong, but int4 only ranges from negative 2 billion to
positive 2 billion. All the primary keys for my tables would fit in this
range with the exception of batchdetail, which could conceivably grow
beyond 2 billion someday (although I'd be archiving a lot of it when it got
that big). Maybe I just shouldn't worry about it for now and make
everything int4 for simplicity. I doubt I will accomplish all this on
Friday, but I'll give a full report once I get it all reloaded.

> Stephan Szabo wrote:
> Also you have two indexes on batchdetailid right now (primary key
> also creates one) which added to the confusion.

The 7.3.1 docs for CREATE TABLE don't mention anything about automatic
index creation for a PRIMARY KEY. I didn't see any PK indexes via
pgAdminII, so I read this line from the docs and decided to create them
separately.
"Technically, PRIMARY KEY is merely a combination of UNIQUE and NOT NULL"
However, this query proves you are right:

trans=# select relname, relpages, indisunique, indisprimary from pg_class, pg_index
trans-# where indexrelid in (37126739, 8604257) and pg_class.oid = pg_index.indexrelid;
             relname              | relpages | indisunique | indisprimary
----------------------------------+----------+-------------+--------------
 batchdetail_pkey                 |   121850 | t           | t
 batchdetail_ix_batchdetailid_idx |    63934 | f           | f

All other columns in the two tables are identical for these two indexes.
So now I've gone through and deleted all of these duplicate indexes I
created (and then a VACUUM FULL). Perhaps an extra sentence in the docs
might prevent someone else from making the same mistake as I?
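A more general form of Roman's catalog query, which lists every index on a
table without having to know the index OIDs first, is something like this
(a sketch using the same system catalogs):

    SELECT ic.relname AS indexname, ic.relpages, i.indisunique, i.indisprimary
      FROM pg_class tc, pg_index i, pg_class ic
     WHERE tc.relname = 'batchdetail'
       AND i.indrelid = tc.oid
       AND ic.oid = i.indexrelid;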
*** Current postgresql.conf settings: tcpip_socket=true shared_buffers = 131072 max_fsm_relations = 10000 max_fsm_pages = 2000000 sort_mem = 32768 default_statistics_target = 30 Thanks again for all your help! Roman Fail Sr. Web Application Developer POS Portal, Inc.
"Roman Fail" <rfail@posportal.com> writes: > shared_buffers = 131072 Yipes! Try about a tenth that much. Or less. regards, tom lane
On Thu, 16 Jan 2003, Roman Fail wrote: > > Stephan Szabo wrote: > > Also you have two indexes on batchdetailid right now (primary key > > also creates one) which added to the confusion. > > The 7.3.1 docs for CREATE TABLE don't mention anything about automatic > index creation for a PRIMARY KEY. I didn't see any PK indexes via > pgAdminII, so I read this line from the docs and decided to create > them separately. > "Technically, PRIMARY KEY is merely a combination of UNIQUE and NOT NULL" Right, but the implementation of UNIQUE constraints in postgresql right now is through a unique index. That's not necessarily a guarantee for the future, but for right now you can rely on it.
I'd love to see this as a TODO item, but I am hardly one to add to the
list...

Charlie

Bruce Momjian wrote:
> Is this a TODO item?

--
Charles H. Woloszynski
ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015
tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com
On Thu, 16 Jan 2003, Roman Fail wrote: > > HOWEVER.....look at this: > EXPLAIN ANALYZE select batchdetailid from batchdetail where batchdetailid = 27321::bigint; > Index Scan using batchdetail_pkey on batchdetail (cost=0.00..4.13 rows=1 width=8) (actual time=0.03..0.03 rows=1 loops=1) > Index Cond: (batchdetailid = 27321::bigint) > Total runtime: 0.07 msec > We had this happen to us - we had a serial8 column (int8) and our query was straight forward where id = 12345; which ran craptacularly. After much head banging and cursing I had tried where id = '12345' and it magically worked. I think the parser is interpreting a "number" to be an int4 instead of int8. (instead of quotes you can also cast via 12345::int8 like you did) Perhaps this should go on the TODO - when one side is an int8 and the other is a literal number assume the number to be int8 instead of int4? ------------------------------------------------------------------------------ Jeff Trout <jeff@jefftrout.com> http://www.jefftrout.com/ Ronald McDonald, with the help of cheese soup, controls America from a secret volkswagon hidden in the past -------------------------------------------------------------------------------
Jeff <threshar@torgo.978.org> writes: > Perhaps this should go on the TODO - when one side is an int8 and the > other is a literal number assume the number to be int8 instead of int4? It's been on TODO for so long that it's buried near the bottom. * Allow SELECT * FROM tab WHERE int2col = 4 to use int2col index, int8, float4, numeric/decimal too [optimizer] This behavior interacts with enough other stuff that we can't just change it willy-nilly. See many past discussions in the pghackers archives if you want details. A recent example of a promising-looking fix crashing and burning is http://fts.postgresql.org/db/mw/msg.html?mid=1357121 regards, tom lane
On Fri, Jan 17, 2003 at 09:00:19AM -0500, Jeff wrote: > Perhaps this should go on the TODO - when one side is an int8 and the > other is a literal number assume the number to be int8 instead of int4? Actually, this is a broader problem having to do with type coercion. There are a couple of TODO items which refer to this, it looks to me, but in any case there has been _plenty_ of discussion on -general and -hackers about what's wrong here. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Tom, > > shared_buffers = 131072 > > Yipes! Try about a tenth that much. Or less. Why? He has 4GB RAM on the machine. -Josh Berkus
Hi,

I have a challenging (for me) SQL question:

Two tables (Note: these are fictitious, the real tables actually make
sense, so no need to re-design our table structure)

Table 1
id | name | count
------------------------
1  | foo  | 10
1  | foo  | 20
2  | bar  | 100

Table 2
id | f1 | f2 | t1ref
-----------------------
1  | 10 | 20 | 1
2  | 50 | 40 | 2

The question: I want to do the following select:

select table2.f1, table1.name from table1, table2
where table1.id = table2.id and table2.id = 2;

The problem is that I really only need the name from table2 returned once.
With this query, I get two records back. Clearly this is because of the
join that I am doing. Is there a different way to perform this join, so
that I only get back ONE record from table1 that matches?

Thanks,
-Noah
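If the duplicates come from repeated id/name rows in table1 (as in the
id = 1 rows above), one simple way to collapse them -- offered here only as
a sketch, since the real schema may call for something different -- is
SELECT DISTINCT:

    SELECT DISTINCT table2.f1, table1.name
      FROM table1, table2
     WHERE table1.id = table2.id
       AND table2.id = 2;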
"Josh Berkus" <josh@agliodbs.com> writes: >>> shared_buffers = 131072 >> >> Yipes! Try about a tenth that much. Or less. > Why? He has 4GB RAM on the machine. I think a gig of shared buffers is overkill no matter what. One reason not to crank up shared_buffers "just because you can" is that there are operations (such as CHECKPOINT) that have to scan through all the buffers, linearly. I don't *think* any of these are in performance-critical paths, but nonetheless you're wasting CPU. I trust the kernel to manage a huge number of buffers efficiently more than I trust Postgres. There's another issue, which is somewhat platform-dependent; I'm not sure if it applies to whatever OS Roman is using. But if you have a machine where SysV shared memory is not locked into RAM, then a huge shared buffer arena creates the probability that some of it will be touched seldom enough that the kernel will decide to swap it out. When that happens, you *really* pay through the nose --- a page that you might have been able to get from kernel cache without incurring I/O will now certainly cost you I/O to touch. It's even worse if the buffer contained a dirty page at the time it was swapped out --- now that page is going to require being read back in and written out again, a net cost of three I/Os where there should have been one. Bottom line is that shared_buffers should be kept small enough that the space all looks like a hot spot to the kernel's memory allocation manager. In short, I believe in keeping shared_buffers relatively small --- one to ten thousand seems like the right ballpark --- and leaving the kernel to allocate the rest of RAM as kernel disk cache. I have been thinking recently about proposing that we change the factory default shared_buffers to 1000, which if this line of reasoning is correct would eliminate the need for average installations to tune it. The reason the default is 64 is that on some older Unixen, the default SHMMAX is only one meg --- but it's been a long time since our default shared memory request was less than a meg anyway, because of bloat in other components of shared memory. It's probably time to change the default to something more reasonable from a performance standpoint, and put some burden on users of older Unixen to either reduce the setting or fix their SHMMAX parameter. regards, tom lane
Tom Lane wrote: > "Josh Berkus" <josh@agliodbs.com> writes: > >>> shared_buffers = 131072 > >> > >> Yipes! Try about a tenth that much. Or less. > > > Why? He has 4GB RAM on the machine. > > I think a gig of shared buffers is overkill no matter what. > > One reason not to crank up shared_buffers "just because you can" is that > there are operations (such as CHECKPOINT) that have to scan through all > the buffers, linearly. I don't *think* any of these are in > performance-critical paths, but nonetheless you're wasting CPU. I trust > the kernel to manage a huge number of buffers efficiently more than I > trust Postgres. > > There's another issue, which is somewhat platform-dependent; I'm not > sure if it applies to whatever OS Roman is using. But if you have a > machine where SysV shared memory is not locked into RAM, then a huge > shared buffer arena creates the probability that some of it will be > touched seldom enough that the kernel will decide to swap it out. When > that happens, you *really* pay through the nose --- a page that you > might have been able to get from kernel cache without incurring I/O will > now certainly cost you I/O to touch. It's even worse if the buffer > contained a dirty page at the time it was swapped out --- now that page > is going to require being read back in and written out again, a net cost > of three I/Os where there should have been one. Bottom line is that > shared_buffers should be kept small enough that the space all looks like > a hot spot to the kernel's memory allocation manager. Just as a data point, I believe other database systems recommend very large shared memory areas if a lot of data is being accessed. I seem to remember Informix doing that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Fri, Jan 17, 2003 at 12:33:11PM -0500, Tom Lane wrote:
> One reason not to crank up shared_buffers "just because you can" is that
> there are operations (such as CHECKPOINT) that have to scan through all
> the buffers, linearly. I don't *think* any of these are in
> performance-critical paths, but nonetheless you're wasting CPU. I trust
> the kernel to manage a huge number of buffers efficiently more than I
> trust Postgres.

For what it's worth, we have exactly that experience on our Sun E4500s. I
had machines with 12 gig I was testing on, and I increased the buffers to
2 Gig, because truss was showing us that some sluggishness in the system
was tripping on the system call to get a page. It was satisfied right away
by the kernel's cache, but the system call was still the most expensive
part of the operation.

After we'd increased the shared buffers, however, performance _degraded_
considerably. It now spent all its time instead managing the huge shared
buffer, and the cost of that was much worse than the cost of the system
call. So it is extremely dependent on the efficiency of PostgreSQL's use
of shared memory as compared to the efficiency of the system call.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                        +1 416 646 3304 x110
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Just as a data point, I believe other database systems recommend very > large shared memory areas if a lot of data is being accessed. I seem to > remember Informix doing that. Yeah, but isn't that theory a hangover from pre-Unix operating systems? In all modern Unixen, you can expect the kernel to make use of any spare RAM for disk buffer cache --- and that behavior makes it pointless for Postgres to try to do large amounts of its own buffering. Having a page in our own buffer instead of kernel buffer saves a context swap to access the page, but it doesn't save I/O, so the benefit is a lot less than you might think. I think there's seriously diminishing returns in pushing shared_buffers beyond a few thousand, and once you get to the point where it distorts the kernel's ability to manage memory for processes, you're really shooting yourself in the foot. regards, tom lane
On Fri, 17 Jan 2003, Tom Lane wrote:

> Yeah, but isn't that theory a hangover from pre-Unix operating systems?
> In all modern Unixen, you can expect the kernel to make use of any spare
> RAM for disk buffer cache --- and that behavior makes it pointless for
> Postgres to try to do large amounts of its own buffering.

Informix, Oracle, etc. all do raw device access, bypassing the kernel's buffering. So they need heaping gobules of memory to do the same thing the kernel does... but since they know the exact patterns of data and how things will be done, they can fine-tune their buffer caches to get much better performance than the kernel (15-20% in Informix's case), since the kernel needs to be a "works good generally" cache.

The desire to crank that up probably stems from using those other DBs. I know I used to do that with pgsql. (Ahh, I crank that setting up through the roof on Informix, I'll do the same with pg.)

Perhaps a FAQ entry or comment in the shipped config about it? I think if people realize it isn't quite the same as what it does in Oracle/Informix/etc., they'll be less inclined to crank it.

------------------------------------------------------------------------------
Jeff Trout <jeff@jefftrout.com>                 http://www.jefftrout.com/
Ronald McDonald, with the help of cheese soup,
     controls America from a secret volkswagon hidden in the past
-------------------------------------------------------------------------------
Jeff <threshar@torgo.978.org> writes: > On Fri, 17 Jan 2003, Tom Lane wrote: >> Yeah, but isn't that theory a hangover from pre-Unix operating systems? > Informix, oracle, etc all do raw device access bypassing the kernels > buffering, etc. So they need heaping gobules of memory to do the same > thing the kernel does.. D'oh, I believe Jeff's put his finger on it. You need lotsa RAM if you are trying to bypass the OS. But Postgres would like to work with the OS, not bypass it. > but since they know the exact patterns of data and > how things will be done they can fine tune their buffer caches to get much > better performance than the kernel (15-20% in informix's case) since the > kernel needs to be a "works good generally" They go to all that work for 15-20% ??? Remind me not to follow that primrose path. I can think of lots of places where we can buy 20% for less work than implementing (and maintaining) our own raw-device access layer. > perhaps a FAQ entry or comment in the shipped config about it? > I think if people realize it isn't quite the same as what it does in > oracle/informix/etc then they'll be less inclined to cranking it. Good thought. But we do need to set the default to something a tad more realistic-for-2003 than 64 buffers ... regards, tom lane
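A rough sketch of what such a shipped-config comment might look like, using the default Tom floats above (the wording is hypothetical, not an actual patch):

# shared_buffers is NOT the equivalent of Oracle's SGA or Informix's
# buffer pool: PostgreSQL leans on the operating system's disk cache
# for most of its caching.  Values in the low thousands are usually
# plenty; very large values mostly duplicate the kernel cache and add
# buffer-management overhead.
#shared_buffers = 1000          # 8 KB each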
On Thu, 16 Jan 2003, Roman Fail wrote:

> It sort of feels like a magic moment. I went back and looked through a
> lot of the JOIN columns and found that I was mixing int4 with int8 in a
> lot of them.

There is a note about it in the docs:
http://www.postgresql.org/idocs/index.php?datatype.html#DATATYPE-INT

I don't know if this is in a FAQ anywhere, but it should be. I myself have helped a number of people with this. Every once in a while someone comes into the #postgresql IRC channel with the exact same problem. Usually they leave the channel very happy, when their queries take less than a second instead of minutes.

--
/Dennis
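For readers hitting the same thing, the symptom and the usual 7.x workaround look roughly like this (column names are borrowed from Roman's schema purely for illustration): a bare integer literal is typed int4, so the planner will not use an index on an int8 column unless the literal is cast or quoted.

-- batchdetailid declared as int8: a plain literal gets a sequential scan
SELECT * FROM batchdetail WHERE batchdetailid = 42;
-- cast or quote the literal and the index can be used
SELECT * FROM batchdetail WHERE batchdetailid = 42::int8;
SELECT * FROM batchdetail WHERE batchdetailid = '42';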
On Fri, Jan 17, 2003 at 11:49:31PM -0500, Tom Lane wrote: > Jeff <threshar@torgo.978.org> writes: > > Informix, oracle, etc all do raw device access bypassing the kernels > > buffering, etc. So they need heaping gobules of memory to do the same > > thing the kernel does.. > > D'oh, I believe Jeff's put his finger on it. You need lotsa RAM if you > are trying to bypass the OS. But Postgres would like to work with the > OS, not bypass it. One of the interesting things I have been playing with on Solaris recently is the various no-buffer settings you can give to the kernel for filesystems. The idea is that you don't have the kernel do the buffering, and you set your database's shared memory setting _reeeeal_ high. As nearly as I can tell, there is again no benefit with PostgreSQL. I'd also be amazed if this approach is a win for other systems. But a lot of DBAs seem to believe that they know better than their computers which tables are "really" accessed frequently. I think they must be smarter than I am: I'd rather trust a system that was designed to track these things and change the tuning on the fly, myself. (To be fair, there are some cases where you have an infrequently-accessed table which nevertheless is required to be fast for some reason or other, so you might want to force it to stay in memory.) A ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On 17 Jan 2003 at 12:33, Tom Lane wrote:

> "Josh Berkus" <josh@agliodbs.com> writes:
> >>> shared_buffers = 131072
> >>
> >> Yipes! Try about a tenth that much. Or less.
>
> > Why? He has 4GB RAM on the machine.
>
> I think a gig of shared buffers is overkill no matter what.
>
> One reason not to crank up shared_buffers "just because you can" is that
> there are operations (such as CHECKPOINT) that have to scan through all
> the buffers, linearly. I don't *think* any of these are in
> performance-critical paths, but nonetheless you're wasting CPU. I trust

Assuming that one knows what he/she is doing, would it help in such cases, i.e. the linear search thing, to bump up the page size to say 16K/32K?

And that is also the only way to make PostgreSQL use more than a couple of gigs of RAM, isn't it?

Bye
 Shridhar

--
Arithmetic: An obscure art no longer practiced in the world's developed countries.
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes: > On 17 Jan 2003 at 12:33, Tom Lane wrote: >> One reason not to crank up shared_buffers "just because you can" is that >> there are operations (such as CHECKPOINT) that have to scan through all >> the buffers, linearly. I don't *think* any of these are in >> performance-critical paths, but nonetheless you're wasting CPU. I trust > Assuming that one knows what he/she is doing, would it help in such cases i.e. > the linear search thing, to bump up page size to day 16K/32K? You mean increase page size and decrease the number of buffers proportionately? It'd save on buffer-management overhead, but I wouldn't assume there'd be an overall performance gain. The system would have to do more I/O per page read or written; which might be a wash for sequential scans, but I bet it would hurt for random access. > and that is also the only way to make postgresql use more than couple of gigs > of RAM, isn't it? It seems quite unrelated. The size of our shared memory segment is limited by INT_MAX --- chopping it up differently won't change that. In any case, I think worrying because you can't push shared buffers above two gigs is completely wrongheaded, for the reasons already discussed in this thread. The notion that Postgres can't use more than two gig because its shared memory is limited to that is *definitely* wrongheaded. We can exploit however much memory your kernel can manage for kernel disk cache. regards, tom lane
On 20 Jan 2003 at 2:14, Tom Lane wrote:
> "Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:
> > Assuming that one knows what he/she is doing, would it help in such cases,
> > i.e. the linear search thing, to bump up the page size to say 16K/32K?
>
> You mean increase page size and decrease the number of buffers
> proportionately? It'd save on buffer-management overhead, but
> I wouldn't assume there'd be an overall performance gain. The
> system would have to do more I/O per page read or written; which
> might be a wash for sequential scans, but I bet it would hurt for
> random access.

Right. But it has its own applications. If I am storing huge data blocks, like say gene data, I might be better off living with relatively more page fragmentation.

> > and that is also the only way to make postgresql use more than a couple of
> > gigs of RAM, isn't it?
>
> It seems quite unrelated. The size of our shared memory segment is
> limited by INT_MAX --- chopping it up differently won't change that.

Well, if my page size is doubled, I can get double the amount of shared buffer memory. That was the logic, nothing else.

> In any case, I think worrying because you can't push shared buffers
> above two gigs is completely wrongheaded, for the reasons already
> discussed in this thread. The notion that Postgres can't use more
> than two gig because its shared memory is limited to that is
> *definitely* wrongheaded. We can exploit however much memory your
> kernel can manage for kernel disk cache.

Well, I agree completely. However, there are folks and situations that demand things just because they can be done. This is just about checking out the absolute limit of what it can manage.

Bye
 Shridhar

--
Bagdikian's Observation: Trying to be a first-rate reporter on the average American newspaper is like trying to play Bach's "St. Matthew Passion" on a ukelele.
On Mon, 2003-01-20 at 01:14, Tom Lane wrote: > "Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes: > > On 17 Jan 2003 at 12:33, Tom Lane wrote: [snip] > > and that is also the only way to make postgresql use more than couple of gigs > > of RAM, isn't it? > > It seems quite unrelated. The size of our shared memory segment is > limited by INT_MAX --- chopping it up differently won't change that. > > In any case, I think worrying because you can't push shared buffers > above two gigs is completely wrongheaded, for the reasons already > discussed in this thread. The notion that Postgres can't use more > than two gig because its shared memory is limited to that is > *definitely* wrongheaded. We can exploit however much memory your > kernel can manage for kernel disk cache. http://www.redhat.com/services/techsupport/production/GSS_caveat.html "RAM Limitations on IA32 Red Hat Linux releases based on the 2.4 kernel -- including Red Hat Linux 7.1, 7.2, 7.3 and Red Hat Linux Advanced Server 2.1 -- support a maximum of 16GB of RAM." So if I have some honking "big" Compaq Xeon SMP server w/ 16GB RAM, and top(1) shows that there is 8GB of buffers, then Pg will be happy as a pig in the mud? -- +------------------------------------------------------------+ | Ron Johnson, Jr. mailto:ron.l.johnson@cox.net | | Jefferson, LA USA http://members.cox.net/ron.l.johnson | | | | "Basically, I got on the plane with a bomb. Basically, I | | tried to ignite it. Basically, yeah, I intended to damage | | the plane." | | RICHARD REID, who tried to blow up American Airlines | | Flight 63 | +------------------------------------------------------------+
Ron Johnson <ron.l.johnson@cox.net> writes: > So if I have some honking "big" Compaq Xeon SMP server w/ 16GB RAM, > and top(1) shows that there is 8GB of buffers, then Pg will be happy > as a pig in the mud? Sounds good to me ... regards, tom lane
>> It sort of feels like a magic moment. I went back and looked through a
>> lot of the JOIN columns and found that I was mixing int4 with int8 in a
>> lot of them.
>
> There is a note about it in the docs:
> http://www.postgresql.org/idocs/index.php?datatype.html#DATATYPE-INT
>
> I don't know if this is in a FAQ anywhere, but it should be. I myself have
> helped a number of people with this. Every once in a while someone comes
> into the #postgresql IRC channel with the exact same problem. Usually they
> leave the channel very happy, when their queries take less than a second
> instead of minutes.
>
> --
> /Dennis

I'm really surprised that this issue doesn't pop up all the time. As the community grows, I think it will start to. I came very, very close to dropping PostgreSQL entirely because of it. Hopefully the TODO issue on implicit type casting will move closer to the top of the hackers list. But I'm just a beggar, so I won't pretend to be a chooser.

Back to my original problems: I re-created everything from scratch and made sure there are no int8's in my entire database. I found a few more places where I could create useful indexes as well. I didn't get to test it over the weekend, but today I played with it for several hours and could not get the queries to perform much better than last week. I was about ready to give up, throw Postgres in the junk pile, and get out the MSSQL CD.

Luckily, an unrelated post on one of the lists mentioned something about ANALYZE, and I realized that I had forgotten to run it after all the new data was imported (although I did remember a VACUUM FULL). After running ANALYZE, I started getting amazing results.....like a query that took 20 minutes last week was taking only 6 milliseconds now. That kicks the MSSQL server's ass all over the map (as I had originally expected it would!!!).

So things are working pretty good now....and it looks like the whole problem was the data type mismatch issue. I hate to point fingers, but the pgAdminII Migration Wizard forces all your primary keys to be int8 even if you set the Type Map to int4. The second time through I recognized this and did a pg_dump so I could switch everything to int4. Now I'm going to write some minor mods in my Java programs for PGSQL-syntax compatibility, and will hopefully have the PostgreSQL server in production shortly.

THANK YOU to everyone on pgsql-performance for all your help. You are the reason that I'll be a long term member of the Postgres community. I hope that I can assist someone else out in the future.

Roman Fail
Sr. Web Application Developer
POS Portal, Inc.
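For anyone retracing these steps, the post-load maintenance Roman describes boils down to something like this (a sketch only; ANALYZE is the step that actually refreshes the planner's row estimates, and VACUUM ANALYZE does both jobs in one pass):

-- after the nightly bulk load:
VACUUM ANALYZE batchdetail;
VACUUM ANALYZE batchheader;
-- or simply refresh statistics for every table in the database:
ANALYZE;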
Roman,

> I'm really surprised that this issue doesn't pop up all the time. As
> the community grows, I think it will start to.

Actually, in the general sense of intelligent casting, the issue *does* come up all the time. Unfortunately, this is one of those issues that requires both an inspired solution and program-wide overhaul work to fix. In fact, in the FTP archives you can find an alternate version of Postgres (7.1 I think) where someone tried to fix the "stupid casting" issue and succeeded in making Postgres crash and burn instead.

> Luckily, an unrelated post on one of the lists mentioned something
> about ANALYZE, and I realized that I had forgotten to run it after
> all the new data was imported (although I did remember a VACUUM
> FULL). After running ANALYZE, I started getting amazing
> results.....like a query that took 20 minutes last week was taking
> only 6 milliseconds now. That kicks the MSSQL server's ass all over
> the map (as I had originally expected it would!!!).

That's great!

> So things are working pretty good now....and it looks like the whole
> problem was the data type mismatch issue. I hate to point fingers,
> but the pgAdminII Migration Wizard forces all your primary keys to be
> int8 even if you set the Type Map to int4.

So? Send Dave Page (address at pgadmin.postgresql.org) a quick note documenting the problem. I'm sure he'll patch it, or at least fix it for pgAdmin III.

> THANK YOU to everyone on pgsql-performance for all your help. You
> are the reason that I'll be a long term member of the Postgres
> community. I hope that I can assist someone else out in the future.

You're welcome! If you can get your boss to authorize it, the Advocacy page (advocacy.postgresql.org) could use some more business testimonials.

-Josh Berkus
> Jochem van Dieten wrote:
> Just out of curiosity and for archiving purposes, could you post the new
> EXPLAIN ANALYZE output to the list?

To reiterate, the batchdetail table is 24 million rows, batchheader is 2.7 million, and purc1 is 1 million. The rest are 2000 rows or less. I think having the 6-disk RAID-10 devoted to /usr/local/pgsql/data helps out a little here. I did try changing the WHERE clauses to radically different values and it was still just as fast. This is the original query I was working with (plus suggested modifications from the list):

EXPLAIN ANALYZE
SELECT ss.batchdate, ss.batchdetailid, ss.bankno, ss.trandate, ss.tranamount,
       ss.submitinterchange, ss.authamount, ss.authno, ss.cardtypeid, ss.mcccode,
       ss.name AS merchantname, ss.cardtype, ss.merchid,
       p1.localtaxamount, p1.productidentifier, dr.avsresponse,
       cr.checkoutdate, cr.noshowindicator,
       ck.checkingacctno, ck.abaroutingno, ck.checkno
FROM (SELECT b.batchdate, d.batchdetailid, t.bankno, d.trandate, d.tranamount,
             d.submitinterchange, d.authamount, d.authno, d.cardtypeid, d.mcccode,
             m.name, c.cardtype, m.merchid
      FROM tranheader t, batchheader b, merchants m, cardtype c, batchdetail d
      WHERE t.tranheaderid = b.tranheaderid
        AND m.merchantid = b.merchantid
        AND d.batchid = b.batchid
        AND c.cardtypeid = d.cardtypeid
        AND t.clientid = 6
        AND d.tranamount BETWEEN 500.0 AND 700.0
        AND b.batchdate > '2002-12-15'
        AND m.merchid = '701252267') ss
LEFT JOIN purc1 p1 ON p1.batchdetailid = ss.batchdetailid
LEFT JOIN direct dr ON dr.batchdetailid = ss.batchdetailid
LEFT JOIN carrental cr ON cr.batchdetailid = ss.batchdetailid
LEFT JOIN checks ck ON ck.batchdetailid = ss.batchdetailid
ORDER BY ss.batchdate DESC
LIMIT 50

Limit  (cost=1351.93..1351.93 rows=1 width=261) (actual time=5.34..5.36 rows=8 loops=1)
  ->  Sort  (cost=1351.93..1351.93 rows=1 width=261) (actual time=5.33..5.34 rows=8 loops=1)
        Sort Key: b.batchdate
        ->  Nested Loop  (cost=0.01..1351.92 rows=1 width=261) (actual time=1.61..5.24 rows=8 loops=1)
              ->  Hash Join  (cost=0.01..1346.99 rows=1 width=223) (actual time=1.58..5.06 rows=8 loops=1)
                    Hash Cond: ("outer".batchdetailid = "inner".batchdetailid)
                    ->  Hash Join  (cost=0.00..1346.98 rows=1 width=210) (actual time=1.21..4.58 rows=8 loops=1)
                          Hash Cond: ("outer".batchdetailid = "inner".batchdetailid)
                          ->  Nested Loop  (cost=0.00..1346.97 rows=1 width=201) (actual time=0.82..4.05 rows=8 loops=1)
                                ->  Nested Loop  (cost=0.00..1343.84 rows=1 width=182) (actual time=0.78..3.82 rows=8 loops=1)
                                      Join Filter: ("inner".cardtypeid = "outer".cardtypeid)
                                      ->  Nested Loop  (cost=0.00..1342.62 rows=1 width=172) (actual time=0.74..3.35 rows=8 loops=1)
                                            ->  Nested Loop  (cost=0.00..539.32 rows=4 width=106) (actual time=0.17..1.61 rows=26 loops=1)
                                                  ->  Nested Loop  (cost=0.00..515.48 rows=5 width=94) (actual time=0.13..1.01 rows=26 loops=1)
                                                        ->  Index Scan using merchants_ix_merchid_idx on merchants m  (cost=0.00..5.65 rows=1 width=78) (actual time=0.07..0.08 rows=1 loops=1)
                                                              Index Cond: (merchid = '701252267'::character varying)
                                                        ->  Index Scan using batchheader_ix_merchantid_idx on batchheader b  (cost=0.00..508.56 rows=20 width=16) (actual time=0.04..0.81 rows=26 loops=1)
                                                              Index Cond: ("outer".merchantid = b.merchantid)
                                                              Filter: (batchdate > '2002-12-15'::date)
                                                  ->  Index Scan using tranheader_pkey on tranheader t  (cost=0.00..5.08 rows=1 width=12) (actual time=0.01..0.01 rows=1 loops=26)
                                                        Index Cond: (t.tranheaderid = "outer".tranheaderid)
                                                        Filter: (clientid = 6)
                                            ->  Index Scan using batchdetail_ix_batchid_idx on batchdetail d  (cost=0.00..186.81 rows=2 width=66) (actual time=0.05..0.06 rows=0 loops=26)
                                                  Index Cond: (d.batchid = "outer".batchid)
                                                  Filter: ((tranamount >= 500.0) AND (tranamount <= 700.0))
                                      ->  Seq Scan on cardtype c  (cost=0.00..1.10 rows=10 width=10) (actual time=0.00..0.03 rows=10 loops=8)
                                ->  Index Scan using purc1_ix_batchdetailid_idx on purc1 p1  (cost=0.00..3.12 rows=1 width=19) (actual time=0.01..0.01 rows=0 loops=8)
                                      Index Cond: (p1.batchdetailid = "outer".batchdetailid)
                          ->  Hash  (cost=0.00..0.00 rows=1 width=9) (actual time=0.00..0.00 rows=0 loops=1)
                                ->  Seq Scan on direct dr  (cost=0.00..0.00 rows=1 width=9) (actual time=0.00..0.00 rows=0 loops=1)
                    ->  Hash  (cost=0.00..0.00 rows=1 width=13) (actual time=0.01..0.01 rows=0 loops=1)
                          ->  Seq Scan on carrental cr  (cost=0.00..0.00 rows=1 width=13) (actual time=0.00..0.00 rows=0 loops=1)
              ->  Index Scan using checks_ix_batchdetailid_idx on checks ck  (cost=0.00..4.92 rows=1 width=38) (actual time=0.01..0.01 rows=0 loops=8)
                    Index Cond: (ck.batchdetailid = "outer".batchdetailid)
Total runtime: 5.89 msec
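A side note for anyone auditing a schema for the int4/int8 mismatches discussed earlier in the thread: a catalog query along these lines (just a sketch) lists the declared type of every "...id" column, so a parent/child mismatch stands out at a glance.

SELECT c.relname, a.attname, t.typname
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
JOIN pg_type t ON t.oid = a.atttypid
WHERE c.relkind = 'r'
  AND a.attnum > 0
  AND a.attname LIKE '%id'
ORDER BY a.attname, c.relname;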
> >> Yeah, but isn't that theory a hangover from pre-Unix operating systems?
>
> > Informix, oracle, etc all do raw device access bypassing the kernels
> > buffering, etc. So they need heaping gobules of memory to do the same
> > thing the kernel does..
>
> D'oh, I believe Jeff's put his finger on it. You need lotsa RAM if you
> are trying to bypass the OS. But Postgres would like to work with the
> OS, not bypass it.
>
> > but since they know the exact patterns of data and
> > how things will be done they can fine tune their buffer caches to get much
> > better performance than the kernel (15-20% in informix's case) since the
> > kernel needs to be a "works good generally"
>
> They go to all that work for 15-20% ??? Remind me not to follow that
> primrose path. I can think of lots of places where we can buy 20% for
> less work than implementing (and maintaining) our own raw-device access
> layer.

This is related somewhat to the raw device access discussion. This is a quote from Matt Dillon (FreeBSD VM guru) on the topic of disk caches (Message-Id: <200301270657.h0R6v2qH071774@apollo.backplane.com>) and a few bits at the end:

### Begin quote

Mmmmm. Basically what it comes down to is that without foreknowledge of the data locations being accessed, it is not possible for any cache algorithm to adapt to all the myriad ways data might be accessed. If you focus the cache on one methodology it will probably perform terribly when presented with some other methodology.

What this means is that for the cases where you *KNOW* how a program intends to access a dataset larger then main memory, it is better to have the program explicitly cache/not-cache the data under program control rather then trying to force the system cache to adapt.

I'll also be nice and decode some of Terry's Jargon for the rest of the readers.

:will result in significant failure of random page replacement to
:result in cache hits; likewise, going to 85% overage will practically
:guarantee an almost 100% failure rate, as cyclical access with random
:replacement is statistically more likely, in aggregate, to replace
:the pages which are there longer (the probability is iterative and
:additive: it's effectively a permutation).

What Terry is saying is that if you have a dataset that is 2x the size of your cache, the cache hit rate on that data with random page replacement is NOT going to be 50%. This is because with random page replacement the likelihood of a piece of data being found in the cache depends on how long the data has been sitting in the cache. The longer the data has been sitting in the cache, the less likely you will find it when you need it (because it is more likely to have been replaced by the random replacement algorithm over time).

So, again, the best caching methodology to use in the case where you *know* how much data you will be accessing and how you will access it is to build the caching directly into your program and not depend on system caching at all (for datasets which are far larger then main memory). This is true of all systems, not just BSD. This is one reason why databases do their own caching (another is so databases know when an access will require I/O for scheduling reasons, but that's a different story).
The FreeBSD VM caching system does prevent one process from exhausting another process's cached data due to a sequential access, but the FreeBSD VM cache does not try to outsmart sequential data accesses to datasets which are larger then available cache space because it's an insanely difficult (impossible) problem to solve properly without foreknowledge of what data elements will be accessed when.

This isn't to say that we can't improve on what we have now. I certainly think we can. But there is no magic bullet that will deal with every situation.

-Matt

### End quote

So if there really is only a 15-20% performance gain to be had from using raw disk access, that 15-20% comes from not being able to tell the OS what to cache, what not to cache, and what order to have the pages in... which only really matters if there is RAM available to the kernel to cache, and if it is able to determine what is valuable to cache in the course of its operations. Predictive caching by the OS isn't done because it understands PostgreSQL, but because it understands a generic algorithm for page hits/misses.

What is interesting after reading this, however, is the prospect of a 15-20% speedup on certain tables that we know are accessed frequently, by implicitly specifying a set of data to be preferred in a user-space cache. It's impossible for the OS to cache the pages that make the biggest impact on user-visible performance, given that the OS has no understanding of which pages make a big difference to user-visible performance; a user-land database process, however, would.

As things stand, it's entirely possible for a set of large queries to come through and wipe the kernel's cache that smaller queries were using. Once the cache misses, the kernel has to fetch the data again, which could slow down the overall number of transactions per second. That said, this is something that an in-database scheduler could avoid by placing a lower priority on larger, more complex queries, the assumption being that having the smaller queries continue to process and get in/out is more important than shaving a few seconds off of a larger query that would deplete the cache used by the smaller queries. Oh, to be a DBA and be able to make those decisions instead of the kernel...

Hrm, so two ideas or questions come to mind:

1) On some of my really large read-only queries, it would be SUUUPER nice to be able to re-nice the process from SQL land to 5, 10, or even 20. IIRC, BSD's VM system is smart enough to prevent lower-priority jobs from monopolizing the disk cache, which would let the smaller, faster-turnaround queries keep their data in the kernel's disk cache. (Some kind of query complexity threshold that results in a reduction of priority, or an explicit directive to run at a lower priority.)

2) Is there any way of specifying that a particular set of tables should be kept in RAM or some kind of write-through cache? I know data is selected into a backend out of the catalogs, but would it be possible to have them kept in memory and only re-read on change, with some kind of semaphore? Now that all system tables are in their own schemas (pg_catalog and pg_toast), would it be hard to set a flag on a change to those tables that would cause the postmaster, or children, to re-read them instead of relying on their cache? With copy-on-write forking, this could be pretty efficient if the postmaster did this and forked off a copy with the tables already in memory instead of on disk.
Just a few ideas/ramblings, hope someone finds them interesting... the renice function is one that I think I'll spend some time looking into here shortly actually. -sc -- Sean Chittenden
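On the renice idea: 7.3 has no SQL-level knob for it, but the manual equivalent is easy enough to sketch (this assumes stats_command_string is enabled so pg_stat_activity shows the running query, and shell access as the postgres user; whether the kernel actually rewards a niced backend with better cache treatment is OS-dependent):

-- find the backend running the big read-only query
SELECT procpid, usename, current_query FROM pg_stat_activity;
-- then, from a shell on the server, lower that backend's priority:
--   renice +10 -p <procpid>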
On Mon, 27 Jan 2003, Sean Chittenden wrote: > The FreeBSD VM caching system does prevent one process from exhausting > another process's cached data due to a sequential access, but the > FreeBSD VM cache does not try to outsmart sequential data accesses to > datasets which are larger then available cache space because it's an > insanely difficult (impossible) problem to solve properly without > foreknowledge of what data elements will be accessed when. This is not impossible; Solaris does just this. I'm a little short of time right now, but I can probably dig up the paper on google if nobody else finds it. Also, it is not hard to give the OS foreknowledge of your access pattern, if you use mmap. Just call madvise and use the MADV_RANDOM, MADV_SEQUENTIAL, MADV_WILLNEED and MADV_DONTNEED flags. (This is one of the reasons I think we might see a performance improvement from switching from regular I/O to mmap I/O.) You can go back through the archives and see a much fuller discussion of all of this. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
On Mon, 2003-01-27 at 04:34, Curt Sampson wrote:
> On Mon, 27 Jan 2003, Sean Chittenden wrote:
>
> > The FreeBSD VM caching system does prevent one process from exhausting
> > another process's cached data due to a sequential access, but the
> > FreeBSD VM cache does not try to outsmart sequential data accesses to
> > datasets which are larger then available cache space because it's an
> > insanely difficult (impossible) problem to solve properly without
> > foreknowledge of what data elements will be accessed when.
>
> This is not impossible; Solaris does just this. I'm a little short of

Quite. One way to do it is:
- the OS notices that process X has been sequentially reading thru file Y for, say, 3 seconds.
- the OS knows that X is currently at the mid-point of file Y
- the OS says, "Hey, I think I'll be a bit more aggressive about, when I have a bit of time, trying to read Y faster than X is requesting it."

It wouldn't work well, though, in a client-server DB like Postgres, which, in a busy multi-user system, is constantly hitting different parts of different files.

The algorithm, though, is used in the RDBMS Rdb. It uses the algorithm above, substituting "client X" for "process X", and passes the aggressive reads of Y on to the OS. It's a big win when processing a complete table, like during a CREATE INDEX, or "SELECT foo, COUNT(*)" where there's no index on foo.

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net           |
| Jefferson, LA USA       http://members.cox.net/ron.l.johnson   |
|                                                                |
| "Fear the Penguin!!"                                           |
+---------------------------------------------------------------+
Detecting sequential scan and increasing read-ahead is a standard OS capability, and most/all do that already. Solaris has code to detect when a sequential scan is wiping the cache and adjusting the buffer frees, called "free-behind."

---------------------------------------------------------------------------

Ron Johnson wrote:
> On Mon, 2003-01-27 at 04:34, Curt Sampson wrote:
> > This is not impossible; Solaris does just this. I'm a little short of
>
> Quite. One way to do it is:
> [snip]
> It's a big win when processing a complete table, like during a CREATE
> INDEX, or "SELECT foo, COUNT(*)" where there's no index on foo.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Mon, 2003-01-27 at 15:08, Bruce Momjian wrote:
> Detecting sequential scan and increasing read-ahead is a standard OS
> capability, and most/all do that already. Solaris has code to detect
> when a sequential scan is wiping the cache and adjusting the buffer
> frees, called "free-behind."

Ah, didn't know that.

[snip]

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net           |
| Jefferson, LA USA       http://members.cox.net/ron.l.johnson   |
|                                                                |
| "Fear the Penguin!!"                                           |
+---------------------------------------------------------------+
On Tue, 27 Jan 2003, Ron Johnson wrote: > [read-ahead detection stuff deleted] > > It wouldn't work well, though, in a client-server DB like Postgres, > which, in a busy multi-user system, is constantly hitting different > parts of different files. It works great. You just do it on a file-descriptor by file-descriptor basis. Unfortunately, I don't know of any OSes that detect backwards scans. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC