Re: Hash join on int takes 8..114 seconds - Mailing list pgsql-performance
From | Andrus |
---|---|
Subject | Re: Hash join on int takes 8..114 seconds |
Date | |
Msg-id | EDAB641B4D004D1BB5126A75CFBDD79F@andrusnotebook |
In response to | Re: Hash join on int takes 8..114 seconds (tv@fuzzy.cz) |
Responses | Re: Hash join on int takes 8..114 seconds |
List | pgsql-performance |
Thomas,

Thank you.

> Just the most important points:
>
> 1) "dok" table contains 1235086 row versions in 171641 pages (with 8kB
> pages this means 1.4 GB of data), but there are 1834279 unused item
> pointers (i.e. about 60% of the space is wasted)
>
> 2) "rid" table contains 3275189 rows in 165282 pages (with 8kB pages this
> means about 1.3 GB of data), but there are 1878923 unused item pointers
> (i.e. about 30% of the space is wasted)
>
> 3) don't forget to execute analyze after vacuuming (or vacuum analyze)

autovacuum is running.

So if I understand properly, I must run

VACUUM FULL ANALYZE dok;
VACUUM FULL ANALYZE rid;

Those commands will probably cause the server to stop responding to other
clients, like vacuum full pg_shdepend did.

Should vacuum_cost_delay = 2000 allow other users to work while those
commands are running?

> 4) I'm not sure why the sizes reported by you (for example 2.3GB vs 1.5GB
> for "doc" table) - the difference seems too large for me.

I used pg_total_relation_size(), so the 2.3 GB includes the indexes as well:

 8  44286     dok_tasudok_idx            245 MB
10  44283     dok_klient_idx             142 MB
18  44288     dok_tasumata_idx            91 MB
19  44289     dok_tellimus_idx            89 MB
20  44284     dok_krdokumnr_idx           89 MB
21  44285     dok_kuupaev_idx             84 MB
22  43531     makse_pkey                  77 MB
23  43479     dok_pkey                    74 MB
24  44282     dok_dokumnr_idx             74 MB
26  18663923  dok_yksus_pattern_idx       43 MB
27  18801591  dok_sihtyksus_pattern_idx   42 MB

> Anyway the amount of wasted rows seems significant to me - I'd try to
> solve this first. Either by VACUUM FULL or by CLUSTER. The CLUSTER will
> lock the table exclusively, but the results may be better (when sorting by
> a well chosen index). Don't forget to run ANALYZE afterwards.

How should I invoke those commands so that other clients can continue
working? I'm using 8.1.4. The log files show that autovacuum is running.

I'm planning the following solution:

1. Set vacuum_cost_delay = 2000
2. Run the following commands periodically, in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those commands required, or can something be left out?

> Several other things to consider:
>
> 1) Regarding the toode column - why are you using CHAR(20) when the values
> are actually shorter? This may significantly increase the amount of space
> required.

There may be some products whose codes are up to 20 characters long.
PostgreSQL does not store trailing spaces in the database, so this does
*not* affect the space used.

> 2) I've noticed the CPU used is Celeron, which may negatively affect the
> speed of hash computation. I'd try to replace it by something faster - say
> INTEGER as an artificial primary key of the "toode" table and using it as
> a FK in other tables. This might improve the "Bitmap Heap Scan on rid"
> part, but yes - it's just a minor improvement compared to the "Hash Join"
> part of the query.

The natural key Toode CHAR(20) is used widely in different queries.
Replacing it with an INT surrogate key would require a major application
rewrite.

Should I add surrogate INT key columns to the toode and rid tables and
measure the test query speed in that case?
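What I have in mind is roughly the following; the column, sequence and
index names here are only examples, not objects that exist yet:

-- add a surrogate integer key to the product table
ALTER TABLE toode ADD COLUMN toode_id integer;
CREATE SEQUENCE toode_toode_id_seq;
UPDATE toode SET toode_id = nextval('toode_toode_id_seq');
ALTER TABLE toode ADD CONSTRAINT toode_toode_id_key UNIQUE (toode_id);

-- add the matching column to rid and fill it from toode
ALTER TABLE rid ADD COLUMN toode_id integer;
UPDATE rid SET toode_id = t.toode_id FROM toode t WHERE rid.toode = t.toode;

-- index the new column so the join can use the integer key
CREATE INDEX rid_toode_id_idx ON rid (toode_id);

ANALYZE toode;
ANALYZE rid;

The test query would then join rid to toode on toode_id instead of on the
CHAR(20) column. The UPDATE on rid touches every row, so it would have to
run during a quiet period and be followed by a vacuum.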
> Materialized views seem like a good idea to me, but maybe I'm not seeing
> something. What do you mean by "reports are different"? If there is a lot
> of rows for a given product / day, then creating an aggregated table with
> (product code / day) as a primary key is quite simple. It may require a
> lot of disk space, but it'll remove the hash join overhead. But if the
> queries are very different, then it may be difficult to build such
> materialized view(s).

The log file shows that mostly only queries of this form are slow:

SELECT ...
FROM dok
JOIN rid USING (dokumnr)
JOIN ProductId USING (ProductId)
WHERE rid.ProductId LIKE :p1 || '%'
  AND dok.SaleDate >= :p2

:p1 and :p2 are parameters that differ from query to query. dok contains
several years of data; :p2 usually covers only the last few months, or at
most the last year. The SELECT column list is a fixed list of known columns
from all the tables.

How can I create an index or a materialized view to optimize this type of
query?

Andrus.
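P.S. If I understand the aggregated-table idea correctly, it would be
something like the following sketch. The kogus (quantity) column is only an
illustrative guess at what would be aggregated, and since 8.1 has no
built-in materialized views the table would have to be refreshed
periodically, for example from a nightly cron job:

-- aggregate per product code and day (illustrative names and columns)
CREATE TABLE product_day_summary AS
SELECT r.toode,
       d.kuupaev,
       count(*)     AS row_count,
       sum(r.kogus) AS total_kogus   -- hypothetical quantity column in rid
  FROM dok d
  JOIN rid r USING (dokumnr)
 GROUP BY r.toode, d.kuupaev;

-- (product code, day) as the primary key, as suggested above
ALTER TABLE product_day_summary ADD PRIMARY KEY (toode, kuupaev);

ANALYZE product_day_summary;

Reports that filter on a product code prefix and a sale date range could
then read this much smaller table instead of joining dok and rid directly.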