Re: Speeding up query, Joining 55mil and 43mil records. - Mailing list pgsql-performance
From | nicky |
---|---|
Subject | Re: Speeding up query, Joining 55mil and 43mil records. |
Date | |
Msg-id | 449A677D.4020504@valuecare.nl Whole thread Raw |
In response to | Re: Speeding up query, Joining 55mil and 43mil records. ("Jim C. Nasby" <jnasby@pervasive.com>) |
Responses |
Re: Speeding up query, Joining 55mil and 43mil records.
|
List | pgsql-performance |
Hello again, thanks for all the quick replies. It seems i wasn't entirely correct on my previous post, i've mixed up some times/numbers. Below the correct numbers MSSQL: SELECT COUNT(*) from JOIN (without insert) 17 minutes PostgreSQL: SELECT COUNT(*) from JOIN (without insert) 33 minutes PostgreSQL: complete query 55 minutes The part i'm really troubled with is the difference in performance for the select part. Which takes twice as long on PostgreSQL even though it has a better server then MSSQL. Changed i've made to postgressql.conf work_mem = 524288 (1GB, results in out of memory error) checkpoints_segments = 256 checkpoints_timeout = 3600 checkpoints_warning = 0 I've ran the complete 'explain analyse query' twice. First with pgsql_tmp on the same disk, then again with pgsql_tmp on a seperate disk. **** (PostgreSQL) (*pgsql_tmp on same disk*): Hash Join (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual time=327982.425..1903423.769 rows=7551616 loops=1) Hash Cond: (("outer".id)::text = ("inner".id)::text) -> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90 rows=40902852 width=52) (actual time=8.935..613455.204 rows=37368390 loops=1) Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND (substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND ((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL))) -> Hash (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual time=327819.698..327819.698 rows=8761024 loops=1) -> Bitmap Heap Scan on src_faktuur_verricht t1 (cost=62392.02..1188102.97 rows=8942863 width=80) (actual time=75911.336..295510.647 rows=8761024 loops=1) Recheck Cond: (date_part('year'::text, datum) > 2004::double precision) -> Bitmap Index Scan on src_faktuur_verricht_idx1 (cost=0.00..62392.02 rows=8942863 width=0) (actual time=75082.080..75082.080 rows=8761024 loops=1) Index Cond: (date_part('year'::text, datum) > 2004::double precision) Total runtime: 3355696.015 ms **** (PostgreSQL) (*pgsql_tmp on seperate disk*) Hash Join (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual time=172797.736..919869.708 rows=7551616 loops=1) Hash Cond: (("outer".id)::text = ("inner".id)::text) -> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90 rows=40902852 width=52) (actual time=0.015..362154.822 rows=37368390 loops=1) Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND (substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND ((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL))) -> Hash (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual time=172759.255..172759.255 rows=8761024 loops=1) -> Bitmap Heap Scan on src_faktuur_verricht t1 (cost=62392.02..1188102.97 rows=8942863 width=80) (actual time=4244.840..142144.606 rows=8761024 loops=1) Recheck Cond: (date_part('year'::text, datum) > 2004::double precision) -> Bitmap Index Scan on src_faktuur_verricht_idx1 (cost=0.00..62392.02 rows=8942863 width=0) (actual time=3431.361..3431.361 rows=8761024 loops=1) Index Cond: (date_part('year'::text, datum) > 2004::double precision) Total runtime: 2608316.714 ms A lot of difference in performance. 55 minutes to 42 minutes. I've ran the 'select count(*) from JOIN' to see the difference on that part. **** (PostgreSQL) Explain analyse from SELECT COUNT(*) from the JOIN. (*pgsql_tmp on seperate disk*) Aggregate (cost=5632244.93..5632244.94 rows=1 width=0) (actual time=631993.425..631993.427 rows=1 loops=1) -> Hash Join (cost=1258493.12..5614251.00 rows=7197568 width=0) (actual time=237999.277..620018.706 rows=7551616 loops=1) Hash Cond: (("outer".id)::text = ("inner".id)::text) -> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90 rows=40902852 width=14) (actual time=23.449..200532.422 rows=37368390 loops=1) Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND (substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND ((substr((correctie)::text, 4, 1) <> '1'::tex (..) -> Hash (cost=1188102.97..1188102.97 rows=8942863 width=14) (actual time=237939.262..237939.262 rows=8761024 loops=1) -> Bitmap Heap Scan on src_faktuur_verricht t1 (cost=62392.02..1188102.97 rows=8942863 width=14) (actual time=74713.092..216206.478 rows=8761024 loops=1) Recheck Cond: (date_part('year'::text, datum) > 2004::double precision) -> Bitmap Index Scan on src_faktuur_verricht_idx1 (cost=0.00..62392.02 rows=8942863 width=0) (actual time=73892.153..73892.153 rows=8761024 loops=1) Index Cond: (date_part('year'::text, datum) > 2004::double precision) Total runtime: 631994.172 ms A lot of improvement also in the select count: 33 minutes vs 10 minutes. To us, the speeds are good. Very happy with the performance increase on that select with join, since 90% of the queries are SELECT based. The query results in 7551616 records, so that's about 4500 inserts per second. I'm not sure if that is fast or not. Any further tips would be welcome. Thanks everyone. Nicky
pgsql-performance by date: