Re: Very big insert/join performance problem (bacula) - Mailing list pgsql-performance
From | Marc Cousin
Subject | Re: Very big insert/join performance problem (bacula)
Date |
Msg-id | 200907151337.01759.cousinmarc@gmail.com
In response to | Re: Very big insert/join performance problem (bacula) (Richard Huxton <dev@archonet.com>)
Responses | Re: Very big insert/join performance problem (bacula); Re: Very big insert/join performance problem (bacula)
List | pgsql-performance
This mail contains the requested plans. I've done them with the different configurations, as I had gone to the effort of setting up the whole thing :) Stats were updated between all runs. Each time shown is the first run of the query (that's what we have in production with bacula). And I added the executor stats, in case ...

By the way, I think I must mention it: the whole thing runs over DRBD, but with 2 gigabyte links between the master and the slave. And I tried deactivating replication when things got really slow (despooling in 24 hours); it changed nothing (sorts were a bit faster, around 20%). The server is 12 GB RAM, 1 quad core Xeon E5335.

PostgreSQL starts to hash filename a bit later than what I said in the first mail, because it has become bigger (it was around 30-40 million last time I did the tests).

This is the query (temp_mc is the table I've created to do my tests ...):

explain ANALYZE
SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId, batch.LStat, batch.MD5
FROM temp_mc AS batch
JOIN Path ON (batch.Path = Path.Path)
JOIN Filename ON (batch.Name = Filename.Name);

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 1: around 1 million records to insert, seq_page_cost 1, random_page_cost 4

LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
! 380.143452 elapsed 79.000938 user 44.386774 system sec
! [415.785985 user 155.733732 sys total]
! 15848728/12934936 [24352752/50913184] filesystem blocks in/out
! 0/44188 [86/987512] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 93812/40706 [405069/184511] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 877336 read, 0 written, buffer hit rate = 6.75%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

Hash Join  (cost=3923929.71..5131377.91 rows=1286440 width=91) (actual time=234021.194..380018.709 rows=1286440 loops=1)
  Hash Cond: (batch.name = filename.name)
  ->  Hash Join  (cost=880140.87..1286265.62 rows=1286440 width=102) (actual time=23184.959..102400.782 rows=1286440 loops=1)
        Hash Cond: (batch.path = path.path)
        ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189) (actual time=0.007..342.396 rows=1286440 loops=1)
        ->  Hash  (cost=425486.72..425486.72 rows=16746972 width=92) (actual time=23184.196..23184.196 rows=16732049 loops=1)
              ->  Seq Scan on path  (cost=0.00..425486.72 rows=16746972 width=92) (actual time=0.004..7318.850 rows=16732049 loops=1)
  ->  Hash  (cost=1436976.15..1436976.15 rows=79104615 width=35) (actual time=210831.840..210831.840 rows=79094418 loops=1)
        ->  Seq Scan on filename  (cost=0.00..1436976.15 rows=79104615 width=35) (actual time=46.324..148887.662 rows=79094418 loops=1)
Total runtime: 380136.601 ms
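For reference, the Plan 2 run below changes nothing but the planner cost constants; at the session level it amounts to something like this (a sketch, not necessarily the exact statements used):

    -- lower the page cost constants so index access looks cheap to the planner
    SET seq_page_cost = 0.01;
    SET random_page_cost = 0.02;
    -- then re-run the same EXPLAIN ANALYZE as above
    explain ANALYZE
    SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId, batch.LStat, batch.MD5
    FROM temp_mc AS batch
    JOIN Path ON (batch.Path = Path.Path)
    JOIN Filename ON (batch.Name = Filename.Name);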
++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 2: the same insert, with seq_page_cost to 0.01 and random_page_cost to 0.02

DETAIL:  ! system usage stats:
! 42.378039 elapsed 28.277767 user 12.192762 system sec
! [471.865489 user 180.499280 sys total]
! 0/4072368 [24792848/59059032] filesystem blocks in/out
! 0/0 [86/989858] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 1061/9131 [429738/200320] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 251574 read, 0 written, buffer hit rate = 96.27%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written
LOG:  duration: 42378.373 ms  statement:

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=381840.21..1012047.92 rows=1286440 width=91) (actual time=20284.387..42242.955 rows=1286440 loops=1)
  Hash Cond: (batch.path = path.path)
  ->  Nested Loop  (cost=0.00..583683.91 rows=1286440 width=178) (actual time=0.026..10333.636 rows=1286440 loops=1)
        ->  Seq Scan on temp_mc batch  (cost=0.00..13231.26 rows=1286440 width=189) (actual time=0.008..380.361 rows=1286440 loops=1)
        ->  Index Scan using filename_name_idx on filename  (cost=0.00..0.43 rows=1 width=35) (actual time=0.006..0.007 rows=1 loops=1286440)
              Index Cond: (filename.name = batch.name)
  ->  Hash  (cost=170049.89..170049.89 rows=16746972 width=92) (actual time=20280.729..20280.729 rows=16732049 loops=1)
        ->  Seq Scan on path  (cost=0.00..170049.89 rows=16746972 width=92) (actual time=0.005..4560.872 rows=16732049 loops=1)
Total runtime: 42371.362 ms

The thing is that this query is ten times faster, but that's not the main point: this query stays reasonably fast even when there are 20 of it running simultaneously. Of course, as it's faster, it also has less of a tendency to pile up than the other one does.

When I get 10-20 of the first one running at the same time, the queries get extremely slow (I guess they are fighting for access to the sort disk, because I see a lot of smaller IOs instead of the big and nice IOs I see when only one of these queries runs). The IO subsystem seems to degrade very much when there is a lot of concurrent activity on this server. For instance, last weekend, we had two 8 million simultaneous backups, with the hash join plan. It took 24 hours for them to complete. If they had been alone on the server, it would have taken around 1 hour for each of them.

Of course, with these smaller cost values, there is still a batch size at which the plan goes back to the first one.

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 3: seq_page_cost to 1, random_page_cost to 2. The plan is the same as Plan 1.

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)
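The next run (Plan 4 below) goes back to the default cost constants and raises work_mem instead; per session that would be something like this (again just a sketch):

    -- default page costs again, but a much larger per-sort/per-hash memory budget
    SET seq_page_cost = 1;
    SET random_page_cost = 4;
    SET work_mem = '512MB';
    explain
    SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId, batch.LStat, batch.MD5
    FROM temp_mc AS batch
    JOIN Path ON (batch.Path = Path.Path)
    JOIN Filename ON (batch.Name = Filename.Name);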
++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 4: seq_page_cost to 1, random_page_cost back to 4, work_mem raised to 512MB. Same plan as Plan 1, and the estimated cost hasn't changed. Is this normal?

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)

Maybe this one would scale a bit better, as there would be fewer sort files? I couldn't execute it and get reliable times (sorry, the production period has started). If necessary, I can run it again tomorrow. I had to cancel the query after more than 15 minutes, to let the server do its regular work.

There are other things I am thinking of: maybe it would be better to have the sort space on another (and not DRBD'ed) raid set? We have a quite cheap setup for the database right now, and I think this might help it scale better. I can get a filesystem in another volume group, which is not used that much for now.
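If that turns out to be worth trying, it should mostly be a matter of creating a tablespace on that other volume group and pointing temp_tablespaces at it (assuming a version recent enough to have temp_tablespaces, i.e. 8.3 or later); roughly, with a made-up mount point:

    -- hypothetical directory on the non-DRBD volume group
    CREATE TABLESPACE tmp_sort LOCATION '/mnt/other_vg/pgsql_tmp';
    -- send sort/temp files there (can also be set per session or in postgresql.conf)
    SET temp_tablespaces = 'tmp_sort';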
Anyway, thanks for all the ideas you could have on this.

Marc.