Re: Very big insert/join performance problem (bacula) - Mailing list pgsql-performance

From Marc Cousin
Subject Re: Very big insert/join performance problem (bacula)
Msg-id 200907151337.01759.cousinmarc@gmail.com
In response to Re: Very big insert/join performance problem (bacula)  (Richard Huxton <dev@archonet.com>)
Responses Re: Very big insert/join performance problem (bacula)
Re: Very big insert/join performance problem (bacula)
List pgsql-performance
This mail contains the requested plans:

I've done them with the different configurations, since I had already made the effort of setting up the whole thing :)
Stats were updated between all runs. Each time shown is the first run of the query (that's what we have in production with bacula).
And I added the executor stats, in case they're useful...

By the way, I think I should mention it: the whole thing runs over DRBD, but with 2-gigabit links between the master and
the slave.
And I tried deactivating replication when things got really slow (despooling in 24 hours); it changed nothing (sorts were
a bit faster, around 20%). The server has 12 GB of RAM and one quad-core Xeon E5335.

PostgreSQL starts to hash filename a bit later than what I said in the first mail, because the table has grown (it was
around 30-40 million rows last time I ran the tests).

This is the query (temp_mc is the table I created for my tests):

EXPLAIN ANALYZE SELECT batch.FileIndex,
              batch.JobId,
              Path.PathId,
              Filename.FilenameId,
              batch.LStat,
              batch.MD5
  FROM temp_mc AS batch
  JOIN Path ON (batch.Path = Path.Path)
  JOIN Filename ON (batch.Name = Filename.Name);

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 1
around 1 million records to insert, seq_page_cost 1, random_page_cost 4

LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!       380.143452 elapsed 79.000938 user 44.386774 system sec
!       [415.785985 user 155.733732 sys total]
!       15848728/12934936 [24352752/50913184] filesystem blocks in/out
!       0/44188 [86/987512] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       93812/40706 [405069/184511] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:     877336 read,          0 written, buffer hit rate = 6.75%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

 Hash Join  (cost=3923929.71..5131377.91 rows=1286440 width=91) (actual time=234021.194..380018.709 rows=1286440 loops=1)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880140.87..1286265.62 rows=1286440 width=102) (actual time=23184.959..102400.782 rows=1286440 loops=1)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189) (actual time=0.007..342.396 rows=1286440 loops=1)
         ->  Hash  (cost=425486.72..425486.72 rows=16746972 width=92) (actual time=23184.196..23184.196 rows=16732049 loops=1)
               ->  Seq Scan on path  (cost=0.00..425486.72 rows=16746972 width=92) (actual time=0.004..7318.850 rows=16732049 loops=1)
   ->  Hash  (cost=1436976.15..1436976.15 rows=79104615 width=35) (actual time=210831.840..210831.840 rows=79094418 loops=1)
         ->  Seq Scan on filename  (cost=0.00..1436976.15 rows=79104615 width=35) (actual time=46.324..148887.662 rows=79094418 loops=1)
 Total runtime: 380136.601 ms
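As a rough sanity check (my own back-of-the-envelope, not something the plan reports), the hash built on filename here has to hold ~79 million rows of average width 35, which is far more than any reasonable work_mem, so the hash join necessarily spills to batches on disk. The per-tuple overhead below is an assumption on my part:

```python
# Back-of-the-envelope estimate of the filename hash size in Plan 1.
# rows and width are taken from the EXPLAIN output above; the per-tuple
# overhead is an assumed figure (tuple headers, bucket pointers, etc.).
rows = 79_104_615          # planner's row estimate for filename
width = 35                 # planner's average tuple width, in bytes
overhead = 24              # assumed per-tuple overhead, in bytes

total_bytes = rows * (width + overhead)
print(f"{total_bytes / 2**30:.1f} GiB")  # ≈ 4.3 GiB
```

Even with work_mem at 512MB, that hash is several times too big to fit, so more work_mem alone can't keep it in memory.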

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 2
the same insert, with seq_page_cost to 0.01 and random_page_cost to 0.02
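For anyone wanting to reproduce this: I set the costs per session, nothing exotic, just the standard GUCs:

```sql
-- Session-level overrides used for this run; RESET to go back to defaults.
SET seq_page_cost = 0.01;
SET random_page_cost = 0.02;
```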

DETAIL:  ! system usage stats:
!       42.378039 elapsed 28.277767 user 12.192762 system sec
!       [471.865489 user 180.499280 sys total]
!       0/4072368 [24792848/59059032] filesystem blocks in/out
!       0/0 [86/989858] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       1061/9131 [429738/200320] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:     251574 read,          0 written, buffer hit rate = 96.27%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written
LOG:  duration: 42378.373 ms  statement:
                                                                  QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=381840.21..1012047.92 rows=1286440 width=91) (actual time=20284.387..42242.955 rows=1286440 loops=1)
   Hash Cond: (batch.path = path.path)
   ->  Nested Loop  (cost=0.00..583683.91 rows=1286440 width=178) (actual time=0.026..10333.636 rows=1286440 loops=1)
         ->  Seq Scan on temp_mc batch  (cost=0.00..13231.26 rows=1286440 width=189) (actual time=0.008..380.361 rows=1286440 loops=1)
         ->  Index Scan using filename_name_idx on filename  (cost=0.00..0.43 rows=1 width=35) (actual time=0.006..0.007 rows=1 loops=1286440)
               Index Cond: (filename.name = batch.name)
   ->  Hash  (cost=170049.89..170049.89 rows=16746972 width=92) (actual time=20280.729..20280.729 rows=16732049 loops=1)
         ->  Seq Scan on path  (cost=0.00..170049.89 rows=16746972 width=92) (actual time=0.005..4560.872 rows=16732049 loops=1)
 Total runtime: 42371.362 ms


The thing is that this query is ten times faster, but that's not the main point: this query stays reasonably fast even
when there are 20 of them running simultaneously. Of course, since it's faster, it also has less tendency to pile up than
the other one does.

When I get 10-20 of the first one running at the same time, the queries get extremely slow (I guess they are fighting
for access to the sort disk, because I see a lot of small IOs instead of the big, nice IOs I see when only one of
these queries runs). The IO subsystem seems to degrade very badly when there is a lot of concurrent activity on this
server.
For instance, last weekend we had 8-million-record backups running simultaneously, with the hash join plan. It took 24
hours for them to complete.
If each had been alone on the server, it would have taken around 1 hour.


Of course, with these smaller cost values, there is still a batch size above which the plan goes back to the first one.

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 3
seq_page_cost to 1, random_page_cost to 2. Plan is the same as Plan 1.

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 4
seq_page_cost back to 1, random_page_cost back to 4, work_mem raised to 512MB. Same plan as Plan 1.
The estimated cost hasn't changed. Is this normal?

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)

Maybe this one would scale a bit better, as there would be fewer sort files? I couldn't execute it and get reliable
times (sorry, the production period has started).
If necessary, I can run it again tomorrow. I had to cancel the query after more than 15 minutes, to let the server do
its regular work.



There are other things I am thinking of: maybe it would be better to have the sort space on another (non-DRBD'ed) RAID
set? We have a quite cheap setup right now for the database, and I think this might help it scale better. I can get a
filesystem in another volume group, which is not used much for now.
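If I go that route, I suppose it would just be a matter of pointing temporary files (sorts, hash batches) at a tablespace on that volume group, something like this (the path is hypothetical, of course):

```sql
-- Hypothetical mount point on the non-DRBD volume group
CREATE TABLESPACE sort_space LOCATION '/mnt/otherraid/pg_sort';
-- Send sort and hash spill files there instead of the main data volume
SET temp_tablespaces = 'sort_space';
```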

Anyway, thanks for any ideas you might have on this.

Marc.
