Re: Best way to delete unreferenced rows? - Mailing list pgsql-performance
From: Ed Tyrrill
Subject: Re: Best way to delete unreferenced rows?
Msg-id: 1181610586.27394.7.camel@nickel.avamar.com
In response to: Re: Best way to delete unreferenced rows? (Craig James <craig_james@emolecules.com>)
List: pgsql-performance
Craig James wrote:
> Tyrrill, Ed wrote:
>> QUERY PLAN
>> ------------------------------------------------------------------------
>> Merge Left Join (cost=38725295.93..42505394.70 rows=13799645 width=8) (actual time=6503583.342..8220629.311 rows=93524 loops=1)
>>   Merge Cond: ("outer".record_id = "inner".record_id)
>>   Filter: ("inner".record_id IS NULL)
>>   ->  Index Scan using backupobjects_pkey on backupobjects (cost=0.00..521525.10 rows=13799645 width=8) (actual time=15.955..357813.621 rows=13799645 loops=1)
>>   ->  Sort (cost=38725295.93..39262641.69 rows=214938304 width=8) (actual time=6503265.293..7713657.750 rows=214938308 loops=1)
>>         Sort Key: backup_location.record_id
>>         ->  Seq Scan on backup_location (cost=0.00..3311212.04 rows=214938304 width=8) (actual time=11.175..1881179.825 rows=214938308 loops=1)
>> Total runtime: 8229178.269 ms
>> (8 rows)
>>
>> I ran vacuum analyze after the last time any inserts, deletes, or updates were done, and before I ran the query above. I've attached my postgresql.conf. The machine has 4 GB of RAM.
>
> I thought maybe someone with more expertise than me might answer this, but since they haven't I'll just make a comment. It looks to me like the sort of 214 million rows is what's killing you. I suppose you could try to increase the sort memory, but that's a lot of memory. It seems to me an index merge of a relation this large would be faster, but that's a topic for the experts.
>
> On a theoretical level, the problem is that it's sorting the largest table. Perhaps you could re-cast the query so that it only has to sort the smaller table, something like
>
>   select a.id from a where a.id not in (select distinct b.id from b)
>
> where "b" is the smaller table. There's still no guarantee that it won't do a sort on "a", though. In fact one of the clever things about Postgres is that it can convert a query like the one above into a regular join, unless you do something like "select ... offset 0" which blocks the optimizer from doing the rearrangement.
>
> But I think the first approach is to try to tune for a better plan using your original query.
>
> Craig

Thanks for the input Craig. I actually started out with a query similar to what you suggest, but it took days to complete back when the larger table, backup_location, was still under 100 million rows. The current query gives the best performance to date.

I have been playing around with work_mem, and doubling it to 128MB did result in some improvement, but doubling it again to 256MB showed no further gain. Here is the explain analyze with work_mem increased to 128MB:

mdsdb=# explain analyze select backupobjects.record_id from backupobjects left outer join backup_location using (record_id) where backup_location.record_id is null;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join (cost=36876242.28..40658535.53 rows=13712990 width=8) (actual time=5795768.950..5795768.950 rows=0 loops=1)
   Merge Cond: ("outer".record_id = "inner".record_id)
   Filter: ("inner".record_id IS NULL)
   ->  Index Scan using backupobjects_pkey on backupobjects (cost=0.00..520571.89 rows=13712990 width=8) (actual time=2.490..201516.228 rows=13706121 loops=1)
   ->  Sort (cost=36876242.28..37414148.76 rows=215162592 width=8) (actual time=4904205.255..5440137.309 rows=215162559 loops=1)
         Sort Key: backup_location.record_id
         ->  Seq Scan on backup_location (cost=0.00..3314666.92 rows=215162592 width=8) (actual time=4.186..1262641.774 rows=215162559 loops=1)
 Total runtime: 5796322.535 ms
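For anyone trying to reproduce this, the pieces discussed above can be sketched as follows. This is a hedged sketch, not something benchmarked in this thread: the table and column names (backupobjects, backup_location, record_id) come from the plans above, the work_mem value can be set per-session instead of in postgresql.conf, and the NOT EXISTS form is a standard alternative anti-join phrasing that was not tested here.

```sql
-- Try a larger sort memory for this session only (no config edit or restart).
SET work_mem = '128MB';

-- The formulation used in the thread: left-join anti-join.
EXPLAIN ANALYZE
SELECT backupobjects.record_id
FROM backupobjects
LEFT OUTER JOIN backup_location USING (record_id)
WHERE backup_location.record_id IS NULL;

-- An equivalent NOT EXISTS phrasing (untested in this thread); the planner
-- may choose a different join strategy for it than for the NOT IN form.
SELECT bo.record_id
FROM backupobjects bo
WHERE NOT EXISTS (
    SELECT 1
    FROM backup_location bl
    WHERE bl.record_id = bo.record_id
);
```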