Thread: Joins and DELETE FROM
Hi!
As part of a data warehousing project, I need to pre-process data downloaded from an external source, in the form of several large flat files. This preprocessing entails checking the validity of various data items, and discarding those that fail to pass all the checks.
Currently, the code that performs the checks generates intermediate temporary tables of those bits of data that are invalid in some way. (This simplifies the process of generating various quality-control reports about the incoming data).
The next step is to weed out the bad data from the main tables, and here's where I begin to get lost.
To be concrete, suppose I have a table T consisting of 20 million rows, keyed on some column K. (There are no formal constrains on T at the moment, but one could define column K as T's primary key.) Suppose also that I have a second table B (for "bad") consisting of 20 thousand rows, and also keyed on some column K. For each value of B.K there is exactly one row in T such that T.K = B.K, and the task is to delete all these rows from T as efficiently as possible.
My naive approach would something like
DELETE FROM T WHERE T.K IN ( SELECT K FROM B );
...which, according to EXPLAIN, is a terrible idea, because it involves sequentially scanning all 20 million rows of T just to delete about only 0.1% of them.
It seems to me better to sequentially scan B and rely on an index on T to zero-in the few rows in T that must be deleted.
Is this strategy something that can be done with plain SQL (even if to do this I must produce additional helper tables, indices, etc.), or must I write a stored procedure to implement it?
TIA!
Kynn
Kynn Jones wrote: > Hi! > > As part of a data warehousing project, I need to pre-process data downloaded > from an external source, in the form of several large flat files. This > preprocessing entails checking the validity of various data items, and > discarding those that fail to pass all the checks. > > Currently, the code that performs the checks generates intermediate > temporary tables of those bits of data that are invalid in some way. (This > simplifies the process of generating various quality-control reports about > the incoming data). > > The next step is to weed out the bad data from the main tables, and here's > where I begin to get lost. > > To be concrete, suppose I have a table T consisting of 20 million rows, > keyed on some column K. (There are no formal constrains on T at the moment, > but one could define column K as T's primary key.) Suppose also that I have > a second table B (for "bad") consisting of 20 thousand rows, and also keyed > on some column K. For each value of B.K there is exactly one row in T such > that T.K = B.K, and the task is to delete all these rows from T as > efficiently as possible. > > My naive approach would something like > > DELETE FROM T WHERE T.K IN ( SELECT K FROM B ); > > ...which, according to EXPLAIN, is a terrible idea, because it involves > sequentially scanning all 20 million rows of T just to delete about only > 0.1% of them. > > It seems to me better to sequentially scan B and rely on an index on T to > zero-in the few rows in T that must be deleted. > > Is this strategy something that can be done with plain SQL (even if to do > this I must produce additional helper tables, indices, etc.), or must I > write a stored procedure to implement it? The planner knows how to produce such a plan, so it must thinking that it's not the fastest plan. Have you ANALYZEd the tables? You do have an index on T.K, right? What does EXPLAIN ANALYZE output look like? (you can do BEGIN; EXPLAIN ANALYZE ...; ROLLBACK; if you don't want to actually delete the rows) The sequential scan really could be the fastest way to do that. If those 0.1% of the rows are scattered randomly across the table, an index scan might end up fetching almost every page, but using random I/O which is much slower than a sequential read. For example, assuming you can fit 100 rows on a page, deleting 0.1% of the rows would have to access ~ 10% of the pages. At that point, it can easily be cheaper to just seq scan it. You can try to coerce the planner to choose the indexscan with "set enable_seqscan=off", to see how fast it actually is. You could also write the query as DELETE FROM t USING b WHERE t.k = b.k, but I doubt it makes much difference. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Mar 8, 2008 at 1:01 PM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
The planner knows how to produce such a plan, so it must thinking thatKynn Jones wrote:
> Hi!
>
> As part of a data warehousing project, I need to pre-process data downloaded
> from an external source, in the form of several large flat files. This
> preprocessing entails checking the validity of various data items, and
> discarding those that fail to pass all the checks.
>
> Currently, the code that performs the checks generates intermediate
> temporary tables of those bits of data that are invalid in some way. (This
> simplifies the process of generating various quality-control reports about
> the incoming data).
>
> The next step is to weed out the bad data from the main tables, and here's
> where I begin to get lost.
>
> To be concrete, suppose I have a table T consisting of 20 million rows,
> keyed on some column K. (There are no formal constrains on T at the moment,
> but one could define column K as T's primary key.) Suppose also that I have
> a second table B (for "bad") consisting of 20 thousand rows, and also keyed
> on some column K. For each value of B.K there is exactly one row in T such
> that T.K = B.K, and the task is to delete all these rows from T as
> efficiently as possible.
>
> My naive approach would something like
>
> DELETE FROM T WHERE T.K IN ( SELECT K FROM B );
>
> ...which, according to EXPLAIN, is a terrible idea, because it involves
> sequentially scanning all 20 million rows of T just to delete about only
> 0.1% of them.
>
> It seems to me better to sequentially scan B and rely on an index on T to
> zero-in the few rows in T that must be deleted.
>
> Is this strategy something that can be done with plain SQL (even if to do
> this I must produce additional helper tables, indices, etc.), or must I
> write a stored procedure to implement it?
it's not the fastest plan.
Curious.
Have you ANALYZEd the tables? You do have an index on T.K, right? What
does EXPLAIN ANALYZE output look like? (you can do BEGIN; EXPLAIN
ANALYZE ...; ROLLBACK; if you don't want to actually delete the rows)
Yes, all the tables have been vacuumed and analyzed, and there's an index on T.K (and on also on B.K for good measure).
You can try to coerce the planner to choose the indexscan with "set
enable_seqscan=off", to see how fast it actually is.
Thanks, that was a useful trick. I tried it on a simpler case: just the natural join of T and B. (I also used smaller versions of the table, but with a size ratio similar to the one in my hypothetical example.) Indeed, when I turn off sequential scans, the resulting query is over 2X faster.
my_db=> SET ENABLE_SEQSCAN TO ON;
my_db=> EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=7634.14..371997.64 rows=219784 width=13) (actual time=176.065..12041.486 rows=219784 loops=1)
Hash Cond: (t.k = b.k)
-> Seq Scan on t (cost=0.00..172035.56 rows=10509456 width=13) (actual time=0.023..2379.407 rows=10509456 loops=1)
-> Hash (cost=3598.84..3598.84 rows=219784 width=12) (actual time=171.868..171.868 rows=219784 loops=1)
-> Seq Scan on b (cost=0.00..3598.84 rows=219784 width=12) (actual time=0.013..49.626 rows=219784 loops=1)
Total runtime: 12064.966 ms
(6 rows)
my_db=> SET ENABLE_SEQSCAN TO OFF;
my_db=> EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=0.00..423589.69 rows=219784 width=13) (actual time=0.114..5449.808 rows=219784 loops=1)
Merge Cond: (t.k = b.k)
-> Index Scan using idx__t on t (cost=0.00..386463.71 rows=10509456 width=13) (actual time=0.059..3083.182 rows=10509414 loops=1)
-> Index Scan using idx__b on b (cost=0.00..8105.04 rows=219784 width=12) (actual time=0.044..69.659 rows=219784 loops=1)
Total runtime: 5473.812 ms
(5 rows)
Honestly, I still have not learned to fully decipher the output of EXPLAN/EXPLAIN ANALYZE. (The PostgreSQL docs are generally superb, IMO, but there's still a big hole on the subject of the query planner, including the interpretation of these query plans.)
So it seems like turning off ENABLE_SEQSCAN is the way to go. I wonder how much faster the query would be if I could selectively turn of the sequential scan on T. (The one on B seems to me reasonable.)
You could also write the query as DELETE FROM t USING b WHERE t.k = b.k,
but I doubt it makes much difference.
You're right: no difference at all (same query plan).
Thanks again!
Kynn
"Kynn Jones" <kynnjo@gmail.com> writes: > So it seems like turning off ENABLE_SEQSCAN is the way to go. Try reducing random_page_cost a bit instead. Also, have you got effective_cache_size set to something that's realistic for your machine? One problem with this test is that your smaller tables probably fit in memory whereas the big ones may not, so it's not a given that your test accurately reflects how the real query will go down. regards, tom lane
Kynn Jones wrote: > my_db=> SET ENABLE_SEQSCAN TO OFF; > my_db=> EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------- > Merge Join (cost=0.00..423589.69 rows=219784 width=13) (actual time= > 0.114..5449.808 rows=219784 loops=1) > Merge Cond: (t.k = b.k) > -> Index Scan using idx__t on t (cost=0.00..386463.71 rows=10509456 > width=13) (actual time=0.059..3083.182 rows=10509414 loops=1) > -> Index Scan using idx__b on b (cost=0.00..8105.04 rows=219784 > width=12) (actual time=0.044..69.659 rows=219784 loops=1) > Total runtime: 5473.812 ms > (5 rows) That's more like 2% of the rows, not 0.1%. Note that this still isn't the plan you were asking for, it's still scanning the whole index for t, not just looking up the keys from b. What you wanted is a nested loop join. You could try to turn enable_mergejoin=off as well if you want to coerce the planner even more... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Thank you for your post. I finally spent some quality time with the query planner section in the docs' server config chapter. Very instructive, even considering that most of it went over my head!
On Sat, Mar 8, 2008 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
...have you got effective_cache_size set to something that's realistic for your machine?
I guess not. It was the default value (128MB) on a machine with 4GB of RAM. It's not a dedicated server, though, so I'll set it to 1G for now.
But before doing so I need a clarification. The docs state that this parameter is used only for cost estimation, and has no effect on actual memory allocations. I imagine that if other memory-related settings are not somehow in line with it, it could lead to estimates that are out of touch with reality. If this is correct what other memory-related parameters do I need to adjust to ensure that both the planner's estimates and the actual execution agree and fit well with the available memory?
One problem with this test is that your smaller tables probably fit in
memory whereas the big ones may not, so it's not a given that your test
accurately reflects how the real query will go down.
That's a very helpful reminder. Thanks.
Kynn