Thread: Why is seq scan preferred here by the planner?
Hi Folks,

I want to update the country field in table email_bank by looking up another table, personal_account_details, matching on userid. Can anyone tell me why there is a seq scan on personal_account_details for getting the country? The query is:

UPDATE email_bank set country=personal_account_details.country
where userid > 0 and userid=personal_account_details.userid ;

Both tables have been vacuum analyzed before running the queries. My question is: is this the best plan? Because I mostly get a similar plan when I update in this manner. It's quite a decent IBM xSeries server with 2GB RAM and 4 Intel(R) XEON(TM) CPU 2.00GHz. The explain analyze is below:

tradein_clients=# begin work; explain analyze UPDATE email_bank set country=personal_account_details.country where userid > 0 and userid=personal_account_details.userid ;
BEGIN
Time: 669.57 ms

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=14496.79..36722.98 rows=216658 width=137) (actual time=3505.31..49707.67 rows=150756 loops=1)
   Hash Cond: ("outer".userid = "inner".userid)
   ->  Seq Scan on email_bank  (cost=0.00..16268.10 rows=216658 width=123) (actual time=0.06..43033.91 rows=155714 loops=1)
         Filter: (userid > 0)
   ->  Hash  (cost=14113.23..14113.23 rows=153423 width=14) (actual time=3505.08..3505.08 rows=0 loops=1)
         ->  Seq Scan on personal_account_details  (cost=0.00..14113.23 rows=153423 width=14) (actual time=0.04..3196.12 rows=153431 loops=1)
 Total runtime: 209472.14 msec
(7 rows)

Regds
Mallah.

-----------------------------------------
Get your free web based email at trade-india.com.
"India's Leading B2B eMarketplace.!"
http://www.trade-india.com/
Mallah,

> UPDATE email_bank set country=personal_account_details.country where userid
> > 0 and userid=personal_account_details.userid ;

While Postgres is letting you "fudge", you are missing part of that SQL statement. It should be:

UPDATE email_bank set country=personal_account_details.country
FROM personal_account_details
where userid > 0 and userid=personal_account_details.userid ;

> tradein_clients=# begin work;explain analyze UPDATE email_bank set
> country=personal_account_details.country where userid > 0 and
> userid=personal_account_details.userid ;
> BEGIN
> Time: 669.57 ms

Hmmm ... 669 ms seems pretty speedy for updating 150,000 rows. What's the problem?

--
Josh Berkus
Aglio Database Solutions
San Francisco
> Mallah,
>
>> UPDATE email_bank set country=personal_account_details.country where userid
>> > 0 and userid=personal_account_details.userid ;
>
> While Postgres is letting you "fudge", you are missing part of that SQL
> statement. It should be:
>
> UPDATE email_bank set country=personal_account_details.country
> FROM personal_account_details
> where userid > 0 and userid=personal_account_details.userid ;

Yes, this is the proper SQL, but the other form also works and has the same results.

>> tradein_clients=# begin work;explain analyze UPDATE email_bank set
>> country=personal_account_details.country where userid > 0 and
>> userid=personal_account_details.userid ;
>> BEGIN
>> Time: 669.57 ms
>
> Hmmm ... 669 ms seems pretty speedy for updating 150,000 rows. What's the problem?

Hmm, I am not running postgresql on a CRAY :-)

That was the time for "begin work;" only. Since I am explain analyzing an update, I am putting it in a transaction. The actual time was nearly 300 secs, which is 5 mins.

I did an explain of the UPDATE ... FROM variant of the query; it has the same plan.

regds
mallah.

> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
Mallah,

> Hmm i am not running postgresql on a CRAY :-)
>
> that was the time for "begin work;" since
> i am explain analysing an update i am putting it in transaction.
>
> the actual time was nearing 300 secs which is 5 mins
>
> i did an explain of the UPDATE FROM variant of the query it has the same
> plan for it.

It's the same query. When you forget the FROM, postgres tries to insert it for you ... in fact, you generally get a notice like "Inserting missing FROM clause for table 'personal_account_details'".

Am I reading your analyze right? I'm never 100% sure with 7.2 ... you *are* updating 150,000 rows?

If I'm right, then the query is using a seq scan because it's faster than an index scan for a large number of rows. You can always test this by running the query after a SET ENABLE_SEQSCAN = FALSE; generally the resulting query will take 5-10 times as long.

Your query delay time is updating 150,000 rows, period. That's a *lot* of disk activity, and it can only be improved by adjusting your postgresql.conf, the disk locations of your files, the speed of your disk array, and your I/O bandwidth.

How long does this take?

UPDATE email_bank set country=personal_account_details.country
FROM personal_account_details
where userid > 0 and userid=personal_account_details.userid
and email_bank.country <> personal_account_details.country
or email_bank.country IS NULL;

--
Josh Berkus
Aglio Database Solutions
San Francisco
> Mallah,
>
>> Hmm i am not running postgresql on a CRAY :-)
>>
>> that was the time for "begin work;" since
>> i am explain analysing an update i am putting it in transaction.
>>
>> the actual time was nearing 300 secs which is 5 mins
>>
>> i did an explain of the UPDATE FROM variant of the query it has the same plan for it.
>
> It's the same query. When you forget the FROM, postgres tries to insert it
> for you ... in fact, you generally get a notice like "Inserting missing FROM
> clause for table 'personal_account_details'".

Josh, when I get such notices the result of the update is usually wrong. I do not get that notice for the SQL I posted.

> Am I reading your analyze right? I'm never 100% sure with 7.2 ... you *are*
> updating 150,000 rows?

I think you may have missed parts of it. 7.2? You mean pgsql 7.2? No, I am running pgsql 7.3.2. Yes, I did update 150,000 rows that time.

> If I'm right, then the query is using a seq scan because it's faster than an
> index scan for a large number of rows. You can always test this by running
> the query after a SET ENABLE_SEQSCAN = FALSE; generally the resulting query
> will take 5-10 times as long.

Yes, as explained in the docs (somewhere) and by many people, a seq scan is faster when the majority of the rows satisfy the search criteria. My question was: if pgsql is searching personal_account_details by userid (which is incidentally the pkey there), why would it not use the unique index? Logically the query should be looking up the country field from personal_account_details for every userid (row) encountered in the email_bank table. Maybe I am wrong in my understanding of the explain output.
tradein_clients=# explain UPDATE email_bank set country=personal_account_details.country FROM personal_account_details where email_bank.userid > 0 and email_bank.userid=personal_account_details.userid and ( email_bank.country <> personal_account_details.country or email_bank.country IS NULL );

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Hash Join  (cost=14497.06..37806.55 rows=186849 width=144)
   Hash Cond: ("outer".userid = "inner".userid)
   Join Filter: (("outer".country <> "inner".country) OR ("outer".country IS NULL))
   ->  Seq Scan on email_bank  (cost=0.00..16268.10 rows=216658 width=130)
         Filter: (userid > 0)
   ->  Hash  (cost=14113.45..14113.45 rows=153445 width=14)
         ->  Seq Scan on personal_account_details  (cost=0.00..14113.45 rows=153445 width=14)
(7 rows)

I am bothered about the second seq scan. Sorry, I do not understand "Hash"; are there any docs that explain that to me?

> Your query delay time is updating 150,000 rows, period. That's a *lot* of
> disk activity, and it can only be improved by adjusting your postgresql.conf,
> the disk locations of your files, the speed of your disk array, and your I/O
> bandwidth.

My WAL logs (pg_xlog) are already on a dedicated disk [but that's a different topic].

> How long does this take?
>
> UPDATE email_bank set country=personal_account_details.country
> FROM personal_account_details
> where userid > 0 and userid=personal_account_details.userid
> and email_bank.country <> personal_account_details.country
> or email_bank.country IS NULL;

Yes, it took very little time, because there were actually very few rows to update. I agree the "email_bank.country <> personal_account_details.country or email_bank.country IS NULL" condition was a neat thing to do, which I didn't in my last query :-( This time most of the rows were already equal because I updated them a little while back with that 5 mins query ;-)

Anyway, could you please explain the second "seq scan" part?
HERE WAS THE EXPLAIN ANALYZE OUTPUT:

Hey, don't you think the parentheses I added were necessary for the query to update the proper rows? Because the "or" part could have evaluated to true even for non-matching userids, resulting in updates of unnecessary rows.

tradein_clients=# begin work ; explain analyze UPDATE email_bank set country=personal_account_details.country FROM personal_account_details where email_bank.userid > 0 and email_bank.userid=personal_account_details.userid and ( email_bank.country <> personal_account_details.country or email_bank.country IS NULL );
BEGIN
Time: 720.04 ms

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=14497.06..37806.55 rows=186849 width=144) (actual time=5874.10..8754.98 rows=1033 loops=1)
   Hash Cond: ("outer".userid = "inner".userid)
   Join Filter: (("outer".country <> "inner".country) OR ("outer".country IS NULL))
   ->  Seq Scan on email_bank  (cost=0.00..16268.10 rows=216658 width=130) (actual time=0.83..1361.35 rows=156669 loops=1)
         Filter: (userid > 0)
   ->  Hash  (cost=14113.45..14113.45 rows=153445 width=14) (actual time=5855.59..5855.59 rows=0 loops=1)
         ->  Seq Scan on personal_account_details  (cost=0.00..14113.45 rows=153445 width=14) (actual time=0.04..5550.93 rows=153466 loops=1)
 Total runtime: 9370.74 msec
(8 rows)

> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
Mallah,

> tradein_clients=# begin work ; explain analyze UPDATE email_bank set
> country=personal_account_details.country FROM personal_account_details
> where email_bank.userid > 0
> and email_bank.userid=personal_account_details.userid
> and ( email_bank.country <> personal_account_details.country
> or email_bank.country IS NULL );
> BEGIN

Ooops, yeah, you're correct ... the parens are required, I just forgot them.

The reason that the planner is using a seq scan on personal_account_details is the same as the reason for using a seq scan on email_bank: the number of rows which match the condition, about 150,000. With that many qualifying rows, a seq scan is faster.

How often do you run this query? If it's frequent, then running my version of the query with two new indexes -- one on email_bank.userid, email_bank.country, and one on personal_account_details.userid, country -- would be an interesting test. Though I'm not sure about the email_bank.country IS NULL condition; workarounds, anyone?

If you run this query only once a day, don't worry about it; run my version of the query, and it should finish in < 30 seconds, and that should be good enough, yes?

Oh, and a "Hash" is a "Hash Join" -- where the DB basically throws all of the rows from both tables in a big mess and picks out the ones that match.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
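[Archive note: a toy sketch of the hash join idea Josh describes, in Python. This is purely illustrative and is NOT PostgreSQL's actual implementation; the table data and function name are made up. It shows why no index is needed: both inputs are read sequentially once, with the inner table hashed on the join key and the outer table probed against it.]

```python
# Illustrative hash join sketch (not PostgreSQL internals).
# Build phase: hash every inner row by its join key.
# Probe phase: one sequential pass over the outer input,
# looking each outer key up in the hash table.

def hash_join(outer_rows, inner_rows, outer_key, inner_key):
    # Build: group inner rows by join-key value.
    table = {}
    for row in inner_rows:
        table.setdefault(row[inner_key], []).append(row)
    # Probe: single pass over the outer rows; no index involved.
    result = []
    for row in outer_rows:
        for match in table.get(row[outer_key], []):
            merged = dict(row)
            merged.update(match)  # inner columns overwrite outer ones
            result.append(merged)
    return result

# Toy stand-ins for email_bank and personal_account_details:
email_bank = [{"userid": 1, "country": None}, {"userid": 2, "country": "IN"}]
accounts = [{"userid": 1, "country": "US"}, {"userid": 2, "country": "IN"}]
joined = hash_join(email_bank, accounts, "userid", "userid")
```

Probing the hash table is O(1) per outer row, which is why this plan only needs one sequential read of each table.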
> Mallah,
>
>> tradein_clients=# begin work ; explain analyze UPDATE email_bank set
>> country=personal_account_details.country FROM personal_account_details
>> where email_bank.userid > 0
>> and email_bank.userid=personal_account_details.userid
>> and ( email_bank.country <> personal_account_details.country
>> or email_bank.country IS NULL );
>> BEGIN
>
> Ooops, yeah, you're correct ... the parens are required, I just forgot them.
>
> The reason that the planner is using a seq scan on personal_account_details
> is the same as the reason for using a seq scan on email_bank; the number of
> rows which match the condition, about 150,000. With that many qualifying
> rows, a seq scan is faster.

But there are two tables here, email_bank and personal_account_details. In personal_account_details only one row is supposed to match a given userid, as userid is the PKEY; why a seq scan there? Or am I getting the explain wrong?

> How often do you do this query? If it's frequent, then running my version of
> the query with two new indexes -- one on email_bank.userid,
> email_bank.country, and one on personal_account_details.userid, country --
> would be an interesting test. Though I'm not sure about the
> email_bank.country IS NULL condition; workarounds, anyone?

Hey, this query is not even part of my production system. I am doing a massive import of all the mess I have created since last year ;-) into a strictly normal form with lots of RIs and checks.

> If you run this query only once a day, don't worry about it; run my version
> of the query, and it should finish in < 30 seconds, and that should be good
> enough, yes?

Not applicable!

> Oh, and a "Hash" is a "Hash Join" -- where the DB basically throws all of
> the rows from both tables in a big mess and picks out the ones that match.

:-)

Hmm.. although we have beaten around the bush, my question is still not answered: why a seq scan on a pkey??
(Or am I getting the explain wrong?)

Anyway, never mind, as my update is already over and I have added an FKEY on the country field to prevent collecting future mess.

Warm Regds
Mallah

> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco
Mallah,

> > The reason that the planner is using a seq scan on personal_account_details
> > is the same as the reason for using a seq scan on email_bank; the number of
> > rows which match the condition, about 150,000. With that many qualifying
> > rows, a seq scan is faster.
>
> But there are two tables here , email_bank and personal_account_details
> in personal account details only one row is supposed to match a given userid
> as userid is the PKEY , why seq_scan there ? or am i getting the explain wrong ?

It doesn't matter whether it's a primary key or not. If your query or subquery condition returns a large number/proportion of rows, a seq scan is going to be faster than an index scan.

As far as userid being a key, this affects the planner's *join method*, *not* how the database gets the rows.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
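[Archive note: a back-of-the-envelope sketch of the trade-off Josh is describing. The numbers and cost formulas below are hypothetical, only loosely inspired by the planner's page-cost idea (e.g. random_page_cost); they are not PostgreSQL's real cost model. The point is the crossover: an index scan pays roughly one random page fetch per matching row, while a seq scan pays a fixed sequential read of the whole table, so once most rows qualify the seq scan wins regardless of any primary key.]

```python
# Toy cost comparison: seq scan vs index scan (hypothetical numbers).

def seq_scan_cost(total_pages, seq_page_cost=1.0):
    # A seq scan always reads every page of the table, sequentially.
    return total_pages * seq_page_cost

def index_scan_cost(matching_rows, random_page_cost=4.0):
    # Pessimistic sketch: one random heap-page fetch per matching row.
    return matching_rows * random_page_cost

pages = 2700  # made-up table size, roughly personal_account_details scale
for rows in (500, 150000):
    seq, idx = seq_scan_cost(pages), index_scan_cost(rows)
    winner = "index scan" if idx < seq else "seq scan"
    print(f"{rows:>7} matching rows: seq={seq:.0f} idx={idx:.0f} -> {winner}")
```

With 500 matching rows the index scan is cheaper; with 150,000 matching rows the seq scan wins by orders of magnitude, which matches the plans in this thread.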
> Mallah,
>
>> > The reason that the planner is using a seq scan on
>> > personal_account_details is the same as the reason for using a seq scan
>> > on email_bank; the number of rows which match the condition, about
>> > 150,000. With that many qualifying rows, a seq scan is faster.
>>
>> But there are two tables here , email_bank and personal_account_details
>> in personal account details only one row is supposed to match a given
>> userid as userid is the PKEY , why seq_scan there ? or am i getting the
>> explain wrong ?
>
> It doesn't matter whether it's a primary key or not. If your query or
> subquery condition returns a large number/proportion of rows, a seq scan is
> going to be faster than an index scan.
>
> As far as userid being a key, this affects the planner's *join method*,
> *not* how the database gets the rows.

Got it. You mean to say userid being the pkey is affecting the method of the implicit join made by the update stmt.

Thanks once again.

Regds
mallah.

> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco
Mallah, > Got it , you mean to say userid being pkey is affecting the method of > implicit join being made by the update stmt. More to the point, the PKey index doesn't help the db retrieve the rows to join at all. It might if only 1,000 rows matched, but not with 150,000 matching rows. -Josh Berkus