Thread: Why is seq search preferred here by planner?

Why is seq search preferred here by planner?

From
Date:
Hi Folks,

I want to update the country feild in table email_bank by looking
up another table personal_account_details by matching userid.
any anyone tellme why an seq scan on personal_account_details
for getting the coutry. the query is:

UPDATE email_bank set country=personal_account_details.country where userid > 0 and
userid=personal_account_details.userid  ;

both tables have been vacuum analyzed before running the queries.
my question is is the the best plan ?

becoz i most get similar plan when i update in
this manner.

Its quite a decent IBM xseries server with
2GB ram   and 4  Intel(R) XEON(TM) CPU 2.00GHz.


and explain analyze is below:

tradein_clients=# begin work;explain analyze UPDATE email_bank set
country=personal_account_details.country where userid > 0 and
userid=personal_account_details.userid  ;BEGIN
Time: 669.57 ms
                                                                 QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
HashJoin  (cost=14496.79..36722.98 rows=216658 width=137) (actual time=3505.31..49707.67rows=150756 loops=1)   Hash
Cond:("outer".userid = "inner".userid)  ->  Seq Scan on email_bank  (cost=0.00..16268.10 rows=216658 width=123) (actual
time=0.06..43033.91 rows=155714 loops=1)         Filter: (userid > 0)  ->  Hash  (cost=14113.23..14113.23 rows=153423
width=14)(actual time=3505.08..3505.08 rows=0  loops=1)         ->  Seq Scan on personal_account_details
(cost=0.00..14113.23rows=153423 width=14)        (actual time=0.04..3196.12 rows=153431 loops=1) Total runtime:
209472.14msec 
(7 rows)

Regds
Mallah.



-----------------------------------------
Get your free web based email at trade-india.com.  "India's Leading B2B eMarketplace.!"
http://www.trade-india.com/



Re: Why is seq search preferred here by planner?

From
Josh Berkus
Date:
Mallah,

> UPDATE email_bank set country=personal_account_details.country where userid
> > 0 and userid=personal_account_details.userid  ;

While Postgres is letting you "fudge", you are missing part of that SQL
statement.  It should be:
UPDATE email_bank set country=personal_account_details.countryFROM personal_account_detailswhere userid > 0 and
userid=personal_account_details.userid ; 

> tradein_clients=# begin work;explain analyze UPDATE email_bank set
> country=personal_account_details.country where userid > 0 and
> userid=personal_account_details.userid  ;BEGIN
> Time: 669.57 ms

Hmmm ... 669 ms seems pretty speedy for updating 150,000 rows.   What's the
problem?

--
Josh Berkus
Aglio Database Solutions
San Francisco



Re: Why is seq search preferred here by planner?

From
Date:
> Mallah,
>
>> UPDATE email_bank set country=personal_account_details.country where userid
>> > 0 and userid=personal_account_details.userid  ;
>
> While Postgres is letting you "fudge", you are missing part of that SQL  statement.  It should
> be:
>
> UPDATE email_bank set country=personal_account_details.country
> FROM personal_account_details
> where userid > 0 and userid=personal_account_details.userid  ;

yes this is the proper SQL , but that also works and have same results.

>
>> tradein_clients=# begin work;explain analyze UPDATE email_bank set
>> country=personal_account_details.country where userid > 0 and
>> userid=personal_account_details.userid  ;BEGIN
>> Time: 669.57 ms
>
> Hmmm ... 669 ms seems pretty speedy for updating 150,000 rows.   What's the  problem?

Hmm i am not running postgresql on a CRAY  :-)
that was the time for "begin work;"  since
i am explain analysing an update i am putting it in transaction.

the actualt time was nearing 300 secs which is 5 mins

i did an explain of the UPDATE FROM variant of the query it has the same
plan for it.

regds
mallah.




>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco



-----------------------------------------
Get your free web based email at trade-india.com.  "India's Leading B2B eMarketplace.!"
http://www.trade-india.com/



Re: Why is seq search preferred here by planner?

From
Josh Berkus
Date:
Mallah,

> Hmm i am not running postgresql on a CRAY  :-)
>
>  that was the time for "begin work;"  since
> i am explain analysing an update i am putting it in transaction.
>
> the actualt time was nearing 300 secs which is 5 mins
>
> i did an explain of the UPDATE FROM variant of the query it has the same
> plan for it.

It's the same query.  When you forget the FROM, postgres tries to insert it
for you ... in fact, you generally get a notice like "Inserting missing FROM
clause for table 'personal_account_details'".

Am I reading your anaylze right?  I'm never 100% sure with 7.2 ... you *are*
updating 150,000 rows?

If I'm right, then the query is using a seq scan because it's faster than an
index scan for a large number of rows.   You can always test this by runninng
the query after an SET ENABLE_SEQSCAN = FALSE; generally the resulting query
will take 5-10 times as long.

Your query delay time is updating 150,000 rows, period.   That's a *lot* of
disk activity, and it can only be improved by adjusting your postgresql.conf,
the disk locations of your files, the speed of your disk array, and your I/O
bandwidth.

How long does this take?

UPDATE email_bank set country=personal_account_details.countryFROM personal_account_detailswhere userid > 0 and
userid=personal_account_details.userid    and email_bank.country <> personal_account_details.countryor
email_bank.countryIS NULL;  

--
Josh Berkus
Aglio Database Solutions
San Francisco



Re: Why is seq search preferred here by planner?

From
Date:
> Mallah,
>
>> Hmm i am not running postgresql on a CRAY  :-)
>>
>>  that was the time for "begin work;"  since
>> i am explain analysing an update i am putting it in transaction.
>>
>> the actualt time was nearing 300 secs which is 5 mins
>>
>> i did an explain of the UPDATE FROM variant of the query it has the same plan for it.
>
> It's the same query.  When you forget the FROM, postgres tries to insert it  for you ... in
> fact, you generally get a notice like "Inserting missing FROM  clause for table
> 'personal_account_details'".

Josh when i get that such  Notices the result of update is usually worng.
I do not get that notice for the SQL i posted.

>
> Am I reading your anaylze right?
i think u may have missed parts of it.
 I'm never 100% sure with 7.2 ... you *are*  updating 150,000
> rows?

7.2 u mean ,  pgsql 7.2 ?  No i am running pgsql 7.3.2

yes i did update 1,50,000 rows that time.

>
> If I'm right, then the query is using a seq scan because it's faster than an  index scan for a
> large number of rows.   You can always test this by runninng  the query after an SET
> ENABLE_SEQSCAN = FALSE; generally the resulting query  will take 5-10 times as long.


Yes as explained in docs (somewhere ) and by many ppl seq scan is faster when
majority of  the rows are stisfy the search  criteria.

my botheration was that if pgsql were to search the personal_account_details with userid
(which is incidently the pkey there) why would it not use the uniq  index  ?
logically the query shud be looking the coutry feild from the personal_account_details
for every userid (row) encoutered in email_bank table.

may be i am wrong here in correctly under standing the  explain output.

tradein_clients=# explain UPDATE email_bank set country=personal_account_details.country FROM
personal_account_details where email_bank.userid > 0 and
email_bank.userid=personal_account_details.userid and ( email_bank.country <>
personal_account_details.country or email_bank.country IS NULL );                                          QUERY PLAN
----------------------------------------------------------------------------------------------Hash Join
(cost=14497.06..37806.55rows=186849 width=144)  Hash Cond: ("outer".userid = "inner".userid)  Join Filter:
(("outer".country<> "inner".country) OR ("outer".country IS NULL))  ->  Seq Scan on email_bank  (cost=0.00..16268.10
rows=216658width=130)        Filter: (userid > 0)  ->  Hash  (cost=14113.45..14113.45 rows=153445 width=14)        ->
SeqScan on personal_account_details  (cost=0.00..14113.45 rows=153445 width=14) 
(7 rows)

I am bothered abt the second seq scan , sorry i do not understand "Hash" is there any docs that
explain me that ?

>
> Your query delay time is updating 150,000 rows, period.   That's a *lot* of  disk activity, and
> it can only be improved by adjusting your postgresql.conf,  the disk locations of your files,
> the speed of your disk array, and your I/O  bandwidth.

my WAL logs (pg_xlog) are already in a dedicated disk [ buts thats a different topic ]

>
> How long does this take?
>
> UPDATE email_bank set country=personal_account_details.country
> FROM personal_account_details
> where userid > 0 and userid=personal_account_details.userid
>     and email_bank.country <> personal_account_details.country
>     or email_bank.country IS NULL;

Yes It took very less , because there were actually very less rows to update .
I agree the  "email_bank.country <> personal_account_details.country     or email_bank.country IS NULL"
was a neat thing to do in last query which i didnt'   :-(

this time since most of the rows were already equal becoz i updated them a little while
back with that 5 mins query ;-)

anyway could u plez explain the second "seq scan part" ?



HERE WAS THE EXPLIAN ANALYZE OUTPUT:

hey dont you think the parentheis i added were necessary for the query
to update proper rows ? becoz the  "or" part could have evaluated  to true even for
no matching userids resulting in update of unecessary rows.


tradein_clients=# begin work ; explain analyze UPDATE email_bank set
country=personal_account_details.country FROM personal_account_details where email_bank.userid > 0
and email_bank.userid=personal_account_details.userid and ( email_bank.country <>
personal_account_details.country or email_bank.country IS NULL );BEGIN
Time: 720.04 ms                                                                 QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
HashJoin  (cost=14497.06..37806.55 rows=186849 width=144) (actual time=5874.10..8754.98rows=1033 loops=1)   Hash Cond:
("outer".userid= "inner".userid)  Join Filter: (("outer".country <> "inner".country) OR ("outer".country IS NULL))  ->
SeqScan on email_bank  (cost=0.00..16268.10 rows=216658 width=130) (actual  time=0.83..1361.35 rows=156669 loops=1)
   Filter: (userid > 0)  ->  Hash  (cost=14113.45..14113.45 rows=153445 width=14) (actual time=5855.59..5855.59 rows=0
loops=1)        ->  Seq Scan on personal_account_details  (cost=0.00..14113.45 rows=153445 width=14)        (actual
time=0.04..5550.93rows=153466 loops=1) Total runtime: 9370.74 msec 
(8 rows)






>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
>
> ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9'
> the postmaster



-----------------------------------------
Get your free web based email at trade-india.com.  "India's Leading B2B eMarketplace.!"
http://www.trade-india.com/



Re: Why is seq search preferred here by planner?

From
Josh Berkus
Date:
Mallah,

> tradein_clients=# begin work ; explain analyze UPDATE email_bank set
> country=personal_account_details.country FROM personal_account_details where
email_bank.userid > 0
> and email_bank.userid=personal_account_details.userid and (
email_bank.country <>
> personal_account_details.country or email_bank.country IS NULL );BEGIN

Ooops, yeah, you're correct ... the parens are required, I just forgot them.

The reason that the planner is using a seq scan on personal_account_details is
the same as the reason for using  a seq scan on email_bank; the number of
rows which match the condition, about 150,000.   With that many qualifying
rows, a seq scan is faster.

How often do you do this query?  If it's frequent, then running my version of
the query with two new indexes -- one on email_bank.userid,
email_bank.country, and one on personal_account_details.userid, country --
would be a interesting test with my version of the query.  Though I'm not
sure about the email_bank.country IS NULL condition; workarounds, anyone?

If you run this query only once a day, don't worry about it; run my version of
the query, and it should finish in < 30 seconds, and that should be good
enough, yes?

Oh, and a "Hash" is a "Hash Join" -- where the DB basically throws all of the
rows from both tables in a big mess and picks out the ones that match.

--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Why is seq search preferred here by planner?

From
Date:
> Mallah,
>
>> tradein_clients=# begin work ; explain analyze UPDATE email_bank set
>> country=personal_account_details.country FROM personal_account_details where
> email_bank.userid > 0
>> and email_bank.userid=personal_account_details.userid and (
> email_bank.country <>
>> personal_account_details.country or email_bank.country IS NULL );BEGIN
>
> Ooops, yeah, you're correct ... the parens are required, I just forgot them.
>
> The reason that the planner is using a seq scan on personal_account_details is  the same as the
> reason for using  a seq scan on email_bank; the number of  rows which match the condition,
> about 150,000.   With that many qualifying  rows, a seq scan is faster.


But there are two tables here , email_bank and personal_account_details
in personal account details only one row is supposed to match a given userid
as userid is the PKEY , why seq_scan there ? or am i getting the explain wrong ?


>
> How often do you do this query?  If it's frequent, then running my version of  the query with
> two new indexes -- one on email_bank.userid,
> email_bank.country, and one on personal_account_details.userid, country --  would be a
> interesting test with my version of the query.  Though I'm not  sure about the
> email_bank.country IS NULL condition; workarounds, anyone?

Hey this query is not even part of my production system,

I am doing a massive import of all the mess i have created since
last year ;-) into a strcity normal form with lots of RIs and checks.


>
> If you run this query only once a day, don't worry about it; run my version of  the query, and
> it should finish in < 30 seconds, and that should be good  enough, yes?

Not applicable!

>
> Oh, and a "Hash" is a "Hash Join" -- where the DB basically throws all of the  rows from both
> tables in a big mess and picks out the ones that match.      :- )


Hmm..  although we have beaten around the bush  but my question is still not answered
"why seq scan on a pkey " ??  (or am i getting the explain wrong )

anyway nevermind as  my update is already over and i have added an
FKEY  on the coutry feild to prevent collecting future mess.

Warm Regds
Mallah




>
> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco
>
>
> ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and
> unsubscribe commands go to majordomo@postgresql.org



-----------------------------------------
Get your free web based email at trade-india.com.  "India's Leading B2B eMarketplace.!"
http://www.trade-india.com/



Re: Why is seq search preferred here by planner?

From
Josh Berkus
Date:
Mallah,

> > The reason that the planner is using a seq scan on
personal_account_details is  the same as the
> > reason for using  a seq scan on email_bank; the number of  rows which
match the condition,
> > about 150,000.   With that many qualifying  rows, a seq scan is faster.
>
>
> But there are two tables here , email_bank and personal_account_details
> in personal account details only one row is supposed to match a given userid
> as userid is the PKEY , why seq_scan there ? or am i getting the explain
wrong ?

It doesn't matter whether it's a primary key or not.  If your query or
subquery condition returns a large number/proportion of rows, a seq scan is
going to be faster than an index scan.

As far as userid being a key, this affects the planner's *join method*, *not*
how the database gets the rows.


--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Why is seq search preferred here by planner?

From
Date:
> Mallah,
>
>> > The reason that the planner is using a seq scan on
> personal_account_details is  the same as the
>> > reason for using  a seq scan on email_bank; the number of  rows which
> match the condition,
>> > about 150,000.   With that many qualifying  rows, a seq scan is faster.
>>
>>
>> But there are two tables here , email_bank and personal_account_details in personal account
>> details only one row is supposed to match a given userid as userid is the PKEY , why seq_scan
>> there ? or am i getting the explain
> wrong ?
>
> It doesn't matter whether it's a primary key or not.  If your query or  subquery condition
> returns a large number/proportion of rows, a seq scan is  going to be faster than an index
> scan.
>
> As far as userid being a key, this affects the planner's *join method*, *not*  how the database
> gets the rows.

Got it , you mean to say userid being pkey  is affecting the method of
implicit join being made by the update  stmt.

thanks once again.

Regds
mallah.


>
>
> --
> -Josh Berkus
> Aglio Database Solutions
> San Francisco



-----------------------------------------
Get your free web based email at trade-india.com.  "India's Leading B2B eMarketplace.!"
http://www.trade-india.com/



Re: Why is seq search preferred here by planner?

From
Josh Berkus
Date:
Mallah,

> Got it , you mean to say userid being pkey  is affecting the method of
> implicit join being made by the update  stmt.

More to the point, the PKey index doesn't help the db retrieve the rows to
join at all.  It might if only 1,000 rows matched, but not with 150,000
matching rows.

-Josh Berkus