Thread: Efficient Correlated Update

Efficient Correlated Update

From
Robert DiFalco
Date:
In my system a user can have external contacts. When I am bringing in external contacts I want to correlate any other existing users in the system with those external contacts. A users external contacts may or may not be users in my system. I have a user_id field in "contacts" that is NULL if that contact is not a user in my system

Currently I do something like this after reading in external contacts:

   UPDATE contacts SET user_id = u.id
   FROM my_users u 
   JOIN phone_numbers pn ON u.phone_significant = pn.significant 
   WHERE contacts.owner_id = 7 AND contacts.user_id IS NULL AND contacts.id = pn.ref_contact_id;

If any of the fields are not self explanatory let me know. "Significant" is just the right 7 most digits of a raw phone number. 

I'm more interested in possible improvements to my relational logic than the details of the "significant" condition. IOW, I'm start enough to optimize the "significant" query but not smart enough to know if this is the best approach for the overall correlated UPDATE query. :)

So yeah, is this the best way to update a contact's user_id reference based on a contacts phone number matching the phone number of a user?

One detail from the schema -- A contact can have many phone numbers but a user in my system will only ever have just one phone number. Hence the JOIN to "phone_numbers" versus the column in "my_users".

Thanks.

Re: Efficient Correlated Update

From
Robert DiFalco
Date:
Guys, let me know if I have not provided enough information on this post. Thanks!


On Thu, Aug 8, 2013 at 11:06 AM, Robert DiFalco <robert.difalco@gmail.com> wrote:
In my system a user can have external contacts. When I am bringing in external contacts I want to correlate any other existing users in the system with those external contacts. A users external contacts may or may not be users in my system. I have a user_id field in "contacts" that is NULL if that contact is not a user in my system

Currently I do something like this after reading in external contacts:

   UPDATE contacts SET user_id = u.id
   FROM my_users u 
   JOIN phone_numbers pn ON u.phone_significant = pn.significant 
   WHERE contacts.owner_id = 7 AND contacts.user_id IS NULL AND contacts.id = pn.ref_contact_id;

If any of the fields are not self explanatory let me know. "Significant" is just the right 7 most digits of a raw phone number. 

I'm more interested in possible improvements to my relational logic than the details of the "significant" condition. IOW, I'm start enough to optimize the "significant" query but not smart enough to know if this is the best approach for the overall correlated UPDATE query. :)

So yeah, is this the best way to update a contact's user_id reference based on a contacts phone number matching the phone number of a user?

One detail from the schema -- A contact can have many phone numbers but a user in my system will only ever have just one phone number. Hence the JOIN to "phone_numbers" versus the column in "my_users".

Thanks.

Re: Efficient Correlated Update

From
Kevin Grittner
Date:
Robert DiFalco <robert.difalco@gmail.com> wrote:

> In my system a user can have external contacts. When I am
> bringing in external contacts I want to correlate any other
> existing users in the system with those external contacts. A
> users external contacts may or may not be users in my system. I
> have a user_id field in "contacts" that is NULL if that contact
> is not a user in my system
>
> Currently I do something like this after reading in external
> contacts:
>
>   UPDATE contacts SET user_id = u.id
>   FROM my_users u
>   JOIN phone_numbers pn ON u.phone_significant = pn.significant
>   WHERE contacts.owner_id = 7
>   AND contacts.user_id IS NULL
>   AND contacts.id = pn.ref_contact_id;
>
> If any of the fields are not self explanatory let me know.
> "Significant" is just the right 7 most digits of a raw phone
> number.
>
> I'm more interested in possible improvements to my relational
> logic than the details of the "significant" condition. IOW, I'm
> start enough to optimize the "significant" query but not smart
> enough to know if this is the best approach for the overall
> correlated UPDATE query. :)
>
> So yeah, is this the best way to update a contact's user_id
> reference based on a contacts phone number matching the phone
> number of a user?
>
> One detail from the schema -- A contact can have many phone
> numbers but a user in my system will only ever have just one
> phone number. Hence the JOIN to "phone_numbers" versus the column
> in "my_users".
 
In looking it over, nothing jumped out at me as a problem.  Are you
having some problem with it, like poor performance or getting
results different from what you expected?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Efficient Correlated Update

From
Klaus Ita
Date:
I sometimes experience that updating smaller sets is more efficient than doing all at once in one transaction (talking about 10000+)

Always make sure the update references can make use of indices


On Fri, Aug 9, 2013 at 5:44 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert DiFalco <robert.difalco@gmail.com> wrote:

> In my system a user can have external contacts. When I am
> bringing in external contacts I want to correlate any other
> existing users in the system with those external contacts. A
> users external contacts may or may not be users in my system. I
> have a user_id field in "contacts" that is NULL if that contact
> is not a user in my system
>
> Currently I do something like this after reading in external
> contacts:
>
>   UPDATE contacts SET user_id = u.id
>   FROM my_users u
>   JOIN phone_numbers pn ON u.phone_significant = pn.significant
>   WHERE contacts.owner_id = 7
>   AND contacts.user_id IS NULL
>   AND contacts.id = pn.ref_contact_id;
>
> If any of the fields are not self explanatory let me know.
> "Significant" is just the right 7 most digits of a raw phone
> number.
>
> I'm more interested in possible improvements to my relational
> logic than the details of the "significant" condition. IOW, I'm
> start enough to optimize the "significant" query but not smart
> enough to know if this is the best approach for the overall
> correlated UPDATE query. :)
>
> So yeah, is this the best way to update a contact's user_id
> reference based on a contacts phone number matching the phone
> number of a user?
>
> One detail from the schema -- A contact can have many phone
> numbers but a user in my system will only ever have just one
> phone number. Hence the JOIN to "phone_numbers" versus the column
> in "my_users".
 
In looking it over, nothing jumped out at me as a problem.  Are you
having some problem with it, like poor performance or getting
results different from what you expected?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Efficient Correlated Update

From
Robert DiFalco
Date:
Well, heh I'm no SQL expert. I kinda piece things together the best I can from what I can read and this was really the only way I could make the UPDATE work correctly. But the plan looks complicated with a lot of hash conditions, hash joins, and scans. I'm worried it wont perform with a very large dataset.

Here's the plan:

Update on public.contacts  (cost=16.64..27.22 rows=42 width=163) (actual time=1.841..1.841 rows=0 loops=1)
   ->  Hash Join  (cost=16.64..27.22 rows=42 width=163) (actual time=1.837..1.837 rows=0 loops=1)
         Output: contacts.dtype, contacts.id, contacts.blocked, contacts.fname, contacts.last_call, contacts.lname, contacts.hash, contacts.record_id, contacts.fb_id, contacts.owner_id, u.id, contacts.device, contacts.ctid, u.ctid, e.ctid
         Hash Cond: ((u.phone_short)::text = (e.significant)::text)
         ->  Seq Scan on public.wai_users u  (cost=0.00..10.36 rows=120 width=46) (actual time=0.022..0.028 rows=6 loops=1)
               Output: u.id, u.ctid, u.phone_short
         ->  Hash  (cost=16.24..16.24 rows=116 width=157) (actual time=1.744..1.744 rows=87 loops=1)
               Output: contacts.dtype, contacts.id, contacts.blocked, contacts.fname, contacts.last_call, contacts.lname, contacts.hash, contacts.record_id, contacts.fb_id, contacts.owner_id, contacts.device, contacts.ctid, e.ctid, e.significant
               Buckets: 1024  Batches: 1  Memory Usage: 12kB
               ->  Hash Join  (cost=10.47..16.24 rows=116 width=157) (actual time=0.636..1.583 rows=87 loops=1)
                     Output: contacts.dtype, contacts.id, contacts.blocked, contacts.fname, contacts.last_call, contacts.lname, contacts.hash, contacts.record_id, contacts.fb_id, contacts.owner_id, contacts.device, contacts.ctid, e.ctid, e.significant
                     Hash Cond: (e.owner_id = contacts.id)
                     ->  Seq Scan on public.phone_numbers e  (cost=0.00..5.13 rows=378 width=22) (actual time=0.008..0.467 rows=378 loops=1)
                           Output: e.ctid, e.significant, e.owner_id
                     ->  Hash  (cost=9.89..9.89 rows=166 width=143) (actual time=0.578..0.578 rows=124 loops=1)
                           Output: contacts.dtype, contacts.id, contacts.blocked, contacts.fname, contacts.last_call, contacts.lname, contacts.hash, contacts.record_id, contacts.fb_id, contacts.owner_id, contacts.device, contacts.ctid
                           Buckets: 1024  Batches: 1  Memory Usage: 16kB
                           ->  Seq Scan on public.contacts  (cost=0.00..9.89 rows=166 width=143) (actual time=0.042..0.365 rows=124 loops=1)
                                 Output: contacts.dtype, contacts.id, contacts.blocked, contacts.fname, contacts.last_call, contacts.lname, contacts.hash, contacts.record_id, contacts.fb_id, contacts.owner_id, contacts.device, contacts.ctid
                                 Filter: ((contacts.user_id IS NULL) AND (contacts.owner_id = 7))
                                 Rows Removed by Filter: 290
 Total runtime: 2.094 ms
(22 rows)

If I wasn't having to update I could write a query like this which seems like it has a much better plan:

dfmvu2a0bvs93n=> explain analyze verbose SELECT c.id                                                                                                                                       FROM wai_users u                                                                                                                                                                          JOIN phone_numbers e ON u.phone_short = e.significant                                                                                                                                     JOIN contacts c ON c.id = e.owner_id                                                                                                                                                      WHERE c.owner_id = 5 AND c.user_id IS NULL                                                                                                                                               ;                                                                                                                                                                                                                                                            QUERY PLAN                                                                    -------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..7.18 rows=1 width=8) (actual time=0.091..0.091 rows=0 loops=1)
   Output: c.id
   ->  Nested Loop  (cost=0.00..7.06 rows=1 width=16) (actual time=0.089..0.089 rows=0 loops=1)
         Output: e.significant, c.id
         ->  Index Scan using idx_contact_owner on public.contacts c  (cost=0.00..3.00 rows=1 width=8) (actual time=0.086..0.086 rows=0 loops=1)
               Output: c.dtype, c.id, c.blocked, c.fname, c.last_call, c.lname, c.hash, c.record_id, c.fb_id, c.owner_id, c.user_id, c.device
               Index Cond: (c.owner_id = 5)
               Filter: (c.user_id IS NULL)
         ->  Index Scan using idx_phone_owner on public.phone_numbers e  (cost=0.00..4.06 rows=1 width=16) (never executed)
               Output: e.id, e.raw_number, e.significant, e.owner_id
               Index Cond: (e.owner_id = c.id)
   ->  Index Only Scan using idx_user_short_phone on public.wai_users u  (cost=0.00..0.12 rows=1 width=32) (never executed)
         Output: u.phone_short
         Index Cond: (u.phone_short = (e.significant)::text)
         Heap Fetches: 0
 Total runtime: 0.158 ms
(16 rows)




On Fri, Aug 9, 2013 at 8:44 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert DiFalco <robert.difalco@gmail.com> wrote:

> In my system a user can have external contacts. When I am
> bringing in external contacts I want to correlate any other
> existing users in the system with those external contacts. A
> users external contacts may or may not be users in my system. I
> have a user_id field in "contacts" that is NULL if that contact
> is not a user in my system
>
> Currently I do something like this after reading in external
> contacts:
>
>   UPDATE contacts SET user_id = u.id
>   FROM my_users u
>   JOIN phone_numbers pn ON u.phone_significant = pn.significant
>   WHERE contacts.owner_id = 7
>   AND contacts.user_id IS NULL
>   AND contacts.id = pn.ref_contact_id;
>
> If any of the fields are not self explanatory let me know.
> "Significant" is just the right 7 most digits of a raw phone
> number.
>
> I'm more interested in possible improvements to my relational
> logic than the details of the "significant" condition. IOW, I'm
> start enough to optimize the "significant" query but not smart
> enough to know if this is the best approach for the overall
> correlated UPDATE query. :)
>
> So yeah, is this the best way to update a contact's user_id
> reference based on a contacts phone number matching the phone
> number of a user?
>
> One detail from the schema -- A contact can have many phone
> numbers but a user in my system will only ever have just one
> phone number. Hence the JOIN to "phone_numbers" versus the column
> in "my_users".
 
In looking it over, nothing jumped out at me as a problem.  Are you
having some problem with it, like poor performance or getting
results different from what you expected?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Efficient Correlated Update

From
Igor Neyman
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-
> performance-owner@postgresql.org] On Behalf Of Kevin Grittner
> Sent: Friday, August 09, 2013 11:44 AM
> To: Robert DiFalco; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Efficient Correlated Update
>
> Robert DiFalco <robert.difalco@gmail.com> wrote:
>
> > In my system a user can have external contacts. When I am bringing in
> > external contacts I want to correlate any other existing users in the
> > system with those external contacts. A users external contacts may or
> > may not be users in my system. I have a user_id field in "contacts"
> > that is NULL if that contact is not a user in my system
> >
> > Currently I do something like this after reading in external
> > contacts:
> >
> >   UPDATE contacts SET user_id = u.id
> >   FROM my_users u
> >   JOIN phone_numbers pn ON u.phone_significant = pn.significant
> >   WHERE contacts.owner_id = 7
> >   AND contacts.user_id IS NULL
> >   AND contacts.id = pn.ref_contact_id;
> >
> > If any of the fields are not self explanatory let me know.
> > "Significant" is just the right 7 most digits of a raw phone number.
> >
> > I'm more interested in possible improvements to my relational logic
> > than the details of the "significant" condition. IOW, I'm start enough
> > to optimize the "significant" query but not smart enough to know if
> > this is the best approach for the overall correlated UPDATE query. :)
> >
> > So yeah, is this the best way to update a contact's user_id reference
> > based on a contacts phone number matching the phone number of a user?
> >
> > One detail from the schema -- A contact can have many phone numbers
> > but a user in my system will only ever have just one phone number.
> > Hence the JOIN to "phone_numbers" versus the column in "my_users".
>
> In looking it over, nothing jumped out at me as a problem.  Are you having
> some problem with it, like poor performance or getting results different from
> what you expected?
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

There is an illness that sometimes strikes database developers/administrators.
It is called CTD - Compulsive Tuning Disorder :)

Igor Neyman