Re: Improving performance of merging data between tables - Mailing list pgsql-general

From Andy Colson
Subject Re: Improving performance of merging data between tables
Date
Msg-id 54A2B2B2.9050706@squeakycode.net
In response to Re: Improving performance of merging data between tables  (Pawel Veselov <pawel.veselov@gmail.com>)
List pgsql-general
On 12/29/2014 11:29 PM, Pawel Veselov wrote:
>
> Andy,
>
> thanks for looking into this.
>
> On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson <andy@squeakycode.net <mailto:andy@squeakycode.net>> wrote:
>
>     On 12/28/2014 3:49 PM, Pawel Veselov wrote:
>
>         Hi.
>
>         I was wondering if anybody would have any ideas on how to improve
>         certain operations that we are having.
>
>     <SNIP>
>
>
>         Besides "can somebody please look at this and let me know if I'm doing
>         something utterly stupid", here are my questions.
>
>         1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
>         function? All I see is that the calls to merge_all() function take long
>         time, and the CPU is high while this is going on.
>
>
>
>     First, I'll admit I didn't read your entire post.
>
>     I can think of a couple methods:
>
>     1) try each of the statements in merge_all by hand with an "explain analyze" in front to see which is slow.  Look for things that hit big tables without an index.  Check that fk lookups are indexed.
>
>
> If I didn't miss anything, that seems to be OK, even on function-based queries.
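
Just to spell out what I mean by "by hand": something like the following, with made-up column names (keep in mind explain analyze actually executes the statement, so on live data wrap it in a begin/rollback):

explain (analyze, buffers)
update main
   set total = main.total + m.total
  from merge_xxx m
 where main.key = m.key;

Anything doing a sequential scan over a big table, or a nested loop hitting an un-indexed fk, should jump out in the output.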
>
>     2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
>     http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>
>     I have used this to profile some functions, and it worked pretty well.  Mostly I use it on a test box, but once ran it on the live server, which was scary, but worked great.
>
>
> That looks promising. Turned it on, waiting for when I can restart the server at the next "quiet time".
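
For reference, the setup is roughly this (loading the library is what needs the restart):

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

-- then, once, in the database:
create extension pg_stat_statements;

-- and after the next run of merge_all:
select query, calls, total_time, rows
  from pg_stat_statements
 order by total_time desc
 limit 20;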
>
>     3) try auto-explain:
>     http://www.postgresql.org/docs/9.4/static/auto-explain.html
>
>     I've never used it, so don't know if it'll show each statement inside a function.  Dumps stuff to the log AFAIK, so you'll have to dig out the info by hand.
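
(If you do try it, I believe the relevant settings are along these lines; log_nested_statements is what makes it log the statements inside a pl/pgsql function:)

# postgresql.conf, or LOAD 'auto_explain' in a single session
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '250ms'
auto_explain.log_nested_statements = on
auto_explain.log_analyze = on      # actual run times, not just plans; adds overhead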
>
>     > 2) Is there a better way to merge individual rows, except doing
>     > UPDATE/INSERT in a loop, and would that be CPU expensive?
>     >
>
>     Not that I know of.  I use pretty much the same thing.  Soon!  we will have merge/upsert support.  Hopefully it'll be fast.
>
>
> Well, anytime I cancelled the PID that was executing this whole mess, it would always stop at UPDATE ... SET ... WHERE on the main table.  Which does make me believe that bulk update would really help.
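
For the record, the per-row pattern I use is the usual update-then-insert retry loop, roughly like the example in the docs (table and column names here are made up):

loop
    update main set total = total + _total where key = _key;
    if found then
        return;
    end if;
    begin
        insert into main (key, total) values (_key, _total);
        return;
    exception when unique_violation then
        -- someone else inserted the row first; loop back and update it
    end;
end loop;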
>
>     > 3) Is there a better way to merge whole tables? However, note that I
>     > need to translate primary keys from node main table into the common main
>     > table, as they are used as foreign keys, hence the loops. I suspect the
>     > looping is CPU intensive.
>
>     Avoiding loops and doing things as sets is the best way, if possible.  The only loop I saw was looping over the merge_xxx tables, which is probably the only way.
>
>
> There is an endless loop that is just a device for merging, but then there are loops going over each record in all the tables that are being merged, feeding them into the function that actually does the merge.  That table iteration is what I want to eliminate (especially if I knew it would help :) )
>
>     If possible (if you haven't already) you could add an extra column to your secondary table that you can set as the main table's key.
>
>     bulk insert into second;
>     update second set magic = (select key from main where ... );
>
>     Then, maybe, you can do two ops in batch:
>
>     update main (where key exists in main)
>     insert into main (where key not exists in main)
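
Spelled out a bit more (I'm guessing at names and columns here):

-- resolve the common main table's key once, in bulk:
update second s
   set magic = m.id
  from main m
 where m.node_key = s.node_key;

-- then two set-based statements instead of a per-row loop:
update main m
   set total = m.total + s.total
  from second s
 where s.magic = m.id;

insert into main (node_key, total)
select s.node_key, s.total
  from second s
 where s.magic is null;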
>
>
> I was thinking along the same lines.  I can't really do bulk insert, at any point, because any key can be inserted by another process at any time, and with a good probability.  However, there will be a lot fewer inserts than updates.  So, in general, I'm making it do this:
>

What about transactions?  I assume you do something like:

begin;
merge_all;
commit;


Depending on your transaction isolation level, you could ensure that nobody can insert while you are inserting.
I've never used the different isolation levels, so not 100% sure.  Even then, maybe a lock on the table itself, like:

begin;
lock;
merge_all;
unlock;
commit;
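
(There's no separate "unlock" statement; the lock is held until the transaction ends.  So in real SQL it would look more like this, where the lock mode is a guess: share row exclusive blocks other writers but still lets readers through.)

begin;
lock table main in share row exclusive mode;
select merge_all();
commit;   -- lock released here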

This way only one process at a time can do the insert/update, but you can do them in batch and not in a loop.  It might be faster that way.  Other processes might wait for the lock a little bit, but if merge_all was faster in general, the time to wait for the lock would be less than the entire merge process itself.

I'm totally guessing here.

-Andy

