Re: pg_basebackup blocking all queries with horrible performance - Mailing list pgsql-admin
From | Magnus Hagander |
---|---|
Subject | Re: pg_basebackup blocking all queries with horrible performance |
Date | |
Msg-id | CABUevEzcJNNRHQNn=USd9McPShLuR4UT41ycKQJG6356ifti5A@mail.gmail.com |
In response to | Re: pg_basebackup blocking all queries with horrible performance (Lonni J Friedman <netllama@gmail.com>) |
Responses | Re: pg_basebackup blocking all queries with horrible performance |
List | pgsql-admin |
On Tue, Jun 12, 2012 at 8:37 PM, Lonni J Friedman <netllama@gmail.com> wrote:
> On Tue, Jun 12, 2012 at 10:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Jun 12, 2012 at 2:37 AM, Lonni J Friedman <netllama@gmail.com> wrote:
>>> On Fri, Jun 8, 2012 at 7:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama@gmail.com> wrote:
>>>>> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
>>>>>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>>>>>
>>>>>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers <gsievers19@comcast.net> wrote:
>>>>>>>>
>>>>>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>>>>>> if the problem goes away. SIGCONT and you should start having
>>>>>>>> sluggishness again.
>>>>>>>>
>>>>>>>> If verified, then any sort of throttling mechanism should work.
>>>>>>>
>>>>>>> I'm certain that the problem is triggered only when pg_basebackup is
>>>>>>> running. It's very predictable, and goes away as soon as pg_basebackup
>>>>>>> finishes running. What do you mean by a throttling mechanism?
>>>>>>
>>>>>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>>>>>> pg_basebackup, so it's still running but not currently doing work, does the
>>>>>> problem go away? Does it come back when you unpause pg_basebackup? That's
>>>>>> what Jerry was telling you to try.
>>>>>>
>>>>>> If the problem goes away when you pause pg_basebackup and comes back when
>>>>>> you unpause it, it's probably a system load problem.
>>>>>>
>>>>>> If it doesn't go away, it's more likely to be a locking issue or something
>>>>>> _other_ than simple load.
>>>>>>
>>>>>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>>>>>> it, so on Linux you can use these to find out. When you SIGSTOP
>>>>>> pg_basebackup, the postgres backend associated with it should block
>>>>>> shortly afterwards as its buffers fill up and it can't send more data, so
>>>>>> the load should come off the server.
>>>>>>
>>>>>> A "throttling mechanism" refers to anything that limits the rate or speed of
>>>>>> a thing. In this case, what you want to do if your problem is system
>>>>>> overload is to limit the speed at which pg_basebackup does its work so other
>>>>>> things can still get work done. In other words, you want to throttle it.
>>>>>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>>>>>> change I/O and CPU priority, respectively.
>>>>>>
>>>>>> Note that you may need to change the priority of the *backend* that
>>>>>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>>>>>> I haven't done enough with Pg's replication to know how that works, so
>>>>>> someone else will have to fill that bit in.
>>>>>
>>>>> Thanks for your reply. I've confirmed that issuing a SIGSTOP does
>>>>> eliminate the thrashing, and issuing a SIGCONT resumes the thrashing.
>>>>>
>>>>> I've looked at iostat output both before & during pg_basebackup runs,
>>>>> and I'm not seeing any indication that the problem is due to disk IO
>>>>> bottlenecks. The numbers don't vary very much at all between the good
>>>>> & bad times.
>>>>> This is typical when pg_basebackup is running:
>>>>> ########
>>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>> md0        0.00    0.00  67.76  68.62   4.42   1.46    88.34     0.00   0.00    0.00    0.00   0.00   0.00
>>>>> ########
>>>>>
>>>>> and this is when the system is ok:
>>>>> ########
>>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>> md0        0.00    0.00  68.04  68.56   4.44   1.46    88.39     0.00   0.00    0.00    0.00   0.00   0.00
>>>>> ########
>>>>>
>>>>> I looked at vmstat output, but nothing is jumping out at me as being
>>>>> dramatically different when pg_basebackup is running. Swap in and
>>>>> swap out are zero 100% of the time in both the good & bad perf cases. I
>>>>> can post example output if someone is interested, or if there's
>>>>> something specific that I should be looking at as a potential problem,
>>>>> let me know.
>>>>
>>>> Did you set synchronous_standby_names to '*'? If so, the problem you
>>>> encountered can happen.
>>>>
>>>> When synchronous_standby_names is '*', you cannot control which
>>>> standbys take the role of synchronous standby. A standby which you
>>>> expect to run as an asynchronous one might become the synchronous one. So
>>>> my guess is that at first one of your three standbys was running as the
>>>> synchronous standby, and all queries were executed normally. But
>>>> when you started pg_basebackup, pg_basebackup unexpectedly
>>>> took over the role of synchronous standby from another standby. Since
>>>> pg_basebackup doesn't send information about replication
>>>> progress back to the master, all queries (more precisely, transaction
>>>> commits) got stuck, waiting for a reply from the synchronous
>>>> standby.
>>>>
>>>> You can avoid this problem by setting synchronous_standby_names
>>>> to the names of your standbys instead of '*'.
>>>
>>> I don't have synchronous_standby_names set at all. I'm only doing
>>> asynchronous replication.
>>
>> Hmm... I have no idea what happened in your environment, for now.
>> Could you show me a self-contained test case?
>
> I'm running the following, which gets piped over ssh to a remote
> server (at gigabit ethernet speed):
> pg_basebackup -v -D - -x -Ft -U postgres
>
> One thing that I've discovered is that if I throttle back the speed of
> what is getting piped to the remote server, that directly correlates
> to the load on the server.

That seems to indicate that you're overloading the I/O system... or the CPU, but more likely I/O.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
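For anyone wanting to reproduce the diagnosis above: a minimal sketch of Craig's pause/resume test and of priority-based throttling, assuming a Linux host where pgrep is available and pg_basebackup runs locally. The backend PID below is a placeholder; on the master it can be looked up in pg_stat_replication.

    # Pause the running pg_basebackup; if query latency recovers while it is
    # stopped, the slowdown is a load problem rather than a locking problem.
    kill -STOP "$(pgrep -f pg_basebackup)"

    # ...observe the server for a while, then resume and see if it returns...
    kill -CONT "$(pgrep -f pg_basebackup)"

    # Throttle by priority: lower the I/O and CPU priority of the walsender
    # backend that serves the backup (12345 is a placeholder PID).
    # Note: ionice only takes effect with I/O schedulers that support
    # priorities (e.g. CFQ).
    BACKEND_PID=12345
    ionice -c 3 -p "$BACKEND_PID"   # idle I/O scheduling class
    renice -n 19 -p "$BACKEND_PID"  # lowest CPU priority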
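Fujii's suggestion of naming the standbys instead of using '*' is a one-line postgresql.conf change on the master. A rough sketch, where the standby names are hypothetical and must match the application_name each standby sets in its primary_conninfo:

    # Hypothetical standby names; each must match a standby's application_name.
    echo "synchronous_standby_names = 'standby1, standby2, standby3'" \
        >> "$PGDATA/postgresql.conf"
    pg_ctl reload -D "$PGDATA"   # the parameter is reloadable; no restart needed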
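Since throttling the pipe tracks the load so directly, one way to cap the stream is to put a rate limiter such as pv between pg_basebackup and ssh. The rate, host name, and target path below are placeholders, not values from this thread:

    # Cap the backup stream at roughly 10 MB/s; once the pipe backs up, the
    # walsender on the master slows its reads as well.
    pg_basebackup -v -D - -x -Ft -U postgres \
        | pv -L 10m \
        | ssh backuphost 'cat > /path/to/base.tar'

Unlike ionice/renice, which only adjust scheduling priority, an explicit rate limit like this caps the raw read rate regardless of the scheduler in use.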