Re: measuring lwlock-related latency spikes - Mailing list pgsql-hackers

From Robert Haas
Subject Re: measuring lwlock-related latency spikes
Date
Msg-id CA+TgmobRVFHKt2pdHAJ_WsAF2SfcJdVm3H1KK0scsQ8Z-WoqeA@mail.gmail.com
Whole thread Raw
In response to Re: measuring lwlock-related latency spikes  (Greg Stark <stark@mit.edu>)
Responses Re: measuring lwlock-related latency spikes  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks?

Nope.  heap_update() and friends do a very complicated little dance to
avoid that.  Heikki articulated that rule when he installed the
visibility map in 8.4, and I had to work pretty hard to preserve it in
9.2 when I did the work to make the visibility map crash-safe, but now
I'm glad I did.

> What about I/O
> waiting for a clog page to be read?

I'm pretty sure that can happen, because TransactionIdIsCommitted()
can get called from HeapTupleSatisfies*() which pretty much only gets
called while holding the page lock.  I don't know whether it's the
cause of these particular stalls, but it's plausible if the CLOG cache
is getting thrashed hard enough.

I did discover one systematic error in my testing methodology: I only
instrumented LWLockAcquire(), not LWLockAcquireOrWait().  The latter
turns out to be an important case in this instance since we use that
when flushing WAL.  Just running the regression tests on my laptop
suggests that with that oversight corrected, WALWriteLock is going to
pop out as a huge source of latency spikes.  But I will know for sure
after I do a more formal test run.  I also modified the code to print
out debugging output every time we have to wait for a long time (I
think I've got it set to 10ms right now, but I might raise that if
it's too verbose in a real test run) with the file and line number
attempting the lock acquisition that blocked, and the file and line
number that had most recently acquired the lock at the time we first
discovered we needed to wait.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Mike Roest
Date:
Subject: Re: [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
Next
From: Peter Geoghegan
Date:
Subject: Re: Publish checkpoint timing and sync files summary data to pg_stat_bgwriter