Thread: measuring lwlock-related latency spikes

measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Actually, what is really bugging me is that I cannot find any way of
> getting a profile that reflects the *time* spent waiting rather than
> merely the *number* of waits.  That seems like an obvious thing to
> want, and I cannot find a way to get it.

I ended up taking a stab at writing some userspace code for this - see
attached.  It measures the amount of time taken for each contended
lwlock acquisition and prints a crude histogram of the results at
postmaster-shutdown time.  At least on the IBM POWER7 machine, running
pgbench with 32 threads, where pg_test_timing shows that getting the
time takes less than a microsecond 96%+ of the time, this seemed to
have no real impact on the tps numbers - perhaps because the workload
is I/O bound.  Risking the possible ire of people who object to large
attachments, I'm attaching the results this generated on a 30-minute,
32-thread pgbench run at scale factor 300.  To minimize said ire, I've
run the file through bzip2.
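
For reference, the histogram is keyed on fls() of the wait time in
microseconds, so bucket N counts waits of roughly 2^(N-1) to 2^N
microseconds.  A minimal sketch of that kind of bucketing (illustrative
only -- this is not the attached patch, and the names are made up)
looks like this:

    #include <stdint.h>

    #define NUM_WAIT_BUCKETS 32

    static uint64_t wait_histogram[NUM_WAIT_BUCKETS];

    /* fls(): position of the highest set bit, 1-based; 0 for an input of 0 */
    static int
    my_fls(unsigned long v)
    {
        int         bit = 0;

        while (v)
        {
            v >>= 1;
            bit++;
        }
        return bit;
    }

    /* Record one contended acquisition that took wait_us microseconds. */
    static void
    record_contended_wait(unsigned long wait_us)
    {
        int         bucket = my_fls(wait_us);   /* 3us -> bucket 2, ~1s -> bucket 20 */

        if (bucket >= NUM_WAIT_BUCKETS)
            bucket = NUM_WAIT_BUCKETS - 1;
        wait_histogram[bucket]++;
    }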

What's interesting about this is that, while there is plenty of
waiting for the usual suspects - ProcArrayLock (4), WALInsertLock (7),
and CLogControlLock (11), the waits are all pretty short:

2012-03-31 02:33:25 UTC [50305] LOG:  lock 0: 2:520, 3:2222
2012-03-31 02:33:25 UTC [50305] LOG:  lock 3: 1:9, 2:36838
2012-03-31 02:33:25 UTC [50305] LOG:  lock 4: 1:33, 2:216964
2012-03-31 02:33:25 UTC [50305] LOG:  lock 7: 1:39, 2:406249
2012-03-31 02:33:25 UTC [50305] LOG:  lock 8: 1:4, 2:34
2012-03-31 02:33:25 UTC [50305] LOG:  lock 11: 1:99, 2:374559
2012-03-31 02:33:25 UTC [50305] LOG:  lock 17: 2:24, 3:24

That's saying that there were over 400,000 contended acquisitions of
WALInsertLock, but the longest one had fls(wait_time_in_us) = 2, or in
other words it took less than 4us to get the lock.  So what happens if
we grep the log file for the biggest offenders?

2012-03-31 02:33:25 UTC [50305] LOG:  lock 204610: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 272958: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325412: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325784: 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 360016: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 444886: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 499890: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 533418: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 610484: 6:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 897798: 22:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1027128: 7:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1074256: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1132586: 5:1, 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1312178: 16:1, 22:1

If I'm reading this right, fls(wait_time_in_us) = 20 means a delay of
roughly half a second to a second, which I think means that a couple of
those waits were somewhere in the *4-8 second* range.  gdb reveals that
in the test configuration, all odd
numbered locks between 169 and 2097319 are some buffer's
io_in_progress_lock, and all even numbered locks between 170 and
2097320 are some buffer's content_lock, which means, if I'm not
confused here, that every single lwlock-related stall > 1s happened
while waiting for a buffer content lock.  Moreover, each event
affected a different buffer.  I find this result so surprising that I
have a hard time believing that I haven't screwed something up, so if
anybody can check over the patch and this analysis and suggest what
that thing might be, I would appreciate it.  I would be a lot less
surprised if the waits revolved around the IO-in-progress locks, since
it's not that hard to imagine an I/O taking a really long time on a
busy system.  But I didn't think we were ever supposed to hold content
locks for that long.

The other thing that baffles me about these numbers is that they don't
provide any fodder for explaining the periodic drops in throughput
that happen when the system checkpoints.  I had assumed they would
show up as long lwlock waits, like somebody hanging on to
WALInsertLock while everybody else in the system piles up behind them.
That's not reflected in these numbers - the few very long waits show
just ONE guy waiting a really long time for the lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> which means, if I'm not
> confused here, that every single lwlock-related stall > 1s happened
> while waiting for a buffer content lock.  Moreover, each event
> affected a different buffer.  I find this result so surprising that I
> have a hard time believing that I haven't screwed something up, so if
> anybody can check over the patch and this analysis and suggest what
> that thing might be, I would appreciate it.

Possible candidates are

1) pages on the RHS of the PK index on accounts. When the page splits
a new buffer will be allocated and the contention will move to the new
buffer. Given so few stalls, I'd say this was the block one level
above the leaf level.

2) Buffer writes hold the content lock in shared mode, so a delayed
I/O during checkpoint on a page requested by another for write would
show up as a wait for a content lock. That might happen to updates
where checkpoint write occurs between the search and write portions of
the update.

The next logical step in measuring lock waits is to track the reason
for the lock wait, not just the lock wait itself.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Sat, Mar 31, 2012 at 4:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The next logical step in measuring lock waits is to track the reason
> for the lock wait, not just the lock wait itself.

I had the same thought.  I'm not immediately sure what the best way to
do that is, but I'll see if I can figure something out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> But I didn't think we were ever supposed to hold content
> locks for that long.

Isn't that lock held while doing visibility checks? What about I/O
waiting for a clog page to be read?

-- 
greg


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Sat, Mar 31, 2012 at 1:58 PM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks? What about I/O
> waiting for a clog page to be read?

So what we should be logging is the list of lwlocks held when the lock
wait occurred.

That would differentiate call paths somewhat better than just looking
at the current lock request.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But I didn't think we were ever supposed to hold content
>> locks for that long.
>
> Isn't that lock held while doing visibility checks?

Nope.  heap_update() and friends do a very complicated little dance to
avoid that.  Heikki articulated that rule when he installed the
visibility map in 8.4, and I had to work pretty hard to preserve it in
9.2 when I did the work to make the visibility map crash-safe, but now
I'm glad I did.

> What about I/O
> waiting for a clog page to be read?

I'm pretty sure that can happen, because TransactionIdIsCommitted()
can get called from HeapTupleSatisfies*() which pretty much only gets
called while holding the page lock.  I don't know whether it's the
cause of these particular stalls, but it's plausible if the CLOG cache
is getting thrashed hard enough.

I did discover one systematic error in my testing methodology: I only
instrumented LWLockAcquire(), not LWLockAcquireOrWait().  The latter
turns out to be an important case in this instance since we use that
when flushing WAL.  Just running the regression tests on my laptop
suggests that with that oversight corrected, WALWriteLock is going to
pop out as a huge source of latency spikes.  But I will know for sure
after I do a more formal test run.  I also modified the code to print
out debugging output every time we have to wait for a long time (I
think I've got it set to 10ms right now, but I might raise that if
it's too verbose in a real test run) with the file and line number
attempting the lock acquisition that blocked, and the file and line
number that had most recently acquired the lock at the time we first
discovered we needed to wait.
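
As a rough illustration of that kind of instrumentation, here is a
caller-side sketch (not the actual patch, which instruments
LWLockAcquire()/LWLockAcquireOrWait() internally, measures only the
contended path, and also remembers the last acquirer's file and line):

    #include "postgres.h"

    #include <sys/time.h>

    #include "storage/lwlock.h"

    #define LWLOCK_WAIT_LOG_THRESHOLD_MS 10.0

    /* Wrapper macro so the blocked call site's file/line gets captured. */
    #define TimedLWLockAcquire(lockid, mode) \
        timed_lwlock_acquire((lockid), (mode), __FILE__, __LINE__)

    static void
    timed_lwlock_acquire(LWLockId lockid, LWLockMode mode,
                         const char *file, int line)
    {
        struct timeval start, stop;
        double      elapsed_ms;

        gettimeofday(&start, NULL);
        LWLockAcquire(lockid, mode);
        gettimeofday(&stop, NULL);

        elapsed_ms = (stop.tv_sec - start.tv_sec) * 1000.0 +
                     (stop.tv_usec - start.tv_usec) / 1000.0;
        if (elapsed_ms >= LWLOCK_WAIT_LOG_THRESHOLD_MS)
            elog(LOG, "lock %d: waited %.3f ms at %s:%d",
                 (int) lockid, elapsed_ms, file, line);
    }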

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Isn't that lock held while doing visibility checks?
>
> Nope.  heap_update() and friends do a very complicated little dance to
> avoid that.
...
>> What about I/O
>> waiting for a clog page to be read?
>
> I'm pretty sure that can happen

I'm confused because I thought these two sentences were part of
describing the same case.

> because TransactionIdIsCommitted()
> can get called from HeapTupleSatisfies*() which pretty much only gets
> called while holding the page lock.  I don't know whether it's the
> cause of these particular stalls, but it's plausible if the CLOG cache
> is getting thrashed hard enough.

I wonder if it would make sense, if we come across an xid that isn't
in the slru, to release the lock while we read in the clog page.  When
we reobtain it we can check if the LSN has changed, and if it has,
restart the visibility checks.  If it hasn't, pick up where we left off.
--
greg


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Sat, Mar 31, 2012 at 6:01 PM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Isn't that lock held while doing visibility checks?
>>
>> Nope.  heap_update() and friends do a very complicated little dance to
>> avoid that.
> ...
>>> What about I/O
>>> waiting for a clog page to be read?
>>
>> I'm pretty sure that can happen
>
> I'm confused because I thought these two sentences were part of
> describing the same case.

Oh, I thought you were talking about the visibility *map*.  Sorry.

>> because TransactionIdIsCommitted()
>> can get called from HeapTupleSatisfies*() which pretty much only gets
>> called while holding the page lock.  I don't know whether it's the
>> cause of these particular stalls, but it's plausible if the CLOG cache
>> is getting thrashed hard enough.
>
> I wonder if it would make sense, if we come across an xid that isn't
> in the slru, to release the lock while we read in the clog page.  When
> we reobtain it we can check if the LSN has changed, and if it has,
> restart the visibility checks.  If it hasn't, pick up where we left off.

I've discovered a bug in my code that was causing it to print at most
2 histogram buckets per lwlock, which obviously means that my previous
results were totally inaccurate.  Ah, the joys of benchmarking.  I
found the problem when I added code to log a message any time an
lwlock wait exceeded a certain time threshold, and it fired far more
often than the previous results would have indicated.  In particular,
it turns out that long waits for WALInsertLock are extremely common
and not, as the previous results appeared to indicate, unheard-of.
I'm rerunning my tests now and will post the updated,
hopefully-accurate results when that's done.

/me attempts to remove egg from face.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate.  Ah, the joys of benchmarking.  I
> found the problem when I added code to log a message any time an
> lwlock wait exceeded a certain time threshold, and it fired far more
> often than the previous results would have indicated.  In particular,
> it turns out that long waits for WALInsertLock are extremely common
> and not, as the previous results appeared to indicate, unheard-of.
> I'm rerunning my tests now and will post the updated,
> hopefully-accurate results when that's done.
>
> /me attempts to remove egg from face.

All right, so with the aforementioned bug fixed (see attached, revised
patch), there are now massive latency spikes popping out all over the
place: on my latest run, there were 377 distinct lwlocks that took >=
1s to acquire on at least one occasion during this 30-minute run.
Some of those locks, of course, had more than one problem event.  In
total, somebody waited >= 1 s for a lock 4897 times during this test.
These break down as follows.  Note that the "blocked by" is the person
who had most recently acquired the lock as of the start of the wait,
and is not necessarily solely responsible for the full duration of the
wait due to shared locks and queue jumping.

      1 waited at heapam.c:1651 blocked by bufmgr.c:2475
      1 waited at heapam.c:2844 blocked by heapam.c:2758
      1 waited at hio.c:335 blocked by heapam.c:1651
      1 waited at hio.c:336 blocked by hio.c:336
      1 waited at indexam.c:521 blocked by hio.c:345
      1 waited at xlog.c:2090 blocked by xlog.c:2090
      2 waited at bufmgr.c:1671 blocked by bufmgr.c:2475
      2 waited at indexam.c:521 blocked by heapam.c:3464
      2 waited at nbtpage.c:650 blocked by nbtinsert.c:124
      2 waited at xlog.c:1502 blocked by xlog.c:2090
      2 waited at xlog.c:2241 blocked by xlog.c:1502
      3 waited at slru.c:310 blocked by slru.c:404
      4 waited at indexam.c:521 blocked by hio.c:335
      4 waited at indexam.c:521 blocked by hio.c:336
      4 waited at xlog.c:2241 blocked by xlog.c:2090
      6 waited at hio.c:336 blocked by heapam.c:2758
     12 waited at indexam.c:521 blocked by bufmgr.c:2475
     20 waited at xlog.c:2090 blocked by xlog.c:2241
     26 waited at heapam.c:2758 blocked by indexam.c:521
     29 waited at heapam.c:2758 blocked by heapam.c:2758
     80 waited at xlog.c:1502 blocked by xlog.c:2241
     89 waited at indexam.c:521 blocked by heapam.c:2758
    115 waited at varsup.c:65 blocked by varsup.c:65
   1540 waited at slru.c:310 blocked by slru.c:526
   2948 waited at xlog.c:909 blocked by xlog.c:909

xlog.c:909 is the LWLockAcquire of WALInsertLock from within
XLogInsert.  slru.c:310 is in SimpleLruWaitIO(), where we attempt to
grab the SLRU buffer lock in shared mode after releasing the control
lock.  slru.c:526 is in SlruInternalWritePage(), where we hold the
buffer lock while writing the page.  This is commit
194b5ea3d0722f94e8a6ba9cec03b858cc8c9370, if you want to look up any of
the other line numbers.

If I filter for waits greater than 8s, a somewhat different picture emerges:

      2 waited at indexam.c:521 blocked by bufmgr.c:2475
    212 waited at slru.c:310 blocked by slru.c:526

In other words, some of the waits for SLRU pages to be written are...
really long.  There were 126 that exceeded 10 seconds and 56 that
exceeded 12 seconds.  "Painful" is putting it mildly.

I suppose one interesting question is to figure out if there's a way I
can optimize the disk configuration in this machine, or the Linux I/O
scheduler, or something, so as to reduce the amount of time it spends
waiting for the disk.  But the other thing is why we're waiting for
SLRU page writes to begin with.  My guess based on previous testing is
that what's happening here is (1) we examine a tuple on an old page
and decide we must look up its XID, (2) the relevant CLOG page isn't
in cache so we decide to read it, but (3) the page we decide to evict
happens to be dirty, so we have to write it first.  That sure seems
like something that a smart background writer ought to be able to fix
for us.  Simon previously posted a patch for that:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but the testing I did at the time didn't seem to show a real clear benefit:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

The obvious question here is: was that a problem with the patch, or a
problem with my testing methodology, or is it just that the
performance characteristics of the machine I used for that test (Nate
Boley's 32-core AMD box) were different from this one (IBM POWER7)?  I
don't know, and I think I'm out of time to play with this for this
weekend, but I'll investigate further when time permits.  Other
thoughts welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> If I filter for waits greater than 8s, a somewhat different picture emerges:
>
>      2 waited at indexam.c:521 blocked by bufmgr.c:2475
>    212 waited at slru.c:310 blocked by slru.c:526
>
> In other words, some of the waits for SLRU pages to be written are...
> really long.  There were 126 that exceeded 10 seconds and 56 that
> exceeded 12 seconds.  "Painful" is putting it mildly.

Interesting. The total wait contribution from those two factors
exceeds the WALInsertLock wait.

> I suppose one interesting question is to figure out if there's a way I
> can optimize the disk configuration in this machine, or the Linux I/O
> scheduler, or something, so as to reduce the amount of time it spends
> waiting for the disk.  But the other thing is why we're waiting for
> SLRU page writes to begin with.

First, we need to determine that it is the clog where this is happening.

Also, you're assuming this is an I/O issue. I think it's more likely
that this is a lock starvation issue. Shared locks queue jump
continually over the exclusive lock, blocking access for long periods.

I would guess that is also the case with the index wait, where I would
guess a near-root block needs an exclusive lock, but is held up by
continual index tree descents.

My (fairly old) observation is that the shared lock semantics only
work well when exclusive locks are fairly common. When they are rare,
the semantics work against us.

We should either implement 1) non-queue-jump semantics for certain
cases, or 2) put a limit on the number of queue jumps that can occur
before we let the next exclusive lock proceed instead. (2) sounds better, but
keeping track might well cause greater overhead.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> First, we need to determine that it is the clog where this is happening.

I can confirm that based on the LWLockIds.  There were 32 of them
beginning at lock id 81, and a gdb session confirms that
ClogCtlData->shared->buffer_locks[0..31] point to exactly that set of
LWLockIds.

> Also, you're assuming this is an I/O issue. I think it's more likely
> that this is a lock starvation issue. Shared locks queue jump
> continually over the exclusive lock, blocking access for long periods.

That is a possible issue in general, but I can't see how it could be
happening here, because the shared lock is only a mechanism for
waiting for an I/O to complete.  The backend doing the I/O grabs the
control lock, sets a flag saying there's an I/O in progress, takes the
buffer lock in exclusive mode, and releases the control lock.  The
shared locks are taken when someone notices that the flag is set on a
buffer they want to access.  So there aren't any shared lockers until
the buffer is already locked in exclusive mode.  Or at least I don't
think there are; please correct me if I'm wrong.

Now... I do think it's possible that this could happen: backend #1
wants to write the buffer, so grabs the lock and writes the buffer.
Meanwhile some waiters pile up.  When the guy doing the I/O finishes,
he releases the lock, releasing all the waiters.  They then have to
wake up and grab the lock, but maybe before they (or some of them) can
do it somebody else starts another I/O on the buffer and they all have
to go back to sleep.  That could allow the wait time to be many times
the I/O time.  If that's the case we could just make this use
LWLockAcquireOrWait(); the calling code is just going to pick a new
victim buffer anyway, so it's silly to go through additional spinlock
cycles to acquire a lock we don't want anyway.
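
Just to make the idea concrete, here is a rough sketch of what that
change could look like (assuming the 9.2-era slru.c structures --
shared->ControlLock and shared->buffer_locks[] -- with a made-up helper
name; a sketch, not a tested patch):

    #include "postgres.h"

    #include "access/slru.h"
    #include "storage/lwlock.h"

    /*
     * Sketch only: wait for an in-progress SLRU buffer I/O to finish without
     * insisting on holding the buffer lock afterwards.  If
     * LWLockAcquireOrWait() returns false we merely waited for the lock to
     * become free; either way the I/O that was in progress is complete, so
     * the caller can re-examine the slot or pick a different victim.
     */
    static void
    WaitForSlruBufferIO(SlruCtl ctl, int slotno)
    {
        SlruShared  shared = ctl->shared;

        LWLockRelease(shared->ControlLock);
        if (LWLockAcquireOrWait(shared->buffer_locks[slotno], LW_SHARED))
            LWLockRelease(shared->buffer_locks[slotno]);
        LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
    }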

I bet I can add some more instrumentation to get clearer data on what
is happening here.  What I've added so far doesn't seem to be
affecting performance very much.

> I would guess that is also the case with the index wait, where I would
> guess a near-root block needs an exclusive lock, but is held up by
> continual index tree descents.
>
> My (fairly old) observation is that the shared lock semantics only
> work well when exclusive locks are fairly common. When they are rare,
> the semantics work against us.
>
> We should either implement 1) non-queue-jump semantics for certain
> cases, or 2) put a limit on the number of queue jumps that can occur
> before we let the next exclusive lock proceed instead. (2) sounds better, but
> keeping track might well cause greater overhead.

Maybe, but your point that we should characterize the behavior before
engineering solutions is well-taken, so let's do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Sun, Apr 1, 2012 at 1:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Apr 1, 2012 at 7:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> First, we need to determine that it is the clog where this is happening.
>
> I can confirm that based on the LWLockIds.  There were 32 of them
> beginning at lock id 81, and a gdb session confirms that
> ClogCtlData->shared->buffer_locks[0..31] point to exactly that set of
> LWLockIds.
>
>> Also, you're assuming this is an I/O issue. I think it's more likely
>> that this is a lock starvation issue. Shared locks queue jump
>> continually over the exclusive lock, blocking access for long periods.
>
> That is a possible issue in general, but I can't see how it could be
> happening here, because the shared lock is only a mechanism for
> waiting for an I/O to complete.  The backend doing the I/O grabs the
> control lock, sets a flag saying there's an I/O in progress, takes the
> buffer lock in exclusive mode, and releases the control lock.  The
> shared locks are taken when someone notices that the flag is set on a
> buffer they want to access.  So there aren't any shared lockers until
> the buffer is already locked in exclusive mode.  Or at least I don't
> think there are; please correct me if I'm wrong.

Agreed.

Before the exclusive lock holder releases the lock it must acquire the
control lock in exclusive mode (line 544).

So lock starvation on the control lock would cause a long wait after
each I/O, making it look like an I/O problem.

Anyway, just to note that it might not be I/O and we need to find out.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> So lock starvation on the control lock would cause a long wait after
> each I/O, making it look like an I/O problem.

Except that both of the locks involved in his smoking gun occur
*after* the control lock has already been acquired. The one that's
actually being blocked for a long time is in fact acquiring a shared
lock which the queue jumping couldn't be hurting.

We know you're convinced about the queue jumping being a problem, and
it's definitely a plausible problem, but I think you need exactly the
kind of instrumentation Robert is doing here to test that theory.
Without it even if everyone agreed it was a real problem we would have
no idea whether a proposed change fixed it.


Fwiw this instrumentation is *amazing*. As a user this kind of rare
random stall is precisely the kind of thing that totally kills me. I
would so much rather run a web site on a database where each query
took twice as long but it guaranteed that no query would take over a
second than one that was twice as fast on average but occasionally
gets stuck for 12s.


-- 
greg


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> My guess based on previous testing is
> that what's happening here is (1) we examine a tuple on an old page
> and decide we must look up its XID, (2) the relevant CLOG page isn't
> in cache so we decide to read it, but (3) the page we decide to evict
> happens to be dirty, so we have to write it first.

Reading the code one possibility is that in the time we write the
oldest slru page another process has come along and redirtied it. So
we pick a new oldest slru page and write that. By the time we've
written it another process could have redirtied it again. On a loaded
system where the writes are taking 100ms or more it's conceivable --
barely -- that could happen over and over again hundreds of times.

In general the locking and reasoning about concurrent attempts to read
pages here makes my head swim. It looks like even if there's a lot of
contention for the same page or same slot it shouldn't manifest itself
that way but it seems like the kind of logic with multiple locks and
retries that is prone to priority inversion type problems. I wonder if
more detailed instrumentation showing the sequence of operations taken
while holding a lock that somebody got stuck on would help.

-- 
greg


Re: measuring lwlock-related latency spikes

From
"Albe Laurenz"
Date:
Robert Haas wrote:
> I suppose one interesting question is to figure out if there's a way I
> can optimize the disk configuration in this machine, or the Linux I/O
> scheduler, or something, so as to reduce the amount of time it spends
> waiting for the disk.

I'd be curious to know if using the deadline scheduler will improve
things.  I have experienced pretty bad performance with cfq under
load, where sequential table scans were starved to the point where
they took hours instead of less than a minute (on an idle system).
But I believe that also depends a lot on the storage system used.

Yours,
Laurenz Albe


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Sun, Apr 1, 2012 at 11:12 PM, Greg Stark <stark@mit.edu> wrote:
> On Sun, Apr 1, 2012 at 10:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> So lock starvation on the control lock would cause a long wait after
>> each I/O, making it look like an I/O problem.
>
> Except that both of the locks involved in his smoking gun occur
> *after* the control lock has already been acquired. The one that's
> actually being blocked for a long time is in fact acquiring a shared
> lock which the queue jumping couldn't be hurting.

Not true, please refer to code at line 544, as I already indicated.

My understanding of the instrumentation is that the lock acquired at
line 526 will show as the blocker until we reach line 555, so anything
in between could be responsible for the wait.

(As long as there are multiple possibilities, I will remain convinced
that the cause could be any of them.)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Mon, Apr 2, 2012 at 12:00 AM, Greg Stark <stark@mit.edu> wrote:
> On Sun, Apr 1, 2012 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My guess based on previous testing is
>> that what's happening here is (1) we examine a tuple on an old page
>> and decide we must look up its XID, (2) the relevant CLOG page isn't
>> in cache so we decide to read it, but (3) the page we decide to evict
>> happens to be dirty, so we have to write it first.
>
> Reading the code one possibility is that in the time we write the
> oldest slru page another process has come along and redirtied it. So
> we pick a new oldest slru page and write that. By the time we've
> written it another process could have redirtied it again. On a loaded
> system where the writes are taking 100ms or more it's conceivable --
> barely -- that could happen over and over again hundreds of times.

That's a valid concern but I don't think the instrumentation would
show that as a single long wait because the locks would be released
and be retaken each time around the loop - I guess that's for Robert
to explain how it would show up.

If it doesn't show it, then the actual max wait time could be even higher. ;-(

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Not true, please refer to code at line 544, as I already indicated.
>
> My understanding of the instrumentation is that the lock acquired at
> line 526 will show as the blocker until we reach line 555, so anything
> in between could be responsible for the wait.

Hm, but then wouldn't the lock acquisition at line 544 be showing up as well?


-- 
greg


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Mon, Apr 2, 2012 at 11:49 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Apr 2, 2012 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Not true, please refer to code at line 544, as I already indicated.
>>
>> My understanding of the instrumentation is that the lock acquired at
>> line 526 will show as the blocker until we reach line 555, so anything
>> in between could be responsible for the wait.
>
> Hm, but then wouldn't the lock acquisition at line 544 be showing up as well?

Some time ago on this thread, I wrote:
"Anyway, just to note that it might not be I/O and we need to find out."

Do you consider this proof that it can only be I/O? Or do we still
need to find out?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Mon, Apr 2, 2012 at 7:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Do you consider this proof that it can only be I/O? Or do we still
> need to find out?

I stuck a bunch more debugging instrumentation into the SLRU code.  It
was fairly clear from the previous round of instrumentation that the
problem was that there was a lot of time passing between when
SlruInternalWritePage acquires shared->buffer_locks[slotno] and when
it releases that lock; I added some additional instrumentation to (a)
confirm this and (b) further break down where the time is getting
spent.  Long story short, when a CLOG-related stall happens,
essentially all the time is being spent in this here section of code:
   /*
    * If not part of Flush, need to fsync now.  We assume this happens
    * infrequently enough that it's not a performance issue.
    */
   if (!fdata)
   {
       if (ctl->do_fsync && pg_fsync(fd))
       {
           slru_errcause = SLRU_FSYNC_FAILED;
           slru_errno = errno;
           close(fd);
           return false;
       }

       if (close(fd))
       {
           slru_errcause = SLRU_CLOSE_FAILED;
           slru_errno = errno;
           return false;
       }
   }

Here's what the debug output looks like:

2012-04-02 15:51:27 UTC [62397] LOG:  SlruPhysicalWritePage(11)
intervals: 0.005000 0.001000 0.013000 0.000000 0.073000 13162.557000
2012-04-02 15:51:27 UTC [62397] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3060 WHERE aid = 6501332;
2012-04-02 15:51:27 UTC [62430] LOG:  lock 104: waited 13162.676 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62430] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3692 WHERE aid = 2516692;
2012-04-02 15:51:27 UTC [62428] LOG:  lock 104: waited 13161.409 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62428] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 3281 WHERE aid = 24527957;
2012-04-02 15:51:27 UTC [62443] LOG:  lock 104: waited 13161.146 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62443] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -360 WHERE aid = 6714054;
2012-04-02 15:51:27 UTC [62436] LOG:  lock 104: waited 12094.996 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62436] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -49 WHERE aid = 4080528;
2012-04-02 15:51:27 UTC [62389] LOG:  lock 104: waited 13160.966 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62389] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -563 WHERE aid = 21896604;
2012-04-02 15:51:27 UTC [62407] LOG:  lock 104: waited 13161.034 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62407] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 1437 WHERE aid = 17185681;
2012-04-02 15:51:27 UTC [62432] LOG:  lock 104: waited 13160.983 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62432] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 4330 WHERE aid = 6289956;
2012-04-02 15:51:27 UTC [62403] LOG:  lock 104: waited 11953.875 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62403] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -4717 WHERE aid = 18829978;
2012-04-02 15:51:27 UTC [62438] LOG:  lock 104: waited 11953.987 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62438] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 1361 WHERE aid = 26274208;
2012-04-02 15:51:27 UTC [62400] LOG:  lock 104: waited 10471.223 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62400] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -2002 WHERE aid = 19209246;
2012-04-02 15:51:27 UTC [62427] LOG:  lock 104: waited 10248.041 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62427] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -874 WHERE aid = 4042895;
2012-04-02 15:51:27 UTC [62419] LOG:  lock 104: waited 13161.085 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62419] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -2874 WHERE aid = 11997038;
2012-04-02 15:51:27 UTC [62394] LOG:  lock 104: waited 10171.179 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62394] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3855 WHERE aid = 12744804;
2012-04-02 15:51:27 UTC [62410] LOG:  lock 104: waited 10247.882 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62410] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 3643 WHERE aid = 16152613;
2012-04-02 15:51:27 UTC [62440] LOG:  lock 104: waited 10169.646 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62440] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 215 WHERE aid = 3276253;
2012-04-02 15:51:27 UTC [62423] LOG:  lock 104: waited 10170.909 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62423] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 2846 WHERE aid = 10337813;
2012-04-02 15:51:27 UTC [62431] LOG:  lock 104: waited 10279.255 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62431] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -419 WHERE aid = 1547227;
2012-04-02 15:51:27 UTC [62397] LOG:  SlruInternalWritePage(11)
intervals: 0.007000 13162.736000 0.000000 0.001000 0.257000
2012-04-02 15:51:27 UTC [62397] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3060 WHERE aid = 6501332;
2012-04-02 15:51:27 UTC [62424] LOG:  lock 104: waited 10169.181 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62424] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 4771 WHERE aid = 6768873;
2012-04-02 15:51:27 UTC [62426] LOG:  lock 104: waited 10169.890 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62426] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 3644 WHERE aid = 13548862;
2012-04-02 15:51:27 UTC [62441] LOG:  lock 104: waited 10078.669 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62441] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -2318 WHERE aid = 3387689;
2012-04-02 15:51:27 UTC [62434] LOG:  lock 104: waited 10472.164 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62434] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 1884 WHERE aid = 11738348;
2012-04-02 15:51:27 UTC [62442] LOG:  lock 104: waited 10472.116 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62442] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 1314 WHERE aid = 1871577;
2012-04-02 15:51:27 UTC [62429] LOG:  lock 104: waited 12094.703 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62429] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 1188 WHERE aid = 21727162;
2012-04-02 15:51:27 UTC [62425] LOG:  lock 104: waited 10471.408 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62425] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -35 WHERE aid = 16735415;
2012-04-02 15:51:27 UTC [62435] LOG:  lock 104: waited 10247.717 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62435] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + 457 WHERE aid = 20513750;
2012-04-02 15:51:27 UTC [62421] LOG:  lock 104: waited 12094.718 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62421] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -1989 WHERE aid = 26993577;
2012-04-02 15:51:27 UTC [62433] LOG:  lock 104: waited 10471.231 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62433] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -4369 WHERE aid = 17652000;
2012-04-02 15:51:27 UTC [62439] LOG:  lock 104: waited 10170.579 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62439] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -2453 WHERE aid = 27871868;
2012-04-02 15:51:27 UTC [62415] LOG:  lock 104: waited 10170.867 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62415] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3110 WHERE aid = 28608274;
2012-04-02 15:51:27 UTC [62422] LOG:  lock 104: waited 10082.338 ms at
slru.c:311 blocked by slru.c:529 spin 2
2012-04-02 15:51:27 UTC [62422] STATEMENT:  UPDATE pgbench_accounts
SET abalance = abalance + -3942 WHERE aid = 12669792;
2012-04-02 15:51:27 UTC [62437] LOG:  lock 104: waited 11955.005 ms at
slru.c:311 blocked by slru.c:529 spin 2

This particular example shows the above chunk of code taking >13s to
execute.  Within 3s, every other backend piles up behind that, leading
to the database getting no work at all done for a good ten seconds.

My guess is that what's happening here is that one backend needs to
read a page into CLOG, so it calls SlruSelectLRUPage to evict the
oldest SLRU page, which is dirty.  For some reason, that I/O takes a
long time.  Then, one by one, other backends come along and also need
to read various SLRU pages, but the oldest SLRU page hasn't changed,
so SlruSelectLRUPage keeps returning the exact same page that it
returned before, and everybody queues up waiting for that I/O, even
though there might be other buffers available that aren't even dirty.

I am thinking that SlruSelectLRUPage() should probably do
SlruRecentlyUsed() on the selected buffer before calling
SlruInternalWritePage, so that the next backend that comes along
looking for a buffer doesn't pick the same one.  Possibly we should go
further and try to avoid replacing dirty buffers in the first place,
but sometimes there may be no choice, so doing SlruRecentlyUsed() is
still a good idea.
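
In code terms the tweak is tiny -- inside SlruSelectLRUPage(), just
before it writes out the dirty victim, something along these lines (a
sketch only; "bestslot" is an assumed variable name, not necessarily
what the function actually uses):

    /*
     * Proposed addition: bump the victim's recency before starting the
     * write, so that the next backend hunting for a slot is steered toward
     * a different buffer instead of queueing up behind this same I/O.
     */
    SlruRecentlyUsed(shared, bestslot);

    /* Existing behavior: write out the dirty victim page. */
    SlruInternalWritePage(ctl, bestslot, NULL);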

I'll do some testing to try to confirm whether this theory is correct
and whether the above fix helps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> This particular example shows the above chunk of code taking >13s
> to execute.  Within 3s, every other backend piles up behind that,
> leading to the database getting no work at all done for a good ten
> seconds.
> 
> My guess is that what's happening here is that one backend needs
> to read a page into CLOG, so it calls SlruSelectLRUPage to evict
> the oldest SLRU page, which is dirty.  For some reason, that I/O
> takes a long time.  Then, one by one, other backends come along
> and also need to read various SLRU pages, but the oldest SLRU page
> hasn't changed, so SlruSelectLRUPage keeps returning the exact
> same page that it returned before, and everybody queues up waiting
> for that I/O, even though there might be other buffers available
> that aren't even dirty.
> 
> I am thinking that SlruSelectLRUPage() should probably do
> SlruRecentlyUsed() on the selected buffer before calling
> SlruInternalWritePage, so that the next backend that comes along
> looking for a buffer doesn't pick the same one.

That, or something else which prevents the same page from being
targeted by all processes, sounds like a good idea.

> Possibly we should go further and try to avoid replacing dirty
> buffers in the first place, but sometimes there may be no choice,
> so doing SlruRecentlyUsed() is still a good idea.

I can't help thinking that the "background hinter" I had ideas about
writing would prevent many of the reads of old CLOG pages, taking a
lot of pressure off of this area.  It just occurred to me that the
difference between that idea and having an autovacuum thread which
just did first-pass work on dirty heap pages is slim to none.

I know how much time good benchmarking can take, so I hesitate to
suggest another permutation, but it might be interesting to see what
it does to the throughput if autovacuum is configured to what would
otherwise be considered insanely aggressive values (just for vacuum,
not analyze).  To give this a fair shot, the whole database would
need to be vacuumed between initial load and the start of the
benchmark.

-Kevin


Re: measuring lwlock-related latency spikes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Long story short, when a CLOG-related stall happens,
> essentially all the time is being spent in this here section of code:

>     /*
>      * If not part of Flush, need to fsync now.  We assume this happens
>      * infrequently enough that it's not a performance issue.
>      */
>     if (!fdata) // fsync and close the file

Seems like basically what you've proven is that this code path *is* a
performance issue, and that we need to think a bit harder about how to
avoid doing the fsync while holding locks.
        regards, tom lane


Re: measuring lwlock-related latency spikes

From
Simon Riggs
Date:
On Mon, Apr 2, 2012 at 8:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>>     /*
>>      * If not part of Flush, need to fsync now.  We assume this happens
>>      * infrequently enough that it's not a performance issue.
>>      */
>>     if (!fdata) // fsync and close the file
>
> Seems like basically what you've proven is that this code path *is* a
> performance issue, and that we need to think a bit harder about how to
> avoid doing the fsync while holding locks.

Agreed, though I think it means the fsync is happening on a filesystem
that causes a full system fsync. That time is not normal.

I suggest we optimise that by moving the dirty block into shared
buffers and leaving it as dirty. That way we don't need to write or
fsync at all and the bgwriter can pick up the pieces. So my earlier
patch to get the bgwriter to clean the clog would be superfluous.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: measuring lwlock-related latency spikes

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> I suggest we optimise that by moving the dirty block into shared
> buffers and leaving it as dirty. That way we don't need to write or
> fsync at all and the bgwriter can pick up the pieces. So my earlier
> patch to get the bgwriter to clean the clog would be superfluous.

[ blink... ]  I think you forgot to mention the massive restructuring
needed to cause clog to become a normal relation that the bgwriter and
shared buffer manager would know what to do with.  This might be a good
long-term approach but it's not going to produce any near-term joy.

I note BTW that many years ago, the transaction log *was* a normal
relation file, and the current clog code descends directly from
realizing that that was a bad idea.  If memory serves, the killer
problem was that a standard relation file doesn't support truncation
from the front; but there may have been other issues as well.
        regards, tom lane


Re: measuring lwlock-related latency spikes

From
Greg Stark
Date:
On Mon, Apr 2, 2012 at 8:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.

I don't know what you mean. It looks like there are two cases where
this code path executes. Either more than 16 clog files are being
flushed by SimpleLruFlush() during a checkpoint, or a dirty page is
being evicted by SlruSelectLRUPage().

I don't know that 16 is so crazy a number of clog files to be touching
between checkpoints any more on a big machine like this. The number of
clog files active concurrently in pgbench should be related to how
quickly xids are being used up and how large the database is -- both
of which are pretty big in these tests.  Perhaps the 16 should have
been raised to 32 when CLOGShmemBuffers was raised to 32.

-- 
greg


Re: measuring lwlock-related latency spikes

From
Jeff Janes
Date:
On Mon, Apr 2, 2012 at 12:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Long story short, when a CLOG-related stall happens,
>> essentially all the time is being spent in this here section of code:
>
>>     /*
>>      * If not part of Flush, need to fsync now.  We assume this happens
>>      * infrequently enough that it's not a performance issue.
>>      */
>>     if (!fdata) // fsync and close the file
>
> Seems like basically what you've proven is that this code path *is* a
> performance issue, and that we need to think a bit harder about how to
> avoid doing the fsync while holding locks.

And why is the fsync needed at all upon merely evicting a dirty page
so a replacement can be loaded?

If the system crashes between the write and the (eventual) fsync, you
are in the same position as if the system crashed while the page was
dirty in shared memory.  Either way, you have to be able to recreate
it from WAL, right?


Cheers,

Jeff


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Apr 2, 2012, at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems like basically what you've proven is that this code path *is* a
> performance issue, and that we need to think a bit harder about how to
> avoid doing the fsync while holding locks.

Hmm, good idea. I wonder if we couldn't just hand off the fsync request
to the background writer, as we do with buffer fsync requests.  AFAICS
we don't need the fsync to happen right away; the next checkpoint cycle
should be soon enough.

...Robert

Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Apr 2, 2012, at 3:16 PM, Simon Riggs <simon@2ndQuadrant.com> wrote:
> Agreed, though I think it means the fsync is happening on a filesystem
> that causes a full system fsync. That time is not normal.

It's ext4, which AFAIK does not have that problem.

> 

...Robert


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Mon, Apr 2, 2012 at 12:58 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> I can't help thinking that the "background hinter" I had ideas about
> writing would prevent many of the reads of old CLOG pages, taking a
> lot of pressure off of this area.  It just occurred to me that the
> difference between that idea and having an autovacuum thread which
> just did first-pass work on dirty heap pages is slim to none.

Yeah.  Marking things all-visible in the background seems possibly
attractive, too.  I think the trick is to figuring out the control
mechanism.  In this case, the workload fits within shared_buffers, so
it's not helpful to think about using buffer eviction as the trigger
for doing these operations, though that might have some legs in
general.  And a simple revolving scan over shared_buffers doesn't
really figure to work out well either, I suspect, because it's too
undirected.  I think what you'd really like to have is a list of
buffers that were modified by transactions which have recently
committed or rolled back.  Given that, your chances of finding useful
work to do are extremely high.  But it's not clear to me how to make
it happen.  You could have backends remember the last few buffers
they've modified and kick that information over to the background
process via some sort of request queue at commit time, but that seems
more like a nasty benchmarking kludge than something that's likely to
solve real-world problems.

> I know how much time good benchmarking can take, so I hesitate to
> suggest another permutation, but it might be interesting to see what
> it does to the throughput if autovacuum is configured to what would
> otherwise be considered insanely aggressive values (just for vacuum,
> not analyze).  To give this a fair shot, the whole database would
> need to be vacuumed between initial load and the start of the
> benchmark.

If you would like to provide a chunk of settings that I can splat into
postgresql.conf, I'm happy to run 'em through a test cycle and see
what pops out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: measuring lwlock-related latency spikes

From
"Kevin Grittner"
Date:
> Robert Haas  wrote:
> Kevin Grittner  wrote:
>  
>> I can't help thinking that the "background hinter" I had ideas
>> about writing would prevent many of the reads of old CLOG pages,
>> taking a lot of pressure off of this area. It just occurred to me
>> that the difference between that idea and having an autovacuum
>> thread which just did first-pass work on dirty heap pages is slim
>> to none.
>
> Yeah. Marking things all-visible in the background seems possibly
> attractive, too. I think the trick is to figuring out the control
> mechanism. In this case, the workload fits within shared_buffers,
> so it's not helpful to think about using buffer eviction as the
> trigger for doing these operations, though that might have some
> legs in general. And a simple revolving scan over shared_buffers
> doesn't really figure to work out well either, I suspect, because
> it's too undirected. I think what you'd really like to have is a
> list of buffers that were modified by transactions which have
> recently committed or rolled back.

Yeah, that's what I was thinking.  Since we only care about dirty
unhinted tuples, we need some fairly efficient way to track those to
make this pay.

> but that seems more like a nasty benchmarking kludge than something
> that's likely to solve real-world problems.

I'm not so sure.  Unfortunately, it may be hard to know without
writing at least a crude form of this to test, but there are several
workloads where hint bit rewrites and/or CLOG contention caused by
the slow tapering of usage of old pages contribute to problems.

>> I know how much time good benchmarking can take, so I hesitate to
>> suggest another permutation, but it might be interesting to see
>> what it does to the throughput if autovacuum is configured to what
>> would otherwise be considered insanely aggressive values (just for
>> vacuum, not analyze). To give this a fair shot, the whole database
>> would need to be vacuumed between initial load and the start of
>> the benchmark.
>
> If you would like to provide a chunk of settings that I can splat
> into postgresql.conf, I'm happy to run 'em through a test cycle and
> see what pops out.

Might as well jump in with both feet:

autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_scale_factor = 0.0

If that smooths the latency peaks and doesn't hurt performance too
much, it's decent evidence that the more refined technique could be a
win.

-Kevin


Re: measuring lwlock-related latency spikes

From
Robert Haas
Date:
On Tue, Apr 3, 2012 at 8:28 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Might as well jump in with both feet:
>
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_scale_factor = 0.0
>
> If that smooths the latency peaks and doesn't hurt performance too
> much, it's decent evidence that the more refined technique could be a
> win.

It seems this isn't good for either throughput or latency.  Here are
latency percentiles for a recent run against master with my usual
settings:

90 1668
91 1747
92 1845
93 1953
94 2064
95 2176
96 2300
97 2461
98 2739
99 3542
100 12955473

And here's how it came out with these settings:

90 1818
91 1904
92 1998
93 2096
94 2200
95 2316
96 2459
97 2660
98 3032
99 3868
100 10842354

tps came out to 13658.330709 (including connections establishing),
vs 14546.644712 on the other run.

I have a (possibly incorrect) feeling that even with these
ridiculously aggressive settings, nearly all of the cleanup work is
getting done by HOT prunes rather than by vacuum, so we're still not
testing what we really want to be testing, but we're doing a lot of
extra work along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company