Thread: ExclusiveLock

ExclusiveLock

From
Simon Riggs
Date:
Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held
by transactions, sometimes waiting to be granted.

On Sat, Nov 06, 2004 at 11:40:49AM +0000, Simon Riggs wrote:
> > The lockstats just show there's all those Exclusive Locks on
> > order_line, right?:
> >
> >     http://www.osdl.org/projects/dbt2dev/results/dev4-010/191/db/lockstats.out
> 
> The output is...
>     relname    |  pid  |       mode       | granted 
> ---------------+-------+------------------+---------
>  new_order     | 21735 | AccessShareLock  | t
>  new_order     | 21735 | RowExclusiveLock | t
>  orders        | 21715 | AccessShareLock  | t
>  orders        | 21715 | RowExclusiveLock | t
>  pg_class      | 23254 | AccessShareLock  | t
>  order_line    | 21715 | AccessShareLock  | t
>  order_line    | 21715 | RowExclusiveLock | t
>  order_line    | 21735 | ExclusiveLock    | f
>  new_order     | 21715 | AccessShareLock  | t
> ...
> 
> which shows a non-granted lock, waiting for a Table-level ExclusiveLock
> on order_line. This is unexpected (by me, that is...)

According to the manual, ExclusiveLock is not normally held by SQL
statements. There are no LOCK TABLE statements in DBT-2.

My digging reveals that ExclusiveLock is taken on user relations by
_bt_getbuf() when we extend a btree relation by one page.

I also find ExclusiveLock is held by
- LISTEN/NOTIFY
- XactLockTableInsert()/XactLockTableDelete()
but those don't look like they lock user relations
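
The pattern in question looks roughly like this (paraphrased from
memory, not the exact nbtpage.c source):

    /*
     * Sketch of the extension path in _bt_getbuf() -- a paraphrase,
     * not the exact source.
     */
    if (blkno == P_NEW)
    {
        /*
         * Serialize relation extension: take an ExclusiveLock on page
         * zero of the index so that two backends don't both try to
         * initialize the same new page.
         */
        if (!RELATION_IS_LOCAL(rel))
            LockPage(rel, 0, ExclusiveLock);

        buf = ReadBuffer(rel, P_NEW);       /* extends by one page */
        LockBuffer(buf, BT_WRITE);

        if (!RELATION_IS_LOCAL(rel))
            UnlockPage(rel, 0, ExclusiveLock);
    }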

The comments in LockAcquire() say its locks show up in the lock tables,
so is index extension the source of the ExclusiveLocks shown in the lock
output? Presumably they would be short duration, so you wouldn't see
them unless you caught them at just the right moment... unless we start
to queue up on the leading edge of the index.

I expect index extension to be a source of contention anyway, but are we
actually *seeing* it? Or is it another issue, and is this an 8.0
problem?

-- 
Best Regards, Simon Riggs



Re: ExclusiveLock

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held
> by transactions, sometimes waiting to be granted.

I think you are right that these reflect heap or btree-index extension
operations.  Those do not actually take locks on the *table* however,
but locks on a single page within it (which are completely orthogonal to
table locks and don't conflict).  The pg_locks output leaves something
to be desired, because you can't tell the difference between table and
page locks.
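
To illustrate (a from-memory paraphrase of lmgr.c, not the exact
source): both kinds of lock go into the same lock table, and the only
difference is the block-number field of the lock tag, which the
pg_locks view does not display.

    /* Paraphrase of the lock-tag setup -- not the exact source. */
    LOCKTAG     tag;

    /* LockRelation(): the tag identifies the whole table */
    tag.relId = RelationGetRelid(relation);
    tag.dbId = MyDatabaseId;
    tag.objId.blkno = InvalidBlockNumber;

    /* LockPage(): the tag identifies a single page within the table */
    tag.relId = RelationGetRelid(relation);
    tag.dbId = MyDatabaseId;
    tag.objId.blkno = blkno;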

It's odd that your example does not appear to show someone else holding
a conflicting lock.
        regards, tom lane


Re: ExclusiveLock

From
Simon Riggs
Date:
On Mon, 2004-11-08 at 21:37, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held
> > by transactions, sometimes waiting to be granted.
> 
> I think you are right that these reflect heap or btree-index extension
> operations.  Those do not actually take locks on the *table* however,
> but locks on a single page within it (which are completely orthogonal to
> table locks and don't conflict).  The pg_locks output leaves something
> to be desired, because you can't tell the difference between table and
> page locks.

Good. Thought it was worth discussion...

> It's odd that your example does not appear to show someone else holding
> a conflicting lock.

There is... I didn't copy the whole lock table output. Here it is:
    relname    |  pid  |       mode       | granted 
---------------+-------+------------------+---------
 new_order     | 21735 | AccessShareLock  | t
 new_order     | 21735 | RowExclusiveLock | t
 orders        | 21715 | AccessShareLock  | t
 orders        | 21715 | RowExclusiveLock | t
 pg_class      | 23254 | AccessShareLock  | t
 order_line    | 21715 | AccessShareLock  | t
 order_line    | 21715 | RowExclusiveLock | t
 order_line    | 21735 | ExclusiveLock    | f
 new_order     | 21715 | AccessShareLock  | t
 new_order     | 21715 | RowExclusiveLock | t
 customer      | 21715 | AccessShareLock  | t
 pk_order_line | 21735 | AccessShareLock  | t
 pk_order_line | 21735 | RowExclusiveLock | t
 item          | 21715 | AccessShareLock  | t
 orders        | 21735 | AccessShareLock  | t
 orders        | 21735 | RowExclusiveLock | t
 order_line    | 21735 | AccessShareLock  | t
 order_line    | 21735 | RowExclusiveLock | t
 stock         | 21715 | AccessShareLock  | t
 stock         | 21715 | RowExclusiveLock | t
 order_line    | 21715 | ExclusiveLock    | t
 pk_order_line | 21715 | RowExclusiveLock | t
 pg_locks      | 23254 | AccessShareLock  | t
 district      | 21715 | AccessShareLock  | t
 district      | 21715 | RowShareLock     | t
 district      | 21715 | RowExclusiveLock | t
 warehouse     | 21715 | AccessShareLock  | t
 customer      | 21735 | AccessShareLock  | t
 customer      | 21735 | RowExclusiveLock | t
(29 rows)


Pids 21715 and 21735 are conflicting.

There's also another example where the lock table output is > 1400 rows,
with two lock requests pending.

The oprofile for this run looks like this (though of course it is not a
snapshot at a point in time, like the lock list):

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name                 symbol name
170746   42.7220  vmlinux-2.6.8.1-osdl2    ia64_pal_call_static
18934     4.7374  libc-2.3.2.so            (no symbols)
10691     2.6750  postgres                 FunctionCall2
9814      2.4555  postgres                 hash_seq_search
8654      2.1653  postgres                 SearchCatCache
7389      1.8488  postgres                 AllocSetAlloc
6122      1.5318  postgres                 hash_search
5707      1.4279  postgres                 OpernameGetCandidates
4901      1.2263  postgres                 StrategyDirtyBufferList
4627      1.1577  postgres                 XLogInsert
4424      1.1069  postgres                 pglz_decompress
4371      1.0937  vmlinux-2.6.8.1-osdl2    __copy_user
3796      0.9498  vmlinux-2.6.8.1-osdl2    finish_task_switch
3483      0.8715  postgres                 LWLockAcquire
3458      0.8652  postgres                 eqjoinsel
3001      0.7509  vmlinux-2.6.8.1-osdl2    get_exec_dcookie
2824      0.7066  postgres                 AtEOXact_CatCache
2745      0.6868  postgres                 _bt_compare
2730      0.6831  postgres                 nocachegetattr
2715      0.6793  postgres                 SearchCatCacheList
2659      0.6653  postgres                 MemoryContextAllocZeroAligned
2604      0.6515  postgres                 yyparse
2553      0.6388  postgres                 eqsel
2127      0.5322  postgres                 deconstruct_array
1921      0.4806  postgres                 hash_any
1919      0.4801  postgres                 int4eq
1855      0.4641  postgres                 LWLockRelease
1839      0.4601  postgres                 StrategyBufferLookup
1777      0.4446  postgres                 GetSnapshotData
1729      0.4326  postgres                 heap_getsysattr
1595      0.3991  postgres                 DLMoveToFront
1586      0.3968  postgres                 MemoryContextAlloc
1485      0.3716  vmlinux-2.6.8.1-osdl2    try_atomic_semop
1455      0.3641  postgres                 anonymous symbol from section .plt
1409      0.3525  postgres                 lappend
1352      0.3383  postgres                 heap_release_fetch
1270      0.3178  postgres                 PinBuffer
1141      0.2855  postgres                 DirectFunctionCall1
1132      0.2832  postgres                 base_yylex
982       0.2457  postgres                 pgstat_initstats
957       0.2394  vmlinux-2.6.8.1-osdl2    __make_request
926       0.2317  postgres                 AllocSetFree
892       0.2232  vmlinux-2.6.8.1-osdl2    try_to_wake_up
874       0.2187  postgres                 _bt_checkkeys
870       0.2177  postgres                 fmgr_isbuiltin
853       0.2134  postgres                 ReadBufferInternal
852       0.2132  postgres                 pfree
850       0.2127  postgres                 _bt_moveright
848       0.2122  vmlinux-2.6.8.1-osdl2    do_cciss_request
766       0.1917  postgres                 ExecTargetList
734       0.1837  postgres                 SearchSysCache
730       0.1827  postgres                 PGSemaphoreLock
706       0.1766  postgres                 expression_tree_walker
684       0.1711  postgres                 ExecEvalVar
674       0.1686  postgres                 StrategyGetBuffer
669       0.1674  postgres                 ResourceOwnerForgetCatCacheRef
660       0.1651  postgres                 lcons
614       0.1536  vmlinux-2.6.8.1-osdl2    find_get_page
586       0.1466  postgres                 _bt_restscan
582       0.1456  postgres                 MemoryContextAllocZero
551       0.1379  postgres                 LockRelease
551       0.1379  postgres                 heap_formtuple
540       0.1351  postgres                 OidFunctionCall3
537       0.1344  postgres                 check_stack_depth
527       0.1319  postgres                 ExecutePlan
521       0.1304  postgres                 CatalogCacheComputeHashValue
510       0.1276  postgres                 buildRelationAliases
508       0.1271  vmlinux-2.6.8.1-osdl2    find_get_pages_tag
504       0.1261  postgres                 btgettuple
499       0.1249  postgres                 IndexNext
454       0.1136  postgres                 ExecInitExpr
453       0.1133  postgres                 ExecProcNode
447       0.1118  postgres                 LockAcquire

I note that an important one has dropped down the list:
1        2.5e-04  postgres                 AtEOXact_Buffers

and this one no longer appears at all: UnlockBuffers

StrategyDirtyBufferList is too high, so we can change that.

As a follow-on: we've got freelists for reuse of space. Do freelists
work for index/heap extension also, or does everybody read the same info
to get the next block... i.e. are we space-conservative, rather than
emphasising concurrency? It would be good to have one freelist per
CPU...

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Josh Berkus
Date:
Tom,

> I think you are right that these reflect heap or btree-index extension
> operations.  Those do not actually take locks on the *table* however,
> but locks on a single page within it (which are completely orthogonal to
> table locks and don't conflict).  The pg_locks output leaves something
> to be desired, because you can't tell the difference between table and
> page locks.

Aside from foreign keys, though, is there any way in which INSERT page
locks could block other inserts? I have another system (Lyris) where
that appears to be happening with 32 concurrent INSERT streams. It's
possible that the problem is somewhere else, but I'm disturbed by the
possibility.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Aside from foreign keys, though, is there any way in which INSERT page locks 
> could block other inserts?

Not for longer than the time needed to physically add a tuple to a page.
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Thu, 2004-11-18 at 22:12, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Aside from foreign keys, though, is there any way in which INSERT page locks 
> > could block other inserts?
> 
> Not for longer than the time needed to physically add a tuple to a page.

The main problem on INSERTs is that it is usually the same few pages:
the lead data block and the lead index block. There are ways of
spreading the load out across an index; I'm not sure what happens on
the leading edge of the data relation, but I think it hits the same
block each time.

Only an issue if you have more than one CPU...

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Josh Berkus
Date:
Simon, Tom,

> The main problem on INSERTs is that it is usually the same few pages:
> the lead data block and the lead index block. There are ways of
> spreading the load out across an index; I'm not sure what happens on
> the leading edge of the data relation, but I think it hits the same
> block each time.

I actually have several test cases for this, can you give me a trace or 
profile suggestion that would show if this is happening?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The main problem on INSERTs is that it is usually the same few pages:
> the lead data block and the lead index block. There are ways of
> spreading the load out across an index; I'm not sure what happens on
> the leading edge of the data relation, but I think it hits the same
> block each time.

FSM does what it can to spread the insertion load across multiple pages,
but of course this is not going to help much unless your table has lots
of embedded free space.  I think it would work pretty well on a table
with lots of update turnover, but not on an INSERT-only workload.
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> The main problem on INSERTs is that it is usually the same few pages:
>> the lead data block and the lead index block. There are ways of
>> spreading the load out across an index; I'm not sure what happens on
>> the leading edge of the data relation, but I think it hits the same
>> block each time.

> I actually have several test cases for this, can you give me a trace or 
> profile suggestion that would show if this is happening?

If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
would be the places showing contention delays.

It could also be that the contention is for the WALInsertLock, ie, the
right to stuff a WAL record into the shared buffers.  This effect would
be the same even if you were inserting into N separate tables.
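
Roughly, the locks on the INSERT path stack up like this (a from-memory
map, not exact source):

    /*
     * heap_insert()
     *   RelationGetBufferForTuple()
     *     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);   <- per-page lock
     *     ... and, only if no existing page has room:
     *     LockPage(relation, 0, ExclusiveLock);        <- extension lock
     *   XLogInsert(...)
     *     LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);  <- single WAL lock
     */
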
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Thu, 2004-11-18 at 22:51, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The main problem on INSERTs is that it is usually the same few pages:
> > the lead data block and the lead index block. There are ways of
> > spreading the load out across an index; I'm not sure what happens on
> > the leading edge of the data relation, but I think it hits the same
> > block each time.
> 
> FSM does what it can to spread the insertion load across multiple pages,
> but of course this is not going to help much unless your table has lots
> of embedded free space.  I think it would work pretty well on a table
> with lots of update turnover, but not on an INSERT-only workload.

OK, that's what I thought.

So for a table with an INSERT-only workload, the FSM is always empty,
so there is only ever one block that gets locked. That means we can't
ever go faster than one CPU can go - any other CPUs will just wait for
the block lock. [In Josh's case, 32 INSERT streams won't go
significantly faster than about 4 streams, allowing for some overlap of
other operations.]

Would it be possible to do the following: when a new block is allocated
from the relation file (rather than reused), we check the FSM; if it is
empty, we allocate 8 new blocks and add them all to the FSM. The next
few INSERTers will then use the FSM blocks normally.
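
As a sketch of what I mean (invented code, not a patch -- the
FSM-recording call is a stand-in name, and the new pages would need
initializing before use):

    #define EXTEND_BLOCKS   8

    Buffer      buffer;
    int         i;

    /* take the extension lock once, then add several blocks */
    LockPage(relation, 0, ExclusiveLock);
    for (i = 0; i < EXTEND_BLOCKS - 1; i++)
    {
        Buffer      extra = ReadBuffer(relation, P_NEW);

        /* advertise the spare block so other INSERTers find it */
        RecordNewPageFreeSpace(relation, BufferGetBlockNumber(extra));
        ReleaseBuffer(extra);
    }
    buffer = ReadBuffer(relation, P_NEW);   /* the block we keep */
    UnlockPage(relation, 0, ExclusiveLock);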

Doing that will definitely speed up DBT-2 and many other workloads. Many
tables have SERIAL defined, or use a monotonically increasing unique
key.

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Would it be possible to do the following: when a new block is allocated
> from the relation file (rather than reused), we check the FSM; if it is
> empty, we allocate 8 new blocks and add them all to the FSM. The next
> few INSERTers will then use the FSM blocks normally.

Most likely that would just shift the contention to the WALInsertLock.
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Thu, 2004-11-18 at 23:19, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Would it be possible to do the following: when a new block is allocated
> > from the relation file (rather than reused), we check the FSM; if it is
> > empty, we allocate 8 new blocks and add them all to the FSM. The next
> > few INSERTers will then use the FSM blocks normally.
> 
> Most likely that would just shift the contention to the WALInsertLock.

Well, removing any performance bottleneck shifts the bottleneck to
another place, though that is not an argument against removing it.

Can we subdivide the WALInsertLock so there are multiple entry points to
wal_buffers, based upon hashing the xid? That would allow WAL to be
written sequentially by each transaction, though slightly out of order
for different transactions. Commit/Abort would all go through the same
lock to guarantee serializability.
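
Something like this, purely as a sketch of the idea (invented names, not
working code):

    #define NUM_WAL_INSERT_LOCKS    8

    static LWLockId WALInsertLocks[NUM_WAL_INSERT_LOCKS];

    /*
     * Each transaction appends WAL through the lock picked by its xid,
     * so its own records stay in order while different transactions
     * need not serialize on a single lock.  Commit/abort records would
     * still all go through one designated lock.
     */
    static LWLockId
    WALInsertLockFor(TransactionId xid)
    {
        return WALInsertLocks[xid % NUM_WAL_INSERT_LOCKS];
    }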

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Can we subdivide the WALInsertLock so there are multiple entry points to
> wal_buffers, based upon hashing the xid?

I don't think so; WAL is inherently a linear log.  (Awhile ago there was
some talk of nonlinear log writing to get around the one-commit-per-
disk-revolution syndrome, but the idea basically got rejected as
unworkably complicated.)  What's more, there are a lot of entries that
must remain time-ordered independently of transaction ownership.
Consider btree index page splits and sequence nextvals for two examples.

Certainly I'd not buy into any such project without incontrovertible
proof that it would solve a major bottleneck --- and right now we are
only speculating with no evidence.
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> >> The main problem on INSERTs is that it is usually the same few pages:
> >> the lead data block and the lead index block. There are ways of
> >> spreading the load out across an index; I'm not sure what happens on
> >> the leading edge of the data relation, but I think it hits the same
> >> block each time.
> 
> > I actually have several test cases for this, can you give me a trace or 
> > profile suggestion that would show if this is happening?
> 
> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> would be the places showing contention delays.

You say this as if we can easily check that. My understanding is that
this would require a scripted gdb session to instrument the executable
at that point.

Is that what you mean? That isn't typically regarded as a great thing to
do on a production system. 

You've mentioned performance speculation, which I agree with, but what
are the alternatives? Compile-time changes usually can't be enabled,
since many people work from RPMs.

> It could also be that the contention is for the WALInsertLock, ie, the
> right to stuff a WAL record into the shared buffers.  This effect would
> be the same even if you were inserting into N separate tables.

...and how do we check that, too?

Are we back to simulated workloads and fully rigged executables?

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
>> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
>> would be the places showing contention delays.

> You say this as if we can easily check that.

I think this can be done with oprofile ...
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Sat, 2004-11-20 at 16:14, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> >> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> >> would be the places showing contention delays.
> 
> > You say this as if we can easily check that.
> 
> I think this can be done with oprofile ...

OK, well that's where this thread started.

oprofile only tells us aggregate information. It doesn't tell us how
much time is spent waiting because of contention issues; it just tells
us how much time is spent, and even that is skewed.

There really ought to be a better way to instrument things from inside,
based upon knowledge of the code.

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Thu, 2004-11-18 at 23:54, Tom Lane wrote:
> I don't think so; WAL is inherently a linear log.  (Awhile ago there was
> some talk of nonlinear log writing to get around the one-commit-per-
> disk-revolution syndrome, but the idea basically got rejected as
> unworkably complicated.)  

...this appears to still be on the TODO list... should it be removed?

- Find a way to reduce rotational delay when repeatedly writing last WAL
page 

Currently fsync of WAL requires the disk platter to perform a full
rotation to fsync again. One idea is to write the WAL to different
offsets that might reduce the rotational delay. 

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> - Find a way to reduce rotational delay when repeatedly writing last WAL
> page 
> 
> Currently fsync of WAL requires the disk platter to perform a full
> rotation to fsync again. One idea is to write the WAL to different
> offsets that might reduce the rotational delay. 

Once upon a time when you formatted hard drives you actually gave them an
interleave factor for a similar reason. These days you invariably use an
interleave of 1, ie, store the blocks continuously. Whether that's because
controllers have become fast enough to keep up with the burst rate or because
the firmware is smart enough to handle the block interleaving invisibly isn't
clear to me.

I wonder if formatting the drive to have an interleave >1 would actually
improve performance of the WAL log. 

It would depend a lot on the usage pattern though. A heavily used system might
be able to generate enough WAL traffic to keep up with the burst rate of the
drive. And a less used system might benefit but might lose.

Probably now the less-than-saturated system gets close to the average
half-rotation-time latency. This idea would only really help if you have a
system that happens to be triggering pessimal results worse than that due to
unfortunate timing.

-- 
greg



Re: [Testperf-general] Re: ExclusiveLock

From
Simon Riggs
Date:
On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> 
> > - Find a way to reduce rotational delay when repeatedly writing last WAL
> > page 
> > 
> > Currently fsync of WAL requires the disk platter to perform a full
> > rotation to fsync again. One idea is to write the WAL to different
> > offsets that might reduce the rotational delay. 
> 
> Once upon a time when you formatted hard drives you actually gave them an
> interleave factor for a similar reason. These days you invariably use an
> interleave of 1, ie, store the blocks continuously. Whether that's because
> controllers have become fast enough to keep up with the burst rate or because
> the firmware is smart enough to handle the block interleaving invisibly isn't
> clear to me.
> 
> I wonder if formatting the drive to have an interleave >1 would actually
> improve performance of the WAL log. 
> 
> > It would depend a lot on the usage pattern though. A heavily used system might
> > be able to generate enough WAL traffic to keep up with the burst rate of the
> > drive. And a less used system might benefit but might lose.
> > 
> > Probably now the less-than-saturated system gets close to the average
> > half-rotation-time latency. This idea would only really help if you have a
> > system that happens to be triggering pessimal results worse than that due to
> > unfortunate timing.

I was asking whether that topic should be removed, since Tom had said it
had been rejected....

If you could tell me how to instrument the system to (better) show
whether such plans as you suggest are workable, I would be greatly
interested. Anything we do needs to be able to be monitored for
success/failure.

-- 
Best Regards, Simon Riggs



Re: [Testperf-general] Re: ExclusiveLock

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > 
> > > - Find a way to reduce rotational delay when repeatedly writing last WAL
> > > page 
> > > 
> > > Currently fsync of WAL requires the disk platter to perform a full
> > > rotation to fsync again. One idea is to write the WAL to different
> > > offsets that might reduce the rotational delay. 
> > 
> > Once upon a time when you formatted hard drives you actually gave them an
> > interleave factor for a similar reason. These days you invariably use an
> > interleave of 1, ie, store the blocks continuously. Whether that's because
> > controllers have become fast enough to keep up with the burst rate or because
> > the firmware is smart enough to handle the block interleaving invisibly isn't
> > clear to me.
> > 
> > I wonder if formatting the drive to have an interleave >1 would actually
> > improve performance of the WAL log. 
> > 
> > It would depend a lot on the usage pattern though. A heavily used system might
> > be able to generate enough WAL traffic to keep up with the burst rate of the
> > drive. And a less used system might benefit but might lose.
> > 
> > Probably now the less-than-saturated system gets close to the average
> > half-rotation-time latency. This idea would only really help if you have a
> > system that happens to be triggering pessimal results worse than that due to
> > unfortunate timing.
> 
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected....

The method used to fix it was rejected, but the goal of making it better
is still a valid one.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [Testperf-general] Re: ExclusiveLock

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Once upon a time when you formatted hard drives you actually gave them an
> interleave factor for a similar reason. These days you invariably use an
> interleave of 1, ie, store the blocks continuously. Whether that's because
> controllers have become fast enough to keep up with the burst rate or because
> the firmware is smart enough to handle the block interleaving invisibly isn't
> clear to me.

The impression I had was that disk drives no longer pay the slightest
attention to interleave specs, because the logical model implied by the
concept is too far removed from modern reality (on-disk buffering,
variable numbers of sectors per track, transparently remapped bad
sectors, yadda yadda).

And that's just at the hardware level ... who knows where the filesystem
is putting your data, or what the kernel I/O scheduler is doing with
your requests :-(

Basically I see the TODO item as a blue-sky research topic, not
something we have any idea how to implement.  That doesn't mean it can't
be on the TODO list ...
        regards, tom lane


Re: [Testperf-general] Re: ExclusiveLock

From
Kenneth Marshall
Date:
On Tue, Nov 23, 2004 at 12:04:17AM +0000, Simon Riggs wrote:
> On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > 
> > > - Find a way to reduce rotational delay when repeatedly writing last WAL
> > > page 
> > > 
> > > Currently fsync of WAL requires the disk platter to perform a full
> > > rotation to fsync again. One idea is to write the WAL to different
> > > offsets that might reduce the rotational delay. 
> > 
> > Once upon a time when you formatted hard drives you actually gave them an
> > interleave factor for a similar reason. These days you invariably use an
> > interleave of 1, ie, store the blocks continuously. Whether that's because
> > controllers have become fast enough to keep up with the burst rate or because
> > the firmware is smart enough to handle the block interleaving invisibly isn't
> > clear to me.
> > 
> > I wonder if formatting the drive to have an interleave >1 would actually
> > improve performance of the WAL log. 
> > 
> > It would depend a lot on the usage pattern though. A heavily used system might
> > be able to generate enough WAL traffic to keep up with the burst rate of the
> > drive. And a less used system might benefit but might lose.
> > 
> > Probably now the less-than-saturated system gets close to the average
> > half-rotation-time latency. This idea would only really help if you have a
> > system that happens to be triggering pessimal results worse than that due to
> > unfortunate timing.
> 
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected....
> 
> If you could tell me how to instrument the system to (better) show
> whether such plans as you suggest are workable, I would be greatly
> interested. Anything we do needs to be able to be monitored for
> success/failure.
> 
> -- 
> Best Regards, Simon Riggs
> 

The disk performance has increased so much that the reasons for having
an interleave factor other than 1 (no interleaving) have all but
disappeared. CPU speed has also increased so much relative to disk speed
that using some CPU cycles to improve I/O is a reasonable approach. I
have been considering how this might be accomplished. As Simon so aptly
pointed out, we need to show that it materially affects the performance
or it is not worth doing.

The simplest idea I had was to lay out the WAL logs ahead of time in a
contiguous fashion on the disk. Solaris has this ability given
appropriate FS parameters, and we should be able to get close on most
other OSes. Once that has happened, use something like the FSM map to
track the allocated blocks. The CPU can keep track of its current disk
rotational position (approximate is okay); then, when we need to write a
WAL block, start writing at the next area that the disk head will be
sweeping. Give it a little leeway for latency in the system and we
should be able to get very low latency for the writes. Obviously there
would be wasted space, but you could intersperse writes at whatever
granularity of space overhead you would like to see. As far as
implementation goes, I was reading an interesting article that used a
simple theoretical model to estimate disk head position to avoid
latency.
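
As a toy illustration of that model (the constants and names here are my
own assumptions, not from the article):

    #include <stdint.h>

    #define USEC_PER_REV    8333    /* one revolution at 7200 rpm */

    /*
     * Estimated fraction of a revolution the platter has turned at a
     * given microsecond timestamp; the next write would be aimed just
     * ahead of this position, plus some leeway for request latency.
     */
    static double
    head_position(uint64_t now_usec)
    {
        return (double) (now_usec % USEC_PER_REV) / USEC_PER_REV;
    }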

Yours truly,
Ken Marshall