Thread: Re: [PATCHES] Fix mdsync never-ending loop problem
Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Itagaki, would you like to take a stab at this?

Yes, I'll try to fix the mdsync problem. I'll separate this fix from the LDC patch: if we need to backport the fix to the back branches, a stand-alone patch will be better.

My understanding from the discussion is that we should take the "cycle ID" approach rather than making a copy of pendingOpsTable, because a duplicated table is hard to debug and requires care not to leak memory. I'll adopt the cycle ID approach and build LDC on top of it as a separate patch.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> My understanding from the discussion is that we should take the "cycle ID"
> approach rather than making a copy of pendingOpsTable, because a duplicated
> table is hard to debug and requires care not to leak memory. I'll adopt the
> cycle ID approach and build LDC on top of it as a separate patch.

Heikki made some reasonable arguments against the cycle-ID idea. I'm not intending to insist on it ... I do think there are multiple issues here, and it'd be better to separate the fixes into different patches.

regards, tom lane
I wrote:
> This patch looks fairly sane to me; I have a few small gripes about
> coding style, but those can be fixed while applying. Heikki, you were
> concerned about the cycle-ID idea; do you have any objection to this
> patch?

Actually, on second look, I think the key idea here is Takahiro-san's introduction of a cancellation flag in the hashtable entries, replacing the cases where AbsorbFsyncRequests can try to delete entries. What that means is that mdsync() doesn't need an outer retry loop at all: the periodic AbsorbFsyncRequests calls are no longer a hazard, and retrying FileSync failures can be handled as an inner loop on the single failing table entry. (We can make the failure counter a local variable, too, instead of needing space in every hashtable entry.)

With that change, it's no longer possible for an incoming stream of fsync requests to keep mdsync from terminating. It might fsync more than it really needs to, but it won't repeat itself, and it must reach the end of the hashtable eventually. So we don't actually need the cycle counter at all.

It might be worth having the cycle counter anyway, just to avoid doing "useless" fsync work; I'm not sure about this. If we have a cycle counter of, say, 32 bits, then it's theoretically possible for an fsync to fail 2^32 consecutive times and then be skipped on the next try, allowing a checkpoint to succeed that should not have. We could fix that with a few more lines of logic to detect a wrapped-around value, but is it worth the trouble?

regards, tom lane
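The structure described above can be illustrated with a minimal, self-contained C sketch. All names here (`PendingEntry`, `fake_fsync`, `absorb_requests`, the constants) are hypothetical stand-ins for the real md.c machinery, which works on a dynahash table keyed by relation segment; the point is only the control flow: cancel flags make absorbed forget-requests delete-free, so a single pass suffices, and fsync retries become an inner loop with a local failure counter.

```c
#include <assert.h>
#include <stdbool.h>

#define FSYNCS_PER_ABSORB 10    /* hypothetical absorb interval */
#define MAX_FSYNC_RETRIES 3     /* hypothetical give-up threshold */

typedef struct PendingEntry
{
    int  fileno;    /* stand-in for the relation/segment identifier */
    bool canceled;  /* set by absorbed forget-requests; skipped, not deleted */
} PendingEntry;

/* Test harness stand-in for FileSync(): fails fail_budget times, then succeeds. */
static int fail_budget;
static int fsync_calls;

static bool fake_fsync(int fileno)
{
    (void) fileno;
    fsync_calls++;
    if (fail_budget > 0)
    {
        fail_budget--;
        return false;           /* simulated EIO */
    }
    return true;
}

static void absorb_requests(PendingEntry *tab, int n)
{
    /* With cancel flags, absorbed forget-requests only mark entries;
     * they never delete them, so the caller's scan stays valid. */
    (void) tab;
    (void) n;
}

/* One pass over the pending table: no outer retry loop is needed.
 * Retries of a failing fsync are an inner loop on that one entry,
 * with the failure counter held in a local variable. */
static bool mdsync_sketch(PendingEntry *tab, int n)
{
    int absorb_counter = FSYNCS_PER_ABSORB;

    for (int i = 0; i < n; i++)
    {
        if (tab[i].canceled)
            continue;           /* dropped relation: skip it */

        int failures = 0;       /* local, not stored per entry */

        while (!fake_fsync(tab[i].fileno))
        {
            if (++failures >= MAX_FSYNC_RETRIES)
                return false;   /* give up: the checkpoint fails */
        }

        if (--absorb_counter <= 0)
        {
            absorb_requests(tab, n);    /* safe: cannot delete entries */
            absorb_counter = FSYNCS_PER_ABSORB;
        }
    }
    return true;                /* end of table always reached */
}
```

Because the scan index only moves forward and absorbed requests can only mark (never remove) entries, the loop provably terminates in at most n iterations regardless of the incoming request stream, which is the termination argument made above.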
(Sorry if you receive duplicate messages; I am resending this since it was not delivered after a day.)

Here is another patch to fix the never-ending loop in mdsync. I introduced an mdsync counter (cycle ID) and cancel flags to fix the problem.

The mdsync counter is incremented at the very beginning of mdsync(). Each pending entry has a field assigned from the counter when it is newly inserted into pendingOpsTable. Only entries whose counter values are smaller than the current mdsync counter are fsync-ed in mdsync().

The other change is a cancel flag in each pending entry. When a relation is dropped and the bgwriter receives a forget-request, the corresponding entry is marked as dropped, but we don't delete it at that time; the actual deletion is performed in the next fsync loop. We don't have to restart after AbsorbFsyncRequests() because entries are never removed outside of the sequential scan.

This patch can be applied to HEAD, 8.2, and 8.1 with a few hunks.

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > My understanding from the discussion is that we should take the "cycle ID"
> > approach rather than making a copy of pendingOpsTable, because a duplicated
> > table is hard to debug and requires care not to leak memory. I'll adopt the
> > cycle ID approach and build LDC on it as a separate patch.
>
> Heikki made some reasonable arguments against the cycle-ID idea. I'm
> not intending to insist on it ...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
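The cycle-ID and deferred-deletion mechanism described in this message can be sketched in a few dozen lines of self-contained C. This is not the patch itself: the names (`remember_fsync`, `forget_fsync`, `mdsync_sketch`, the fixed-size array standing in for the pendingOpsTable hash) are hypothetical, and the real code works on hashtable entries keyed by relation segment. The sketch only demonstrates the two rules: entries are stamped with the cycle ID at insertion and only older entries are fsynced, and forget-requests mark entries canceled while deletion is deferred to the next scan.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 16          /* toy stand-in for pendingOpsTable */

typedef struct
{
    int  fileno;    /* stand-in for the relation/segment identifier */
    int  cycle_id;  /* mdsync cycle in which the request was queued */
    bool canceled;  /* set by forget-requests; removed at next scan */
    bool in_use;
} PendingEntry;

static PendingEntry pending[MAX_PENDING];
static int mdsync_cycle = 0;
static int fsyncs_done;         /* counts stand-in FileSync() calls */

/* Queue an fsync request, stamping it with the current cycle ID. */
static void remember_fsync(int fileno)
{
    for (int i = 0; i < MAX_PENDING; i++)
        if (!pending[i].in_use)
        {
            pending[i] = (PendingEntry){fileno, mdsync_cycle, false, true};
            return;
        }
}

/* Forget-request for a dropped relation: mark only; deletion is
 * deferred so a scan in progress never loses entries under it. */
static void forget_fsync(int fileno)
{
    for (int i = 0; i < MAX_PENDING; i++)
        if (pending[i].in_use && pending[i].fileno == fileno)
            pending[i].canceled = true;
}

static void mdsync_sketch(void)
{
    mdsync_cycle++;             /* incremented at the very beginning */

    for (int i = 0; i < MAX_PENDING; i++)
    {
        if (!pending[i].in_use)
            continue;
        if (pending[i].canceled)
        {
            pending[i].in_use = false;  /* deferred deletion happens here */
            continue;
        }
        /* Requests absorbed during this cycle carry the current cycle
         * ID and are left for the next mdsync(), so an incoming stream
         * of requests cannot keep this loop from terminating. */
        if (pending[i].cycle_id >= mdsync_cycle)
            continue;
        fsyncs_done++;                  /* stand-in for FileSync() */
        pending[i].in_use = false;
    }
}
```

Usage: after `remember_fsync(1); remember_fsync(2); forget_fsync(2);` a call to `mdsync_sketch()` fsyncs entry 1, silently discards the canceled entry 2, and would skip any entry queued with the new cycle ID until the following cycle.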
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Here is another patch to fix the never-ending loop in mdsync. I introduced
> an mdsync counter (cycle ID) and cancel flags to fix the problem.
>
> The mdsync counter is incremented at the very beginning of mdsync().
> Each pending entry has a field assigned from the counter when it is
> newly inserted into pendingOpsTable. Only entries whose counter values
> are smaller than the current mdsync counter are fsync-ed in mdsync().
>
> The other change is a cancel flag in each pending entry. When a relation
> is dropped and the bgwriter receives a forget-request, the corresponding
> entry is marked as dropped, but we don't delete it at that time; the
> actual deletion is performed in the next fsync loop. We don't have to
> restart after AbsorbFsyncRequests() because entries are never removed
> outside of the sequential scan.

This patch looks fairly sane to me; I have a few small gripes about coding style, but those can be fixed while applying. Heikki, you were concerned about the cycle-ID idea; do you have any objection to this patch?

> This patch can be applied to HEAD, 8.2, and 8.1 with a few hunks.

I don't think we should back-patch something that's a performance fix for an extreme case, especially when it hasn't been through any extensive testing yet ...

regards, tom lane