Thread: Fix overflow of bgwriter's request queue

Fix overflow of bgwriter's request queue

From
ITAGAKI Takahiro
Date:
Attached is a patch that fixes overflow of bgwriter's file-fsync request queue.

It happened under heavy update workloads and performance decreased.
I have sent the details to HACKERS.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


Re: Fix overflow of bgwriter's request queue

From
"Qingqing Zhou"
Date:
"ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote
>
> Attached is a patch that fixes overflow of bgwriter's file-fsync request
> queue.
>

   while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
   {
+   if (i >= count)
+    elog(ERROR, "pendingOpsTable corrupted");
+
+   memcpy(&entries[i++], entry, sizeof(PendingOperationEntry));
+
+   if (hash_search(pendingOpsTable, entry,
+       HASH_REMOVE, NULL) == NULL)
+    elog(ERROR, "pendingOpsTable corrupted");
+  }

What's the rationale of this change?

Regards,
Qingqing



Re: Fix overflow of bgwriter's request queue

From
ITAGAKI Takahiro
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote:

>    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
>    {
> +   if (i >= count)
> +    elog(ERROR, "pendingOpsTable corrupted");
> +
> +   memcpy(&entries[i++], entry, sizeof(PendingOperationEntry));
> +
> +   if (hash_search(pendingOpsTable, entry,
> +       HASH_REMOVE, NULL) == NULL)
> +    elog(ERROR, "pendingOpsTable corrupted");
> +  }
>
> What's the rationale of this change?

AbsorbFsyncRequests will be called during the fsync loop in my patch,
so new files might be added to pendingOpsTable and they will be removed
from the table *before* writing the pages belonging to them.
So I changed it to copy the contents of pendingOpsTable to local
variables and iterate over them later.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



Re: Fix overflow of bgwriter's request queue

From
"Qingqing Zhou"
Date:
"ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote
>
> AbsorbFsyncRequests will be called during the fsync loop in my patch,
> so new files might be added to pendingOpsTable and they will be removed
> from the table *before* writing the pages belonging to them.
> So I changed it to copy the contents of pendingOpsTable to local
> variables and iterate over them later.
>

I see - it is the AbsorbFsyncRequests() added in the mdsync() loop, and you
want to avoid unnecessary fsyncs. But the remove-recover method you use has a
caveat: if any hash_search(HASH_ENTER) fails when you try to reinsert the
entries into pendingOpsTable, you have to raise the error to PANIC, since we
can't get back the missing fds any more.

Regards,
Qingqing



Re: Fix overflow of bgwriter's request queue

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> "ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote
>> AbsorbFsyncRequests will be called during the fsync loop in my patch,
>> so new files might be added to pendingOpsTable and they will be removed
>> from the table *before* writing the pages belonging to them.
>> So I changed it to copy the contents of pendingOpsTable to local
>> variables and iterate over them later.

I think this fear is incorrect.  At the time ForwardFsyncRequest is
called, the backend must *already* have done whatever write it is
concerned about fsync'ing (note that ForwardFsyncRequest may choose
to do the fsync itself).  Therefore it is OK if the bgwriter does that
fsync immediately upon receipt of the request.  There is no constraint
saying that we ever need to delay execution of an fsync request.

> I see - it is the AbsorbFsyncRequests() added in the mdsync() loop, and you
> want to avoid unnecessary fsyncs. But the remove-recover method you use has a
> caveat: if any hash_search(HASH_ENTER) fails when you try to reinsert the
> entries into pendingOpsTable, you have to raise the error to PANIC, since we
> can't get back the missing fds any more.

Yes, the patch is wrong as-is because it may lose uncompleted fsyncs.
But I think that we could just add the AbsorbFsyncRequests call in the
fsync loop and not worry about trying to avoid doing extra fsyncs.

Another possibility is to make the copied list as in the patch, but
HASH_REMOVE an entry only after doing the fsync successfully --- as long
as you don't AbsorbFsyncRequests between doing the fsync and removing
the entry, you aren't risking missing a necessary fsync.  I'm
unconvinced that this is worth the trouble, however.

            regards, tom lane

Re: Fix overflow of bgwriter's request queue

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> Yes, the patch is wrong as-is because it may lose uncompleted fsyncs.
> But I think that we could just add the AbsorbFsyncRequests call in the
> fsync loop and not worry about trying to avoid doing extra fsyncs.
>
> Another possibility is to make the copied list as in the patch, but
> HASH_REMOVE an entry only after doing the fsync successfully --- as long
> as you don't AbsorbFsyncRequests between doing the fsync and removing
> the entry, you aren't risking missing a necessary fsync.  I'm
> unconvinced that this is worth the trouble, however.
>

Maybe taking a copied list is safer. I'm a little afraid of doing a hash
seqscan while doing HASH_ENTER at the same time. Do we have this kind of
hash usage somewhere?

Regards,
Qingqing



Re: Fix overflow of bgwriter's request queue

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> Maybe taking a copied list is safer. I'm a little afraid of doing a hash
> seqscan while doing HASH_ENTER at the same time. Do we have this kind of
> hash usage somewhere?

Sure, it's perfectly safe.  It's unspecified whether the scan will visit
such entries or not (because it might or might not already have passed
their hash bucket), but per above discussion we don't really care.

            regards, tom lane

Re: Fix overflow of bgwriter's request queue

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > "ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote
> >> AbsorbFsyncRequests will be called during the fsync loop in my patch,
> >> so new files might be added to pendingOpsTable and they will be removed
> >> from the table *before* writing the pages belonging to them.
>
> I think this fear is incorrect.  At the time ForwardFsyncRequest is
> called, the backend must *already* have done whatever write it is
> concerned about fsync'ing.

Oops, I was wrong. Also, I see that there is no need to fear endless
loops, because hash seqscans and HASH_ENTER don't conflict.

Attached is a revised patch. It became very simple, but I worry that
one magic number (BUFFERS_PER_ABSORB) is still left.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



Re: Fix overflow of bgwriter's request queue

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> Attached is a revised patch. It became very simple, but I worry that
> one magic number (BUFFERS_PER_ABSORB) is still left.

Have you checked that this version of the patch fixes the problem you
saw originally?  Does the problem come back if you change
BUFFERS_PER_ABSORB to too large a value?  If you can identify a
threshold where the problem reappears in your test case, that would help
us choose the right value to use.

I suspect it'd probably be sufficient to absorb requests every few times
through the fsync loop, too, if you want to experiment with that.

            regards, tom lane

Re: Fix overflow of bgwriter's request queue

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > Attached is a revised patch. It became very simple, but I worry that
> > one magic number (BUFFERS_PER_ABSORB) is still left.
>
> Have you checked that this version of the patch fixes the problem you
> saw originally?  Does the problem come back if you change
> BUFFERS_PER_ABSORB to too large a value?

The problem on my machine was resolved by this patch. I tested it and
logged the maximum of BgWriterShmem->num_requests for each checkpoint.
Test conditions were:
  - shared_buffers = 65536
  - connections = 30
The average of the maximums was 25857 and the overall max was 31807.
They didn't exceed max_requests (= 65536).

> I suspect it'd probably be sufficient to absorb requests every few times
> through the fsync loop, too, if you want to experiment with that.

In the above test, smgrsync took 50 sec to sync 32 files. This means
absorbs were requested every 1.5 sec, which is less frequent than the
absorbs done by the bgwriter's normal activity (bgwriter_delay = 200ms).
So I assume absorbing requests in the fsync loop would not be a problem.


BUFFERS_PER_ABSORB = 10 (one absorb per 1/10 of shared_buffers) is enough,
at least on my machine, but it doesn't necessarily work well in all
environments. If we need to set BUFFERS_PER_ABSORB to a reasonable value,
I think the number of active backends might be useful; for example, half
the number of backends.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



Re: Fix overflow of bgwriter's request queue

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suspect it'd probably be sufficient to absorb requests every few times
>> through the fsync loop, too, if you want to experiment with that.

> In the above test, smgrsync took 50 sec for syncing 32 files. This means
> absorb are requested every 1.5 sec, which is less frequent than absorbs by
> normal activity of bgwriter (bgwriter_delay=200ms).

That seems awfully high to me --- 1.5 sec to fsync a segment file that
is never larger than 1Gb, and probably usually has much less than 1Gb
of dirty data?  I think you must have been testing an atypical case.

I've applied the attached modified version of your patch.  In this
coding, absorbs are done after every 1000 buffer writes in BufferSync
and after every 10 fsyncs in mdsync.  We may need to twiddle these
numbers but it seems at least in the right ballpark.  If you have time
to repeat your original test and see how this does, it'd be much
appreciated.

            regards, tom lane



Re: Fix overflow of bgwriter's request queue

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I've applied the attached modified version of your patch.  In this
> coding, absorbs are done after every 1000 buffer writes in BufferSync
> and after every 10 fsyncs in mdsync.  We may need to twiddle these
> numbers but it seems at least in the right ballpark.  If you have time
> to repeat your original test and see how this does, it'd be much
> appreciated.

Thank you. It worked well on my machine (*).
No undesirable behavior was seen.

(*)
TPC-C(DBT-2)
RHEL4 U1 (2.6.9-11)
XFS, 8 S-ATA disks / 8GB memory(shmem=512MB)

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories