Thread: Fix overflow of bgwriter's request queue
Attached is a patch that fixes overflow of bgwriter's file-fsync request
queue. The overflow occurred under heavy update workloads and degraded
performance. I have sent the details to HACKERS.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Attachment
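For context: the bgwriter keeps a fixed-size queue of fsync requests in shared
memory, and backends hand their fsync work to it through ForwardFsyncRequest();
when the queue is full the hand-off fails and the backend has to fsync for
itself, which is where the slowdown under heavy update workloads comes from.
The standalone sketch below only models that overflow behavior; every structure
and function name in it is illustrative, not the actual bgwriter.c code.

	/* Simplified model of a fixed-size fsync request queue overflowing. */
	#include <stdbool.h>
	#include <stdio.h>

	typedef struct FsyncRequest
	{
		int		rel_id;			/* which relation file */
		int		segno;			/* which 1Gb segment of it */
	} FsyncRequest;

	typedef struct RequestQueue
	{
		int		num_requests;	/* current queue length */
		int		max_requests;	/* allocated size of requests[] */
		FsyncRequest requests[8];	/* tiny on purpose, to show overflow */
	} RequestQueue;

	/* Backend side: try to hand the fsync off to the bgwriter. */
	static bool
	forward_fsync_request(RequestQueue *q, int rel_id, int segno)
	{
		if (q->num_requests >= q->max_requests)
			return false;		/* queue full: caller must fsync itself */
		q->requests[q->num_requests].rel_id = rel_id;
		q->requests[q->num_requests].segno = segno;
		q->num_requests++;
		return true;
	}

	int
	main(void)
	{
		RequestQueue q = {0, 8, {{0, 0}}};
		int		i;

		/* A heavy update workload produces more requests than the queue holds. */
		for (i = 0; i < 20; i++)
		{
			if (!forward_fsync_request(&q, i, 0))
				printf("request %d overflowed: backend fsyncs by itself\n", i);
		}
		return 0;
	}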
"ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote > > Attached is a patch that fixes overflow of bgwriter's file-fsync request > queue. > while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { + if (i >= count) + elog(ERROR, "pendingOpsTable corrupted"); + + memcpy(&entries[i++], entry, sizeof(PendingOperationEntry)); + + if (hash_search(pendingOpsTable, entry, + HASH_REMOVE, NULL) == NULL) + elog(ERROR, "pendingOpsTable corrupted"); + } What's the rationale of this change? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote: > while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) > { > + if (i >= count) > + elog(ERROR, "pendingOpsTable corrupted"); > + > + memcpy(&entries[i++], entry, sizeof(PendingOperationEntry)); > + > + if (hash_search(pendingOpsTable, entry, > + HASH_REMOVE, NULL) == NULL) > + elog(ERROR, "pendingOpsTable corrupted"); > + } > > What's the rationale of this change? AbsorbFsyncRequests will be called during the fsync loop in my patch, so new files might be added to pendingOpsTable and they will be removed from the table *before* writing the pages belonging to them. So I changed it to copy the contents of pendingOpsTable to a local variables and iterate on the vars later. --- ITAGAKI Takahiro NTT Cyber Space Laboratories
"ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote > > AbsorbFsyncRequests will be called during the fsync loop in my patch, > so new files might be added to pendingOpsTable and they will be removed > from the table *before* writing the pages belonging to them. > So I changed it to copy the contents of pendingOpsTable to a local > variables and iterate on the vars later. > I see - it is the AbsorbFsyncRequests() added in mdsync() loop and you want to avoid unecessary fsyncs. But the remove-recover method you use has a caveat: if any hash_search(HASH_ENTER) failed when you try to reinsert them into the pendingOpsTable, you have to raise the error to PANIC since we can't get back the missing fds any more. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote >> AbsorbFsyncRequests will be called during the fsync loop in my patch, >> so new files might be added to pendingOpsTable and they will be removed >> from the table *before* writing the pages belonging to them. >> So I changed it to copy the contents of pendingOpsTable to a local >> variables and iterate on the vars later. I think this fear is incorrect. At the time ForwardFsyncRequest is called, the backend must *already* have done whatever write it is concerned about fsync'ing (note that ForwardFsyncRequest may choose to do the fsync itself). Therefore it is OK if the bgwriter does that fsync immediately upon receipt of the request. There is no constraint saying that we ever need to delay execution of an fsync request. > I see - it is the AbsorbFsyncRequests() added in mdsync() loop and you want > to avoid unecessary fsyncs. But the remove-recover method you use has a > caveat: if any hash_search(HASH_ENTER) failed when you try to reinsert them > into the pendingOpsTable, you have to raise the error to PANIC since we > can't get back the missing fds any more. Yes, the patch is wrong as-is because it may lose uncompleted fsyncs. But I think that we could just add the AbsorbFsyncRequests call in the fsync loop and not worry about trying to avoid doing extra fsyncs. Another possibility is to make the copied list as in the patch, but HASH_REMOVE an entry only after doing the fsync successfully --- as long as you don't AbsorbFsyncRequests between doing the fsync and removing the entry, you aren't risking missing a necessary fsync. I'm unconvinced that this is worth the trouble, however. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > Yes, the patch is wrong as-is because it may lose uncompleted fsyncs. > But I think that we could just add the AbsorbFsyncRequests call in the > fsync loop and not worry about trying to avoid doing extra fsyncs. > > Another possibility is to make the copied list as in the patch, but > HASH_REMOVE an entry only after doing the fsync successfully --- as long > as you don't AbsorbFsyncRequests between doing the fsync and removing > the entry, you aren't risking missing a necessary fsync. I'm > unconvinced that this is worth the trouble, however. > Maybe the take a copied list is safer. I got a little afraid of doing seqscan hash while doing HASH_ENTER at the same time. Do we have this kind of hash usage somewhere? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > Maybe the take a copied list is safer. I got a little afraid of doing > seqscan hash while doing HASH_ENTER at the same time. Do we have this kind > of hash usage somewhere? Sure, it's perfectly safe. It's unspecified whether the scan will visit such entries or not (because it might or might not already have passed their hash bucket), but per above discussion we don't really care. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "ITAGAKI Takahiro" <itagaki.takahiro@lab.ntt.co.jp> wrote
> >> AbsorbFsyncRequests will be called during the fsync loop in my patch,
> >> so new files might be added to pendingOpsTable and they will be removed
> >> from the table *before* writing the pages belonging to them.
>
> I think this fear is incorrect.  At the time ForwardFsyncRequest is
> called, the backend must *already* have done whatever write it is
> concerned about fsync'ing.

Oops, I was wrong. Also, I see there is no need to fear endless loops,
because a hash seqscan and HASH_ENTER don't conflict.

Attached is a revised patch. It is now very simple, but I worry that one
magic number (BUFFERS_PER_ABSORB) is still left.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Attachment
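Judging from the follow-up messages, BUFFERS_PER_ABSORB in the revised patch
divides shared_buffers into chunks between absorbs. A hedged sketch of that
shape follows; apart from BUFFERS_PER_ABSORB and AbsorbFsyncRequests(),
everything here (function name, loop structure) is illustrative rather than
the actual BufferSync() code, and the value 10 is the one Itagaki reports
testing below.

	#define BUFFERS_PER_ABSORB	10	/* absorb once per 1/10 of shared_buffers */

	static void
	buffer_sync_sketch(int nbuffers)
	{
		int		absorb_every = Max(nbuffers / BUFFERS_PER_ABSORB, 1);
		int		since_absorb = 0;
		int		buf_id;

		for (buf_id = 0; buf_id < nbuffers; buf_id++)
		{
			/* ... write out this buffer if it is dirty ... */

			if (++since_absorb >= absorb_every)
			{
				AbsorbFsyncRequests();	/* drain backends' queued fsync requests */
				since_absorb = 0;
			}
		}
	}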
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> Attached is a revised patch. It is now very simple, but I worry that one
> magic number (BUFFERS_PER_ABSORB) is still left.

Have you checked that this version of the patch fixes the problem you
saw originally?  Does the problem come back if you change
BUFFERS_PER_ABSORB to too large a value?  If you can identify a
threshold where the problem reappears in your test case, that would
help us choose the right value to use.

I suspect it'd probably be sufficient to absorb requests every few times
through the fsync loop, too, if you want to experiment with that.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> > Attached is a revised patch. It is now very simple, but I worry that one
> > magic number (BUFFERS_PER_ABSORB) is still left.
>
> Have you checked that this version of the patch fixes the problem you
> saw originally?  Does the problem come back if you change
> BUFFERS_PER_ABSORB to too large a value?

The problem on my machine was resolved by this patch. I tested it and logged
the maximum of BgWriterShmem->num_requests for each checkpoint. The test
conditions were:
  - shared_buffers = 65536
  - connections = 30
The average of the per-checkpoint maximums was 25857 and the overall maximum
was 31807. They did not exceed max_requests (= 65536).

> I suspect it'd probably be sufficient to absorb requests every few times
> through the fsync loop, too, if you want to experiment with that.

In the above test, smgrsync took 50 sec to sync 32 files. This means absorbs
were requested every 1.5 sec, which is less frequent than the absorbs done by
the normal activity of the bgwriter (bgwriter_delay = 200ms). So I assume that
absorbing requests in the fsync loop would not be a problem.

BUFFERS_PER_ABSORB = 10 (one absorb per 1/10 of shared_buffers) is enough at
least on my machine, but it doesn't necessarily work well in all environments.
If we need to set BUFFERS_PER_ABSORB to a reasonable value, I think the number
of active backends might be useful; for example, half the number of backends.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suspect it'd probably be sufficient to absorb requests every few times
>> through the fsync loop, too, if you want to experiment with that.

> In the above test, smgrsync took 50 sec to sync 32 files. This means absorbs
> were requested every 1.5 sec, which is less frequent than the absorbs done by
> the normal activity of the bgwriter (bgwriter_delay = 200ms).

That seems awfully high to me --- 1.5 sec to fsync a segment file that
is never larger than 1Gb, and probably usually has much less than 1Gb
of dirty data?  I think you must have been testing an atypical case.

I've applied the attached modified version of your patch.  In this
coding, absorbs are done after every 1000 buffer writes in BufferSync
and after every 10 fsyncs in mdsync.  We may need to twiddle these
numbers, but it seems at least in the right ballpark.  If you have time
to repeat your original test and see how this does, it'd be much
appreciated.

			regards, tom lane
Attachment
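A hedged sketch of the scheme Tom describes: absorb after every 1000 buffer
writes in BufferSync and after every 10 fsyncs in mdsync. The macro, variable,
and function names below are illustrative; only the two counts come from his
message. The caller initializes each counter to the corresponding
*_PER_ABSORB value before entering its loop.

	#define WRITES_PER_ABSORB	1000	/* in the BufferSync() write loop */
	#define FSYNCS_PER_ABSORB	10		/* in the mdsync() fsync loop */

	/* Called once per buffer actually written during a checkpoint. */
	static void
	note_buffer_written(int *absorb_counter)
	{
		if (--(*absorb_counter) <= 0)
		{
			AbsorbFsyncRequests();
			*absorb_counter = WRITES_PER_ABSORB;
		}
	}

	/* Called once per file fsync'd in the mdsync() loop. */
	static void
	note_fsync_done(int *absorb_counter)
	{
		if (--(*absorb_counter) <= 0)
		{
			AbsorbFsyncRequests();
			*absorb_counter = FSYNCS_PER_ABSORB;
		}
	}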
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've applied the attached modified version of your patch.  In this
> coding, absorbs are done after every 1000 buffer writes in BufferSync
> and after every 10 fsyncs in mdsync.  We may need to twiddle these
> numbers, but it seems at least in the right ballpark.  If you have time
> to repeat your original test and see how this does, it'd be much
> appreciated.

Thank you. It worked well on my machine (*); no undesirable behavior was seen.

(*) TPC-C (DBT-2), RHEL4 U1 (2.6.9-11), XFS,
    8 S-ATA disks / 8GB memory (shmem = 512MB)

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories