Re: checkpointer continuous flushing - V18 - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: checkpointer continuous flushing - V18 |
Date | |
Msg-id | alpine.DEB.2.10.1602202052260.3927@sto Whole thread Raw |
In response to | Re: checkpointer continuous flushing - V18 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: checkpointer continuous flushing - V18
|
List | pgsql-hackers |
Hello Andres, > For 0001 I've recently changed: > * Don't schedule writeback after smgrextend() - that defeats linux > delayed allocation mechanism, increasing fragmentation noticeably. > * Add docs for the new GUC variables > * comment polishing > * BackendWritebackContext now isn't dynamically allocated anymore > > > I think this patch primarily needs: > * review of the docs, not sure if they're easy enough to > understand. Some language polishing might also be needed. Yep, see below. > * review of the writeback API, combined with the smgr/md.c changes. See various comments below. > * Currently *_flush_after can be set to a nonzero value, even if there's > no support for flushing on that platform. Imo that's ok, but perhaps > other people's opinion differ. In some previous version I think a warning was shown of the feature was requested but not available. Here are some quick comments on the patch: Patch applies cleanly on head. Compiled and checked on Linux. Compilation issues on other systems, see below. When pages are written by a process (checkpointer, bgwriter, backend worker), the list of recently written pages is kept and every so often an advisory fsync (sync_file_range, other options for other systems) is issued so that the data is sent to the io system without relying on more or less (un)controllable os policy. The documentation seems to use "flush" but the code talks about "writeback" or "flush", depending. I think one vocabulary, whichever it is, should be chosen and everything should stick to it, otherwise everything look kind of fuzzy and raises doubt for the reader (is it the same thing? is it something else?). I initially used "flush", but it seems a bad idea because it has nothing to do with the flush function, so I'm fine with writeback or anything else, I just think that *one* word should be chosen and used everywhere. The sgml documentation about "*_flush_after" configuration parameter talks about bytes, but the actual unit should be buffers. I think that keeping a number of buffers should be fine, because that is what the internal stuff will manage, not bytes. Also, the maximum value (128 ?) should appear in the text. In the discussion in the wal section, I'm not sure about the effect of setting writebacks on SSD, but I think that you have made some tests so maybe you have an answer and the corresponding section could be written with some more definitive text than "probably brings no benefit on SSD". A good point of the whole approach is that it is available to all kind of pg processes. However it does not address the point that bgwriter and backends basically issue random writes, so I would not expect much positive effect before these writes are somehow sorted, which means doing some compromise in the LRU/LFU logic... well, all this is best kept for later, and I'm fine to have the logic flushing logic there. I'm wondering why you choose 16 & 64 as default for backends & bgwriter, though. IssuePendingWritebacks: you merge only strictly neightboring writes. Maybe the merging strategy could be more aggressive than just strict neighbors? mdwriteback: all variables could be declared within the while, I do not understand why some are in and some are out. ISTM that putting writeback management at the relation level does not help a lot, because you have to translate again from relation to files. The good news is that it should work as well, and that it does avoid the issue that the file may have been closed in between, so why not. The PendingWriteback struct looks useless. I think it should be removed, and maybe put back if one day if it is needed, which I rather doubt it. struct WritebackContext: keeping a pointer to guc variables is a kind of trick, I think it deserves a comment. ScheduleBufferTagForWriteback: the "pending" variable is not very useful. Maybe consider shortening the "pending_writebacks" field name to "writebacks"? IssuePendingWritebacks: I understand that qsort is needed "again" because when balancing writes over tablespaces they may be intermixed. AFAICR I used a "flush context" for each table space in some version I submitted, because I do think that this whole writeback logic really does make sense *per table space*, which suggest that there should be as many write backs contexts as table spaces, otherwise the positive effect may going to be totally lost of tables spaces are used. Any thoughts? Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always true, I think, it is already checked in the initialization and when setting gucs. SyncOneBuffer: I'm wonder why you copy the tag after releasing the lock. I guess it is okay because it is still pinned. pg_flush_data: in the first #elif, "context" is undeclared line 446. Label "out" is not defined line 455. In the second #elif, "context" is undeclared line 490 and label "out" line 500 is not defined either. For the checkpointer, a key aspect is that the scheduling process goes to sleep from time to time, and this sleep time looked like a great opportunity to do this kind of flushing. You choose not to take advantage of the behavior, why? -- Fabien.
pgsql-hackers by date: