Re: checkpointer continuous flushing - V18 - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing - V18
Date
Msg-id alpine.DEB.2.10.1602202052260.3927@sto
Whole thread Raw
In response to Re: checkpointer continuous flushing - V18  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing - V18
List pgsql-hackers
Hello Andres,

> For 0001 I've recently changed:
> * Don't schedule writeback after smgrextend() - that defeats linux
>  delayed allocation mechanism, increasing fragmentation noticeably.
> * Add docs for the new GUC variables
> * comment polishing
> * BackendWritebackContext now isn't dynamically allocated anymore
>
>
> I think this patch primarily needs:
> * review of the docs, not sure if they're easy enough to
>  understand. Some language polishing might also be needed.

Yep, see below.

> * review of the writeback API, combined with the smgr/md.c changes.

See various comments below.

> * Currently *_flush_after can be set to a nonzero value, even if there's
>  no support for flushing on that platform. Imo that's ok, but perhaps
>  other people's opinion differ.

In some previous version I think a warning was shown of the feature was
requested but not available.


Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation 
issues on other systems, see below.

When pages are written by a process (checkpointer, bgwriter, backend worker),
the list of recently written pages is kept and every so often an advisory
fsync (sync_file_range, other options for other systems) is issued so that
the data is sent to the io system without relying on more or less
(un)controllable os policy.

The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything look kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kind
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine to have the logic flushing logic there. I'm wondering why you
choose 16 & 64 as default for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neightboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?

mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back if one day if it is needed, which I rather doubt it.

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggest that there should be as
many write backs contexts as table spaces, otherwise the positive effect
may going to be totally lost of tables spaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think, it is already checked in the initialization and when setting
gucs.

SyncOneBuffer: I'm wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Tutorial document
Next
From: Andres Freund
Date:
Subject: Re: checkpointer continuous flushing - V18