Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1508101818070.13396@sto
Whole thread Raw
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

Thanks for your comments. Some answers and new patches included.

>> + /*
>> + * Array of buffer ids of all buffers to checkpoint.
>> + */
>> +static int *CheckpointBufferIds = NULL;
>
> Should be at the beginning of the file. There's a bunch more cases of that.

done.

>> +/* Compare checkpoint buffers
>> + */
>> +static int bufcmp(const int * pa, const int * pb)
>> +{
>> +    BufferDesc
>> +        *a = GetBufferDescriptor(*pa),
>> +        *b = GetBufferDescriptor(*pb);
>
> This definitely needs comments about ignoring the normal buffer header
> locking.

Added.

> Why are we ignoring the database directory? I doubt it'll make a huge
> difference, but grouping metadata affecting operations by directory
> helps.

I wanted to do the minimal comparisons to order buffers per file, so I 
skipped everything else. My idea of a checkpoint is a lot of data in a few 
files (at least compared to the data...), so I do not think that it is 
worth it. I may be proven wrong!

>> +static void
>> +AllocateCheckpointBufferIds(void)
>> +{
>> +    /* Safe worst case allocation, all buffers belong to the checkpoint...
>> +     * that is pretty unlikely.
>> +     */
>> +    CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
>> +}
>
> (wrong comment style...)

Fixed.

> Heikki, you were concerned about the size of the allocation of this, 
> right? I don't think it's relevant - we used to allocate an array of 
> that size for the backend's private buffer pin array until 9.5, so in 
> theory we should be safe agains that. NBuffers is limited to INT_MAX/2 
> in guc.ċ, which ought to be sufficient?

I think that there is no issue with the current shared_buffers limit. I 
could allocate and use 4 GB on my laptop without problem. I added a cast 
to ensure that unsigned int are used for the size computation.

>> + /* + * Lazy allocation: this function is called through the 
>> checkpointer, + * but also by initdb. Maybe the allocation could be 
>> moved to the callers. + */ + if (CheckpointBufferIds == NULL) + 
>> AllocateCheckpointBufferIds(); +
>>
>
> I don't think it's a good idea to allocate this on every round.
> That just means a lot of page table entries have to be built and torn 
> down regularly. It's not like checkpoints only run for 1% of the time or 
> such.

Sure. It is not allocated on every round, it is allocated once on the 
first checkpoint, the variable tested is static. There is no free. Maybe
the allocation could be moved to the callers, though.

> FWIW, I still think it's a much better idea to allocate the memory once
> in shared buffers.

Hmmm. The memory does not need to be shared with other processes?

> It's not like that makes us need more memory overall, and it'll be huge 
> page allocations if configured. I also think that sooner rather than 
> later we're going to need more than one process flushing buffers, and 
> then it'll need to be moved there.

That is an argument. I think that it could wait for the need to actually 
arise.

>> +    /*
>> +     * Sort buffer ids to help find sequential writes.
>> +     *
>> +     * Note: buffers are not locked in anyway, but that does not matter,
>> +     * this sorting is really advisory, if some buffer changes status during
>> +     * this pass it will be filtered out later.  The only necessary property
>> +     * is that marked buffers do not move elsewhere.
>> +     */
>
> That reasoning makes it impossible to move the fsyncing of files into 
> the loop (whenever we move to a new file). That's not nice.

I do not see why. Moving rsync ahead is definitely an idea that you 
already pointed out, I have given it some thoughts, and it would require 
a carefull implementation and some restructuring. For instance, you do not 
want to issue fsync right after having done writes, you want to wait a 
little bit so that the system had time to write the buffers to disk.

> The formulation with "necessary property" doesn't seem very clear to me?

Removed.

> How about: /* * Note: Buffers are not locked in any way during sorting, 
> but that's ok: * A change in the buffer header is only relevant when it 
> changes the * buffer's identity. If the identity has changed it'll have 
> been * written out by BufferAlloc(), so there's no need for checkpointer 
> to * write it out anymore. The buffer might also get written out by a * 
> backend or bgwriter, but that's equally harmless. */

This new version included.

>>          Also, qsort implementation
>> +     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
>> +     * because of these possible concurrent changes.
>
> Hm. Is that actually the case for our qsort implementation?

I think that it is hard to write a qsort which would fail that. That would 
mean that it would compare the same items twice, which would be 
inefficient.

> If the pivot element changes its identity won't the result be pretty 
> much random?

That would be a very unlikely event, given the short time spent in qsort. 
Anyway, this is not a problem, and is the beauty of the "advisory" sort: 
if the sort is wrong because of any such rare event, it just mean that the 
buffers would not be strictly in file order, which is currently the 
case.... Well, too bad, but the correctness of the checkpoint does not 
depend on it, that just mean that the checkpointer would come back twice 
on one file, no big deal.

>> +    if (checkpoint_sort && num_to_write > 1 && false)
>> +    {
>
> && false - Huh?

Probably Heikki tests.

>> +        qsort(CheckpointBufferIds, num_to_write,  sizeof(int),
>> +                  (int(*)(const void *, const void *)) bufcmp);
>> +
>
> Ick, I'd rather move the typecasts to the comparator.

Done.

>> +        for (i = 1; i < num_to_write; i++)
>> +        { [...]
>
> This really deserves some explanation.

I think that this version does not work. I've reinstated my version and a 
lot of comments in the attached patches.

Please find attached two combined patches which provide both features one 
after the other.

(a) shared buffer sorting
 - I took Heikki hint about restructuring the buffer selection in a   separate function, which makes the code much more
readable.
 - I also followed Heikki intention (I think) that only active   table spaces are considered in the switching loop.

(b) add asynchronous flushes on top of the previous sort patch



I think that the many performance results I reported show that the 
improvements need both features, and one feature without the other is much 
less effective at improving responsiveness, which is my primary concern.
The TPS improvements are just a side effect.

I did not remove the gucs: I think it could be kept so that people can 
test around with it, and they may be removed in the future? I would be 
also fine if they are removed.

There are a lot of comments in some places. I think that they should be 
kept because the code is subtle.

-- 
Fabien.

pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Summary of plans to avoid the annoyance of Freezing
Next
From: "Steve Thames"
Date:
Subject: pg_dump and search_path