Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1508101942030.13396@sto
Whole thread Raw
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

> You can't allocate 4GB with palloc(), it has a builtin limit against
> allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

>> [...]
> Well, then everytime the checkpointer is restarted.

Hm...

> The point is that it's done at postmaster startup, and we're pretty much
> guaranteed that the memory will availabl.e.

Ok ok, I stop resisting... I'll have a look.

Would it also fix the 1 GB palloc limit on the same go? I guess so...


>>> That reasoning makes it impossible to move the fsyncing of files into the
>>> loop (whenever we move to a new file). That's not nice.
>>
>> I do not see why.
>
> Because it means that the sorting isn't necessarily correct. I.e. we
> can't rely on it to determine whether a file has already been fsynced.

Ok, I understand your point.

Then the file would be fsynced twice: if the fsync is done properly (data 
have already been flushed to disk) then it would not cost much, and doing 
it sometimes twice on some file would not be a big issue. The code could 
also detect such event and log a warning, which would give a hint about 
how often it occurs in practice.

>>> Hm. Is that actually the case for our qsort implementation?
>>
>> I think that it is hard to write a qsort which would fail that. That would
>> mean that it would compare the same items twice, which would be inefficient.
>
> What? The same two elements aren't frequently compared pairwise with 
> each other, but of course an individual element is frequently compared 
> with other elements.

Sure.

> Consider what happens when the chosen pivot element changes its identity 
> after already dividing half. The two partitions will not be divided in 
> any meaning full way anymore. I don't see how this will results in a 
> meaningful sort.

It would be partly meaningful, which is enough for performance, and does 
not matter for correctness: currently buffers are not sorted at all and it 
works, even if it does not work well.

>>> If the pivot element changes its identity won't the result be pretty much
>>> random?
>>
>> That would be a very unlikely event, given the short time spent in
>> qsort.
>
> Meh, we don't want to rely on "likeliness" on such things.

My main argument is that even if it occurs, and the qsort result is partly 
wrong, it does not change correctness, it just mean that the actual writes 
will be less in order than wished. If it occurs, one pivot separation 
would be quite strange, but then others would be right, so the buffers 
would be "partly sorted".

Another issue I see is that even if buffers are locked within cmp, the 
status may change between two cmp... I do not think that locking all 
buffers for sorting them is an option. So on the whole, I think that 
locking buffers for sorting is probably not possible with the simple (and 
efficient) lightweight approach used in the patch.

The good news, as I argued before, is that the order is only advisory to 
help with performance, but the correctness is really that all checkpoint 
buffers are written and fsync is called in the end, and does not depend on 
the buffer order. That is how it currently works anyway.

If you block on this then I'll put a heavy weight approach, but that would 
be a waste of memory in my opinion, hence my argumentation for the 
lightweight approach.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Summary of plans to avoid the annoyance of Freezing
Next
From: Waldemar Brodkorb
Date:
Subject: linux sparc compile issue