Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1509092056380.21932@sto
Whole thread Raw
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

>>> Wouldn't it be just as easy to put this logic into the checkpointing 
>>> code?
>>
>> Not sure it would simplify anything, because the checkpointer currently
>> knows about buffers but flushing is about files, which are hidden from
>> view.
>
> It'd not really simplify things, but it'd keep it local.

Ok, it would be local, but it would also mean that the checkpointer would 
have to deal explicitely with files, whereas it currently does not have 
to.

I think that the current buffer/file boundary is on engineering principle 
a good one, so I tried to break it as little as possible to enable the 
feature, and I wanted to avoid to have to do a buffer to file translation 
twice, once in the checkpointer and once when writing the buffer.

>>> * Wouldn't a binary heap over the tablespaces + progress be nicer?
>>
>> I'm not sure where it would fit exactly.
>
> Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> progress, progress_inc, nextbuf) where the comparator compares the progress.

It would replace what is currently an array. The balancing code needs to 
enumerate all tablespaces in a round-robin way so as to ensure that all 
tablespaces are given some attention, otherwise you could have a balance 
on two tablespaces but others could be left out. The array makes this 
property straightforward.


>> Anyway, I think it would complicate the code significantly (compared to the
>> straightforward array)
>
> I doubt it. I mean instead of your GetNext you'd just do:
>    next_tblspc = DatumGetPointer(binaryheap_first(heap));
>    if (next_tblspc == 0)
>        return 0;
>    next_tblspc.progress += next_tblspc.progress_slice;
>    binaryheap_replace_first(PointerGetDatum(next_tblspc));
>
>    return next_tblspc.nextbuf++;

Compare to the array, this tree approach would required ln(Ntablespace) 
work to extract and reinsert the tablespace under progress, so there is no 
complexity advantage.

Moreover, given that in most cases there are 1 or 2 tablespaces, a tree 
structure is really on the heavy side.

> progress_slice is the number of buffers in the tablespace divided by the
> number of total buffers, to avoid doing any sort of expensive math in
> the more frequently executed path.

If there are many buffers, I'm not too sure about rounding issues and the 
like, so the current approach with a rational seems more secure.

> [...] I'm not seing where you'd need an extra pointer?

Indeed, I misunderstood.

> [...] What for would you need to bsearch?

To find the tablespace boundaries in the sorted buffer array in 
log(NBuffers) * Ntablespace, instead of NBuffers.

>> I think that the current approach is ok as the number of tablespace
>> should be small.
>
> Right that's often the case.

Yep.

>> ISTM that using a tablespace in the sorting would reduce the complexity to
>> ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
>> (Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
>> memory and code complexity.
>
> Afaics finding the boundaries can be done as part of the enumeration of
> tablespaces in BufferSync(). That code needs to be moved, but that's not
> too bad. I don't see the code be that much more complicated?

Hmmm. you are proposing to replace prooved and heavilly tested code by a 
more complex tree data structures distributed quite differently around the 
source, and no very clear benefit.

So I would prefer to keep the code as is, that is pretty straightforward, 
and wait for a strong incentive before doing anything fancier. ISTM that 
there are other places in pg need attention more that further tweaking 
this patch.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: checkpointer continuous flushing
Next
From: Andres Freund
Date:
Subject: Re: checkpointer continuous flushing