Re: checkpointer continuous flushing - V18 - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing - V18
Date
Msg-id alpine.DEB.2.10.1602211037560.3927@sto
Whole thread Raw
In response to Re: checkpointer continuous flushing - V18  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing - V18
List pgsql-hackers
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that
>  it doesn't take a lot of time.

I put it there initialy to demonstrate that there was no cache performance 
issue when sorting on just buffer indexes. As it is always small, I agree 
that it is not needed. Well, it could be still be in seconds on a very 
large shared buffers setting with a very large checkpoint, but then the 
checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CpktSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little 
on the heavy side. Some compression can be achieved: sizeof(ForlNum) is 4 
bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, 
there are very few tablespaces, they could be given a small number and 
this number could be used instead of the Oid, so the space requirement 
could be reduced to say 16 bytes per buffer by combining space & fork in 2 
shorts and keeping 4 bytes alignement and also getting 8 byte 
alignement... If this is too much, I have shown that it can work with only 
4 bytes per buffer, as the sorting is really just a performance 
optimisation and is not broken if some stuff changes between sorting & 
writeback, but you did not like the idea. If the amount of shared memory 
required is a significant concern, it could be resurrected, though.

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8 
precision is okay for the purpose. I think it is quite certainly fine in 
the worst case with 32 bits buffer_ids, but it would not be if this size 
is changed someday.

* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then 
there those buffers are swept again to compute some per tablespace data 
and organise a heap.

ISTM that nearly all of the collected data on the second sweep could be 
collected on the first sweep, so that this second sweep could be avoided 
altogether. The only missing data is the index of the first buffer in the 
array, which can be computed by considering tablespaces only, sweeping 
over buffers is not needed. That would suggest creating the heap or using 
a hash in the initial buffer sweep to keep this information. This would 
also provide a point where to number tablespaces for compressing the 
CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense. This remark is independent of 
this patch. Probably it works fine because after a sleep the checkpointer 
is behind enough so that it will write a bunch of buffers before sleeping
again.

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There are some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
>  inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible to 
force using it with a "ifdef 0" to skip sync_file_range and check whether 
it does some good there. Idem for the "posix_fadvise" stuff. I can try to 
do that, but it takes time to do so, if someone can test on other OS it 
would be much better. I think that if it works it should be kept in, so it 
is just a matter of testing it.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Next
From: Christoph Berg
Date:
Subject: Re: Relaxing SSL key permission checks