Re: checkpointer continuous flushing - V18 - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: checkpointer continuous flushing - V18 |
Date | |
Msg-id | alpine.DEB.2.10.1602211037560.3927@sto Whole thread Raw |
In response to | Re: checkpointer continuous flushing - V18 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: checkpointer continuous flushing - V18
|
List | pgsql-hackers |
Hallo Andres, Here is a review for the second patch. > For 0002 I've recently changed: > * Removed the sort timing information, we've proven sufficiently that > it doesn't take a lot of time. I put it there initialy to demonstrate that there was no cache performance issue when sorting on just buffer indexes. As it is always small, I agree that it is not needed. Well, it could be still be in seconds on a very large shared buffers setting with a very large checkpoint, but then the checkpoint would be tremendously huge... > * Minor comment polishing. Patch applies and checks on Linux. * CpktSortItem: I think that allocating 20 bytes per buffer in shared memory is a little on the heavy side. Some compression can be achieved: sizeof(ForlNum) is 4 bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, there are very few tablespaces, they could be given a small number and this number could be used instead of the Oid, so the space requirement could be reduced to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping 4 bytes alignement and also getting 8 byte alignement... If this is too much, I have shown that it can work with only 4 bytes per buffer, as the sorting is really just a performance optimisation and is not broken if some stuff changes between sorting & writeback, but you did not like the idea. If the amount of shared memory required is a significant concern, it could be resurrected, though. * CkptTsStatus: As I suggested in the other mail, I think that this structure should also keep a per tablespace WritebackContext so that coalescing is done per tablespace. ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed. If these fields are kept, I think that a comment should justify why float8 precision is okay for the purpose. I think it is quite certainly fine in the worst case with 32 bits buffer_ids, but it would not be if this size is changed someday. * BufferSync After a first sweep to collect buffers to write, they are sorted, and then there those buffers are swept again to compute some per tablespace data and organise a heap. ISTM that nearly all of the collected data on the second sweep could be collected on the first sweep, so that this second sweep could be avoided altogether. The only missing data is the index of the first buffer in the array, which can be computed by considering tablespaces only, sweeping over buffers is not needed. That would suggest creating the heap or using a hash in the initial buffer sweep to keep this information. This would also provide a point where to number tablespaces for compressing the CkptSortItem struct. I'm wondering about calling CheckpointWriteDelay on each round, maybe a minimum amount of write would make sense. This remark is independent of this patch. Probably it works fine because after a sleep the checkpointer is behind enough so that it will write a bunch of buffers before sleeping again. I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved? There are some debug stuff to remove in #ifdefs. I think that the buffer/README should be updated with explanations about sorting in the checkpointer. > I think this patch primarily needs: > * Benchmarking on FreeBSD/OSX to see whether we should enable the > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm > inclined to leave it off till then. I do not have that. As "msync" seems available on Linux, it is possible to force using it with a "ifdef 0" to skip sync_file_range and check whether it does some good there. Idem for the "posix_fadvise" stuff. I can try to do that, but it takes time to do so, if someone can test on other OS it would be much better. I think that if it works it should be kept in, so it is just a matter of testing it. -- Fabien.
pgsql-hackers by date: