Logical tape pause/resume - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Logical tape pause/resume |
Date | |
Msg-id | 34678beb-938e-646e-db9f-a7def5c44ada@iki.fi Whole thread Raw |
Responses |
Re: Logical tape pause/resume
(Simon Riggs <simon@2ndquadrant.com>)
Re: Logical tape pause/resume (Claudio Freire <klaussfreire@gmail.com>) Re: Logical tape pause/resume (Peter Geoghegan <pg@heroku.com>) |
List | pgsql-hackers |
One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] is to put a cap on the number of tapes to use for the sort. The reason is that each tape consumes 3 * BLCKSZ worth of memory, for the buffers. We decide the max number of tapes upfront, so if we decide that e.g. the max number of tapes is 1000, we reserve 1000 * 3 * BLCKSZ bytes of memory for the buffers, and that memory cannot then be used for the quicksorts to build the initial runs, even if it turns out later that we needed only a few tapes. That amounts to about 8% of the available memory. That's quite wasteful. Peter's approach of putting an arbitrary cap on the max number of tapes works, but I find it a bit hackish. And you still waste that 8% with smaller work_mem settings. When we finish writing an initial run to a tape, we keep the tape buffers around. That's the reason we need to reserve that memory. But there's no fundamental reason we have to do that. As I suggested in [2], we could flush the buffers to disk, and only keep in memory the location of the last block we wrote. If we need to write another run to the tape, we can reload the last incomplete block from the disk, and continue writing. Reloading the last block, requires an extra I/O. That's OK. It's quite rare to have a multi-pass sort these days, so it's a good bet that you don't need to do it. And if you have a very small work_mem, so that you need to do a multi-pass sort, having some more memory available for building the initial runs is probably worth it, and there's also a good chance that the block is still in the OS cache. So, here are two patches to that end: 1. The first patch changes the way we store the logical tapes on disk. Instead of using indirect blocks, HFS style, the blocks form a linked list. Each block has a trailer, with the block numbers of the previous and next block of the tape. That eliminates the indirect blocks, which simplifies the code quite a bit, and makes it simpler to implement the pause/resume functionality in the second patch. It also immediately reduces the memory needed for the buffers, from 3 to 1 block per tape, as we don't need to hold the indirect blocks in memory. 2. The second patch adds a new function LogicalTapePause(). A paused tape can be be written to, but it doesn't have an in-memory buffer. The last, incomplete block is written to disk, but if you call LogicalTapeWrite(), a new buffer is allocated and the last block is read back into memory. If you don't write to the tape anymore, and call LogicalTapeRewind() after LogicalTapePause(), nothing needs to be done, as the last block is already on disk. In addition to saving a little bit of memory, I'd like to do this refactoring because it simplifies the code. It's code that has stayed largely unchanged for the past 15 years, so I'm not too eager to touch it, but looking at the changes coming with Peter's parallel tuplesort patch set, I think this makes sense. The parallel tuplesort patch set adds code for "serializing" and "deserializing" a tape, which means writing the top indirect block to disk, so that it can be read back in in the leader process. That's awfully similar to the "pause/resume" functionality here, so by adding it now, we won't have to do it in the parallel tuplesort patch. (Admittedly, it's not a whole lot of code, but still.) [1] CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com [2] e8f44b63-4745-b855-7772-e8201906a4a1@iki.fi - Heikki
Attachment
pgsql-hackers by date: