Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoZ5g7UzMvM_42YMG8nbhOYpH+u5OMMnePJkYtT5HWotUw@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> [ new patches ]

I took a look at 0012 today, Amit's patch for extending the binary heap machinery, and 0013, Amit's patch for "Infrastructure to register and fetch undo action requests."

I don't think that binaryheap_allocate_shm() is a good design. It presupposes that we want to store the binary heap as its own chunk of shared memory allocated via ShmemInitStruct(), but we might want to do something else, like embed it in another structure, store it in a DSM or DSA, etc., and this function can't do any of that. I think we should have something more like:

extern Size binaryheap_size(int capacity);
extern void binaryheap_initialize(binaryheap *, int capacity,
                                  binaryheap_comparator compare, void *arg);

Then the caller can do something like:

sz = binaryheap_size(capacity);
bh = ShmemInitStruct(name, sz, &found);
if (!found)
    binaryheap_initialize(bh, capacity, comparator, whatever);

If it wants to get the memory in some other way, it just needs to initialize bh differently; the rest is the same. Note that there is no need, in this design, for binaryheap_size/initialize to make use of "shared" memory. They could equally well be used on backend-local memory. They do not need to care. You just provide the memory, and they do their thing.

I wasn't very happy about binaryheap_nth(), binaryheap_remove_nth(), and binaryheap_remove_nth_unordered(), and started looking at how they are used to try to see if there might be a better way. That led me to look at 0013. Unfortunately, I find it really hard to understand what this code is actually doing. There's a lot of redundant and badly-written stuff in here.

As a general principle, if you have two or three data structures of some particular type, you don't write a separate family of functions for manipulating each one. You write one function for each operation, and you pass the particular copy of the data structure with which you are working as an argument. In the lengthy section of macro definitions at the top of undorequest.c, we have macros InitXidQueue, XidQueueIsEmpty, GetXidQueueSize, GetXidQueueElem, GetXidQueueTopElem, GetXidQueueNthElem, and SetXidQueueElem. Several of these are used in only one place or not used anywhere at all; the unused ones should be removed altogether, and the single-use ones inlined into their call sites. Then, after this, there is a matching set of macros: InitSizeQueue, SizeQueueIsEmpty, GetSizeQueueSize, GetSizeQueueElem, GetSizeQueueTopElem, GetSizeQueueNthElem, and SetSizeQueueElem. Many of these macros are exactly the same as the previous set except that they operate on a different queue, which, as I mentioned in the previous paragraph, is not a good design. It leads to extensive code duplication.

Look, for example, at RemoveOldElemsFromSizeQueue and RemoveOldElemsFromXidQueue. They are basically identical except for s/Size/Xid/g and s/SIZE/XID/g, but you can't unify them easily because they are calling different functions. However, if you didn't have one function called GetSizeQueueSize and another called GetXidQueueSize, but just had a pointer to the relevant binary heap, then both functions could just call binaryheap_empty() on it, which would be better style, use fewer macros, generate less machine code, and be easier to read. Ideally, you'd get to the point where you could just have one function rather than two, and pass the queue upon which it should operate as an argument.
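To illustrate what I mean, here is a rough sketch; ElemIsOld() is just a stand-in for whatever condition the existing functions use to decide that the front entry should be discarded:

static void
RemoveOldElemsFromQueue(binaryheap *queue)
{
    /* one function for all the queues: the caller says which heap to trim */
    while (!binaryheap_empty(queue) &&
           ElemIsOld(binaryheap_first(queue)))
        (void) binaryheap_remove_first(queue);
}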
There seems to be a good deal of this kind of duplication in this file, and it really needs to be cleaned up.

Now, one objection to the above line of attack is that the different queues actually contain different types of elements. Apparently, the XID queue contains elements of type UndoXidQueue and the size queue contains elements of type UndoSizeQueue. It is worth noting here that these are bad type names, because they sound like they are describing a type of queue, but it seems that they are actually describing an element in the queue. However, there are two larger problems:

1. I don't think we should have three different kinds of objects for the three different queues. It seems like it would be much simpler and easier to just have one kind of object that stores all the information we need (full_xid, start_urec_ptr, dbid, request_size, next_retry_at, error_ocurred_at) and use that everywhere. You could object that this would increase the storage space requirement, but it wouldn't be enough to make any real difference, and it probably would be well worth it for the avoidance of complexity.

2. However, I don't think we should have a separate request object for each queue anyway. We should insert pointers to the same objects in all the relevant queues (either size + XID, or else error). So instead of having three sets of objects, one for each queue, we'd just have one set of objects and point to them with as many as two pointers. We'd therefore need LESS memory than we're using today, because we wouldn't have separate arrays for XID, size, and error queue elements.

In fact, it seems to me that we shouldn't have any such thing as "queue entries" at all. The queues should just be pointing to RollbackHashEntry *, and we should add all the fields there that are present in any of the "queue entry" structures. This would use less memory still.

I also think we should be using simplehash rather than dynahash. I'm not sure that I would really say that simplehash is "simple," but it does have a nicer API and simpler memory management. There's just one big old block of memory, and there's no incremental allocation. That keeps things simple for the code that wants to go through the queues and remove dangling pointers.

I think that the way this should work is that each RollbackHashEntry * should contain a field "bool active." Then:

1. When we pull an item out of one of the binary heaps, we check the active flag. If it's clear, we ignore the entry and pull the next item. If it's set, we clear the flag and process the item, so that if it's subsequently pulled from the other queue it will be ignored.

2. If a binary heap is full when we need to insert into it, we can iterate over all of the elements and throw away any that are !active. They've already been dequeued and processed from some other queue, so they're not "really" in this queue any more, even though we haven't gone to the trouble of actually kicking them out yet.
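Sketched in code, the dequeue side of (1) might look about like this; PullRollbackRequest is a made-up name, and I'm assuming the heaps store RollbackHashEntry pointers as Datums:

static RollbackHashEntry *
PullRollbackRequest(binaryheap *queue)
{
    while (!binaryheap_empty(queue))
    {
        RollbackHashEntry *rh = (RollbackHashEntry *)
            DatumGetPointer(binaryheap_remove_first(queue));

        if (rh->active)
        {
            /* claim it, so a copy pulled from another queue is ignored */
            rh->active = false;
            return rh;
        }
        /* !active: already processed via some other queue; skip it */
    }
    return NULL;                /* queue exhausted */
}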
On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork(), which passes the depth argument down to GetRollbackHashKeyFromQueue, which then does binaryheap_nth() on the relevant queue. Note that this function is another place that ends up duplicating code because of the questionable decision to have separate types of queue entries for each different queue; otherwise, it could probably just take the binary heap into which it's peeking as an argument instead of having three different cases. But that's not the main point here.

The main point is that it calls a function for whichever type of queue we've got and gets some kind of queue entry using binaryheap_nth(). But binaryheap_nth(whatever, 2) does not give you the third-smallest element in the binary heap. It gives you the third entry in the array, which may or may not have the heap property, but even if it does, the third element could be huge. Consider this binary heap, written out in array order:

0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006

This satisfies the binary heap property, because the element at position n is always smaller than the elements at positions 2n+1 and 2n+2 (assuming 0-based indexing). But if you want to look at the smallest three elements in the heap, you can't just look at indexes 0..2. The second-smallest element must be at index 1 or 2, but it could be either place. The third-smallest element could be the other of 1 and 2, or it could be either child of the smaller one, so there are three places it might be. In general, a binary heap is not a good data structure for finding the smallest N elements of a collection unless N is 1, and what's going to happen with what you've got here is that we'll sometimes prioritize an item that would not have been pulled from the queue for a long time over one that would otherwise have been processed much sooner. I'm not sure that's a show-stopper, but it doesn't seem good, and the current patch doesn't seem to have any comments justifying it, or at least not in the places nearby to where this is actually happening.
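Incidentally, a bounded peek that actually respects priority order has to dequeue entries and then put back the ones it skipped, something like the sketch below (the names and the Datum-pointer assumption are mine). The pile of heap operations it does per call is itself a hint that this is the wrong tool for the job:

static RollbackHashEntry *
DequeueRequestForDatabase(binaryheap *queue, Oid dbid, int max_peek)
{
    Datum       skipped[UNDO_PEEK_DEPTH];   /* assumes max_peek <= UNDO_PEEK_DEPTH */
    int         nskipped = 0;
    RollbackHashEntry *result = NULL;

    while (!binaryheap_empty(queue) && nskipped < max_peek)
    {
        Datum   d = binaryheap_remove_first(queue);
        RollbackHashEntry *rh = (RollbackHashEntry *) DatumGetPointer(d);

        if (rh->dbid == dbid)
        {
            result = rh;
            break;
        }
        skipped[nskipped++] = d;
    }

    /* put back everything we passed over */
    while (nskipped > 0)
        binaryheap_add(queue, skipped[--nskipped]);

    return result;
}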
I think there are more problems here, too. Let's suppose that we fixed the problem described above somehow, or decided that it won't actually make a big difference and just ignored it. Suppose further that we have N active databases which are generating undo requests. Luckily, we happen to also have N undo workers available, and let's suppose that as of a certain moment in time there is exactly one worker in each database. Think about what will happen when one of those workers goes to look for the next undo request. It's likely that the first request in the queue will be for some other database, so it's probably going to have to peek ahead to find a request for the database to which it's connected -- let's just assume that there is one. How far will it have to peek ahead? Well, if the requests are uniformly distributed across databases, each request has a 1-in-N chance of being the right one.

I wrote a little Perl program to estimate the probability that we won't find the next request for our database within 10 requests, as a function of the number of databases:

1 databases => failure chance with 10 lookahead is 0.00%
2 databases => failure chance with 10 lookahead is 0.10%
3 databases => failure chance with 10 lookahead is 1.74%
4 databases => failure chance with 10 lookahead is 5.66%
5 databases => failure chance with 10 lookahead is 10.74%
6 databases => failure chance with 10 lookahead is 16.18%
7 databases => failure chance with 10 lookahead is 21.45%
8 databases => failure chance with 10 lookahead is 26.31%
9 databases => failure chance with 10 lookahead is 30.79%
10 databases => failure chance with 10 lookahead is 34.91%
11 databases => failure chance with 10 lookahead is 38.58%
12 databases => failure chance with 10 lookahead is 41.85%
13 databases => failure chance with 10 lookahead is 44.91%
14 databases => failure chance with 10 lookahead is 47.69%
15 databases => failure chance with 10 lookahead is 50.12%
16 databases => failure chance with 10 lookahead is 52.34%
17 databases => failure chance with 10 lookahead is 54.53%
18 databases => failure chance with 10 lookahead is 56.39%
19 databases => failure chance with 10 lookahead is 58.18%
20 databases => failure chance with 10 lookahead is 59.86%

Assuming my script (attached) doesn't have a bug, with only 8 databases, there's better than a 1-in-4 chance that we'll fail to find the next entry for the current database within the lookahead window. That's bad, because then the worker will be sitting around waiting when it should be doing stuff. Maybe it will even exit, even though there's work to be done, and even though all the other databases have their own workers already.

You can construct way worse examples than this one, too: imagine that there are two databases, each with a worker, and one has 99% of the requests and the other one has 1% of the requests. It's really unlikely that there's going to be an entry for the second database within the lookahead window. And note that increasing the window doesn't really help either: you just need more databases than the size of the lookahead window, or even almost as many as the lookahead window, and things are going to stop working properly.

On the other hand, suppose that you have 10 databases and one undo worker. One database is pretty active and generates a continuous stream of undo requests at exactly the same speed we can process them. The others all have 1 pending undo request. Now, what's going to happen is that you'll always find the undo request for the current database within the lookahead window. So, you'll never exit. But that means the undo requests in the other 9 databases will just sit there for all eternity, because there's no other worker to process them. On the other hand, if you had 11 databases, there's a good chance it would work fine, because the new request for the active database would likely be outside the lookahead window, so you'd find no work to do and exit, allowing a worker to be started up in some other database. It would in turn exit, and so on, and you'd clear the backlog for the other databases, at least for a while, until you picked the active database again.

Actually, I haven't looked at the whole patch set, so perhaps there is some solution to this problem contemplated somewhere, but I consider this argument to be pretty good evidence that a fixed lookahead distance is probably the wrong thing.
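For what it's worth, the simulated numbers above agree with the closed form: under the same uniformity assumption, the chance that 10 consecutive requests all miss our database is ((N-1)/N)^10. These few lines of C reproduce the table to within sampling error:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    for (int n = 1; n <= 20; n++)
    {
        /* probability that 10 uniform draws all miss our database */
        double  miss = pow((n - 1) / (double) n, 10);

        printf("%d databases => failure chance with 10 lookahead is %.2f%%\n",
               n, 100.0 * miss);
    }
    return 0;
}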
The right things to do about these problems probably need some discussion, but here's the best idea I have off-hand: instead of just having 3 binary heaps (size, XID, error), have N+1 "undo work trackers", each of which contains 3 binary heaps (size, XID, error). Undo work tracker #0 contains all requests that are not assigned to any other undo work tracker. Each of undo work trackers #1..N contains all the requests for one particular database, but they all start out unused. Before launching an undo worker for a particular database, the launcher must check whether it has an undo work tracker allocated to that database. If not, it allocates one and moves all the work for that database out of tracker #0 and into the newly-allocated tracker. If there are none free, it must first deallocate an undo work tracker, moving any remaining work for that tracker back into tracker #0.

With this approach, there's no need for lookahead, because every worker is always pulling from a queue that is database-specific, so the next entry is always guaranteed to be relevant. And you choose N to be equal to the number of workers, so that even if every worker is in a separate database there will be enough trackers for all workers to have one, plus tracker #0 for whatever's left.
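In data-structure terms, I'm imagining something like this -- every name here is hypothetical, and I'm assuming the usual postgres.h and lib/binaryheap.h declarations:

typedef struct UndoWorkTracker
{
    Oid         dbid;           /* tracker #0 and unused trackers hold
                                 * InvalidOid */
    binaryheap *size_queue;
    binaryheap *xid_queue;
    binaryheap *error_queue;
} UndoWorkTracker;

/*
 * In shared memory: trackers[0] holds every request not assigned to a
 * database-specific tracker; trackers[1..N] are claimed by the launcher,
 * one per database, before an undo worker is started for that database.
 */
typedef struct UndoWorkTrackerSet
{
    int             ntrackers;  /* N + 1, with N = number of workers */
    UndoWorkTracker trackers[FLEXIBLE_ARRAY_MEMBER];
} UndoWorkTrackerSet;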
There still remains the problem of figuring out when a worker should terminate to allow for new workers to be launched, which is a fairly complex problem that deserves its own discussion, but I think this design helps. At the very least, you can see whether tracker #0 is empty. If it is, you might still want to rebalance workers between databases, but you don't really need to worry about databases getting starved altogether, because you know that you can run a worker for every database that has any pending undo. If tracker #0 is non-empty but you have unused workers, you can just allocate trackers for the databases in tracker #0 and move stuff over there to be processed. If tracker #0 is non-empty and all workers are allocated, you are going to need to ask one of them to exit at some point, to avoid starvation. I don't know exactly what the algorithm for that should be; I do have some ideas. I'm not going to include them in this email, though, because this email is already long and I don't have time to make it longer right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company