Thread: Freezing without write I/O
Since we're bashing around ideas about freezing, let me write down the idea I've been pondering and discussing with various people for years. I don't think I invented this myself; apologies to whoever did for not giving credit.

The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).

The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen.

That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples.

Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O.

This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice not to be pressed against the wall with "run vacuum freeze now, or the system will shut down".

(*) "Adding an epoch" is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have XIDs from different epochs on the page, which would be a problem. In reality, you would store one 64-bit XID value in the page header, and use that as the "reference point" for all the 32-bit XIDs on the tuples. See the existing convert_txid() function for how that works. Another method is to store the 32-bit XID values in the tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent.

- Heikki
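To make the footnote's "reference point" arithmetic concrete, here is a minimal standalone C sketch - the names are invented and this is not PostgreSQL code - of how a 64-bit reference XID kept in the page header could be combined with a 32-bit tuple XID to recover a full 64-bit XID, along the same lines as convert_txid(). It assumes the invariant described above (every XID on the page is within 2^31 of the reference point) and ignores the special XIDs.

    #include <stdint.h>

    typedef uint32_t TransactionId;     /* 32-bit XID, as stored in tuple headers today */

    /*
     * Recover the full 64-bit XID of a tuple, given a hypothetical 64-bit
     * "reference XID" from the page header.  XIDs that appear to be in the
     * future of the reference belong to the previous epoch; XIDs that appear
     * far in the past belong to the next one.
     */
    static uint64_t
    page_full_xid(uint64_t page_reference_xid, TransactionId xid)
    {
        uint64_t        epoch = page_reference_xid >> 32;
        TransactionId   ref = (TransactionId) page_reference_xid;

        if (xid > ref && xid - ref > UINT32_C(0x80000000))
            epoch--;            /* logically older than the reference */
        else if (xid < ref && ref - xid > UINT32_C(0x80000000))
            epoch++;            /* logically newer than the reference */

        return (epoch << 32) | xid;
    }

Under the proposal, a page would be frozen before its XID spread could exceed 2^31, so at most one such adjustment would ever be needed.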
Heikki,

This sounds a lot like my idea for 9.3, which didn't go anywhere. You've worked out the issues I couldn't, I think.

> Another method is to store the 32-bit xid values in tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent.

Yeah, -1 on the latter, not least because it would require a 100% rewrite of the tables in order to upgrade.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).

Check.

> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen.

Check.

> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples.

Check.

> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O.

It would cause some extra WAL activity, but it wouldn't dirty the page an extra time.

> This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice not to be pressed against the wall with "run vacuum freeze now, or the system will shut down".

Interesting. That seems like a major advantage.

> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have XIDs from different epochs on the page, which would be a problem. In reality, you would store one 64-bit XID value in the page header, and use that as the "reference point" for all the 32-bit XIDs on the tuples. See the existing convert_txid() function for how that works. Another method is to store the 32-bit XID values in the tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent.

As I see it, the main downsides of this approach are:

(1) It breaks binary compatibility (unless you do something to provide for it, like put the epoch in the special space).

(2) It consumes 8 bytes per page. I think it would be possible to get this down to, say, 5 bytes per page pretty easily; we'd simply decide that the low-order 3 bytes of the reference XID must always be 0. Possibly you could even do with 4 bytes, or 4 bytes plus some number of extra bits.

(3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested.

The upsides of this approach as compared with what Andres and I are proposing are:

(1) It provides a stepping stone towards allowing indefinite expansion of CLOG, which is quite appealing as an alternative to a hard shut-down.

(2) It doesn't place any particular requirements on PD_ALL_VISIBLE. I don't personally find this much of a benefit as I want to keep PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.

Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
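A small standalone C sketch of the 5-byte packing mentioned in downside (2) above, purely for illustration (the names are invented, and nothing like this exists in PostgreSQL): if the reference XID is always chosen as a multiple of 2^24, only its top 40 bits need to be stored.

    #include <stdint.h>

    /* Pack a 64-bit reference XID whose low 24 bits are zero into 5 bytes. */
    static void
    pack_reference_xid(uint64_t ref_xid, uint8_t out[5])
    {
        uint64_t hi = ref_xid >> 24;        /* the 40 significant bits */

        for (int i = 0; i < 5; i++)
            out[i] = (uint8_t) (hi >> (8 * i));
    }

    /* Recover the 64-bit reference XID from the 5 stored bytes. */
    static uint64_t
    unpack_reference_xid(const uint8_t in[5])
    {
        uint64_t hi = 0;

        for (int i = 0; i < 5; i++)
            hi |= (uint64_t) in[i] << (8 * i);
        return hi << 24;
    }

Nothing is lost by not storing the low bytes, because the reference itself would be constrained to have them zero, as the message suggests.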
On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).
>
> (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested.

Why is this scan required?

Also, what happens if you delete a tuple on a page when another tuple on the same page, with age > 2^32, is still in an open transaction?

merlin
On 30.05.2013 21:46, Merlin Moncure wrote:
> On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).
>>
>> (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested.
>
> Why is this scan required?

To find all the dead tuples and remove them, and advance relfrozenxid. That in turn is required so that you can truncate the clog. This scheme relies on assuming that everything older than relfrozenxid committed, so if there are any aborted XIDs present in the table, you can't advance relfrozenxid past them.

Come to think of it, if there are no aborted XIDs in a range of XIDs, only commits, then you could just advance relfrozenxid past that range and truncate away the clog, without scanning the table. But that's quite a special case - generally there would be at least a few aborted XIDs - so it's probably not worth adding any special code to take advantage of that.

> Also, what happens if you delete a tuple on a page when another tuple on the same page, with age > 2^32, is still in an open transaction?

Can't let that happen. Same as today.

- Heikki
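To illustrate the "implicitly treat anything older than relfrozenxid as frozen" rule that this reply relies on, here is a tiny standalone C sketch (invented names; only the modulo-2^32 comparison mirrors what TransactionIdPrecedes() does in PostgreSQL). It also shows why aborted XIDs block advancing relfrozenxid: anything behind it is assumed committed.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* wraparound-aware "xid1 is older than xid2" */
    static bool
    xid_precedes(TransactionId xid1, TransactionId xid2)
    {
        return (int32_t) (xid1 - xid2) < 0;
    }

    /*
     * Hypothetical shortcut in a visibility check: an XID older than
     * relfrozenxid is treated as frozen, i.e. known committed and visible
     * to everyone, so the clog need not be consulted for it at all.
     */
    static bool
    xid_implicitly_frozen(TransactionId xid, TransactionId relfrozenxid)
    {
        return xid_precedes(xid, relfrozenxid);
    }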
On 2013-05-30 14:39:46 -0400, Robert Haas wrote: > > Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to > > replace the XIDs with FrozenXid whenever the difference between the smallest > > and largest XID on a page exceeds 2^31. But that would only happen when > > you're updating the page, in which case the page is dirtied anyway, so it > > wouldn't cause any extra I/O. > > It would cause some extra WAL activity, but it wouldn't dirty the page > an extra time. You probably could do it similarly to how we currently do XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The insert/update/delete record will already contain a FPI if necessary, so that should be safe. > > This would also be the first step in allowing the clog to grow larger than 2 > > billion transactions, eliminating the need for anti-wraparound freezing > > altogether. You'd still want to truncate the clog eventually, but it would > > be nice to not be pressed against the wall with "run vacuum freeze now, or > > the system will shut down". > Interesting. That seems like a major advantage. Hm. Why? If freezing gets notably cheaper I don't really see much need for keeping that much clog around? If we still run into anti-wraparound areas, there has to be some major operational problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote: > This would also be the first step in allowing the clog to grow > larger than 2 billion transactions, eliminating the need for > anti-wraparound freezing altogether. You'd still want to truncate > the clog eventually, but it would be nice to not be pressed against > the wall with "run vacuum freeze now, or the system will shut down". Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog truncation. Does this help that? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, May 30, 2013 at 3:22 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-05-30 14:39:46 -0400, Robert Haas wrote: >> > Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to >> > replace the XIDs with FrozenXid whenever the difference between the smallest >> > and largest XID on a page exceeds 2^31. But that would only happen when >> > you're updating the page, in which case the page is dirtied anyway, so it >> > wouldn't cause any extra I/O. >> >> It would cause some extra WAL activity, but it wouldn't dirty the page >> an extra time. > > You probably could do it similarly to how we currently do > XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The > insert/update/delete record will already contain a FPI if necessary, so > that should be safe. Ah, good point. >> > This would also be the first step in allowing the clog to grow larger than 2 >> > billion transactions, eliminating the need for anti-wraparound freezing >> > altogether. You'd still want to truncate the clog eventually, but it would >> > be nice to not be pressed against the wall with "run vacuum freeze now, or >> > the system will shut down". > >> Interesting. That seems like a major advantage. > > Hm. Why? If freezing gets notably cheaper I don't really see much need > for keeping that much clog around? If we still run into anti-wraparound > areas, there has to be some major operational problem. That is true, but we have a decent number of customers who do in fact have such problems. I think that's only going to get more common. As hardware gets faster and PostgreSQL improves, people are going to process more and more transactions in shorter and shorter periods of time. Heikki's benchmark results for the XLOG scaling patch show rates of >80,000 tps. Even at a more modest 10,000 tps, with default settings, you'll do anti-wraparound vacuums of the entire cluster about every 8 hours. That's not fun. Being able to do such scans only of the not-all-visible pages would be a huge step forward, of course. But not having to do them on any particular deadline would be a whole lot better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Random thought: Could you compute the reference XID based on the page LSN? That would eliminate the storage overhead.

After mulling this over a bit, I think this is definitely possible.

We begin a new "half-epoch" every 2 billion transactions. We remember the LSN at which the current half-epoch began and the LSN at which the previous half-epoch began. When a new half-epoch begins, the first backend that wants to stamp a tuple with an XID from the new half-epoch must first emit a "new half-epoch" WAL record, which becomes the starting LSN for the new half-epoch.

We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit.

The advantage of this is that we avoid an epidemic of freezing right after a half-epoch change. Immediately after a half-epoch change, many pages will mix tuples from the current and previous half-epoch - but relatively few pages will have tuples from the current half-epoch and a half-epoch more than one in the past.

As things stand today, we really only need to remember the last two half-epoch boundaries; they could be stored, for example, in the control file. But if we someday generalize CLOG to allow indefinite retention as you suggest, we could instead remember all half-epoch boundaries that have ever occurred; just maintain a file someplace with 8 bytes of data for every 2 billion XIDs consumed over the lifetime of the cluster. In fact, we might want to do it that way anyhow, just to keep our options open, and perhaps for forensics.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
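To make the bit-handling rule easier to follow, here is a self-contained C sketch of the decision described above. Everything in it is invented for illustration - PD_RECENTLY_FROZEN does not exist, and real code would work with PostgreSQL's Page and XLogRecPtr types - it only shows the freeze-or-toggle logic applied before a page is stamped with a new LSN.

    #include <stdbool.h>
    #include <stdint.h>

    #define PD_RECENTLY_FROZEN  0x0010      /* hypothetical pd_flags bit */

    typedef struct PageStub
    {
        uint64_t    lsn;            /* stand-in for the page LSN */
        uint16_t    flags;          /* stand-in for pd_flags */
    } PageStub;

    /*
     * boundaries[i] is the LSN at which half-epoch i began; n is how many
     * boundaries we remember (two would suffice today).  Returns the
     * half-epoch a given LSN falls into.
     */
    static int
    half_epoch_of(uint64_t lsn, const uint64_t *boundaries, int n)
    {
        int     i;

        for (i = n - 1; i > 0; i--)
        {
            if (lsn >= boundaries[i])
                break;
        }
        return i;
    }

    static void
    freeze_all_tuples(PageStub *page)
    {
        (void) page;                /* placeholder for actual freezing */
    }

    /* Called before the page is modified and stamped with new_lsn. */
    static void
    maybe_freeze(PageStub *page, uint64_t new_lsn,
                 const uint64_t *boundaries, int n)
    {
        int     crossed = half_epoch_of(new_lsn, boundaries, n) -
                          half_epoch_of(page->lsn, boundaries, n);

        if (crossed == 0)
            return;                             /* same half-epoch */

        if (crossed == 1 && (page->flags & PD_RECENTLY_FROZEN))
        {
            /* nothing on the page predates the previous half-epoch */
            page->flags &= ~PD_RECENTLY_FROZEN;
        }
        else
        {
            /* crossed several boundaries, or the bit wasn't set: freeze */
            freeze_all_tuples(page);
            page->flags |= PD_RECENTLY_FROZEN;
        }
    }

The point of the bit is visible here: a page that is touched at least once per half-epoch never needs a full freezing pass, only the cheap bit toggle.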
On 31.05.2013 00:06, Bruce Momjian wrote: > On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote: >> This would also be the first step in allowing the clog to grow >> larger than 2 billion transactions, eliminating the need for >> anti-wraparound freezing altogether. You'd still want to truncate >> the clog eventually, but it would be nice to not be pressed against >> the wall with "run vacuum freeze now, or the system will shut down". > > Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog > truncation. Does this help that? No. If you want to keep autovacuum_freeze_max_age set at less than 2 billion, you don't need support for more than 2 billion transactions. But for those who would like to set autovacuum_freeze_max_age higher than 2B, it would be nice to allow it. Actually, even with autovacuum_freeze_max_age = 200 M, it would be nice to not have the hard stop at 2 billion, in case autovacuum falls behind really badly. With autovacuum_freeze_max_age = 200M, there's a lot of safety margin, but with 1000M or so, not so much. - Heikki
On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote: > > Hm. Why? If freezing gets notably cheaper I don't really see much need > > for keeping that much clog around? If we still run into anti-wraparound > > areas, there has to be some major operational problem. > > That is true, but we have a decent number of customers who do in fact > have such problems. I think that's only going to get more common. As > hardware gets faster and PostgreSQL improves, people are going to > process more and more transactions in shorter and shorter periods of > time. Heikki's benchmark results for the XLOG scaling patch show > rates of >80,000 tps. Even at a more modest 10,000 tps, with default > settings, you'll do anti-wraparound vacuums of the entire cluster > about every 8 hours. That's not fun. Are you assuming these are all write transactions, hence consuming xids? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, May 31, 2013 at 1:26 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote: >> > Hm. Why? If freezing gets notably cheaper I don't really see much need >> > for keeping that much clog around? If we still run into anti-wraparound >> > areas, there has to be some major operational problem. >> >> That is true, but we have a decent number of customers who do in fact >> have such problems. I think that's only going to get more common. As >> hardware gets faster and PostgreSQL improves, people are going to >> process more and more transactions in shorter and shorter periods of >> time. Heikki's benchmark results for the XLOG scaling patch show >> rates of >80,000 tps. Even at a more modest 10,000 tps, with default >> settings, you'll do anti-wraparound vacuums of the entire cluster >> about every 8 hours. That's not fun. > > Are you assuming these are all write transactions, hence consuming xids? Well, there might be read-only transactions as well, but the point is about how many write transactions there can be. 10,000 tps or more is not out of the question even today, and progressively higher numbers are only going to become more and more common. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30 May 2013 14:33, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Since we're bashing around ideas about freezing, let me write down the idea I've been pondering and discussing with various people for years. I don't think I invented this myself; apologies to whoever did for not giving credit.
>
> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).
>
> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen.
>
> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples.

I have to say this is pretty spooky. I'd not read hackers all week, so I had no idea so many other people were thinking about freezing as well. This idea is damn near identical to what I've suggested. My suggestion came about because I was looking at getting rid of fields in the tuple header, which didn't come to much. The good news is that this is complete chance, so it must mean we're on the right track.

--Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 30 May 2013 19:39, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but that would waste a lot of space. The trick is to add a field to the page header indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header 32 bits wide (*).
>
> Check.
>
>> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we could implicitly treat anything older than relfrozenxid as frozen.
>
> Check.
>
>> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but doesn't need to dirty pages that contain no dead tuples.
>
> Check.

Yes, this is the critical point. Large insert-only tables don't need to be completely re-written twice.

>> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to replace the XIDs with FrozenXid whenever the difference between the smallest and largest XID on a page exceeds 2^31. But that would only happen when you're updating the page, in which case the page is dirtied anyway, so it wouldn't cause any extra I/O.
>
> It would cause some extra WAL activity, but it wouldn't dirty the page an extra time.
>
>> This would also be the first step in allowing the clog to grow larger than 2 billion transactions, eliminating the need for anti-wraparound freezing altogether. You'd still want to truncate the clog eventually, but it would be nice not to be pressed against the wall with "run vacuum freeze now, or the system will shut down".
>
> Interesting. That seems like a major advantage.
>
>> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental model. If you just add a 32-bit epoch field, then you cannot have XIDs from different epochs on the page, which would be a problem. In reality, you would store one 64-bit XID value in the page header, and use that as the "reference point" for all the 32-bit XIDs on the tuples. See the existing convert_txid() function for how that works. Another method is to store the 32-bit XID values in the tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent.
>
> As I see it, the main downsides of this approach are:
>
> (1) It breaks binary compatibility (unless you do something to provide for it, like put the epoch in the special space).
>
> (2) It consumes 8 bytes per page. I think it would be possible to get this down to, say, 5 bytes per page pretty easily; we'd simply decide that the low-order 3 bytes of the reference XID must always be 0. Possibly you could even do with 4 bytes, or 4 bytes plus some number of extra bits.

Yes, the idea of having a "base Xid" on every page is complicated and breaks compatibility. The same idea can work well if we do this via tuple headers.

> (3) You still need to periodically scan the entire relation, or else have a freeze map as Simon and Josh suggested.

I don't think that is needed with this approach. (The freeze map was Andres' idea, not mine. I just accepted it as what I thought was the only way forwards. Now I see other ways.)

> The upsides of this approach as compared with what Andres and I are proposing are:
>
> (1) It provides a stepping stone towards allowing indefinite expansion of CLOG, which is quite appealing as an alternative to a hard shut-down.

I would be against expansion of the CLOG beyond its current size. If we have removed all aborted rows and marked hints, then we don't need the CLOG values and can trim that down. I don't mind the hints; it's the freezing we don't need.

>> convert_txid() function for how that works. Another method is to store the 32-bit XID values in the tuple headers as offsets from the per-page 64-bit value, but then you'd always need to have the 64-bit value at hand when interpreting the XIDs, even if they're all recent.

You've touched here on the idea of putting the epoch in the tuple header, which is where what I posted comes together. We don't need anything at page level, we just need something on each tuple. Please can you look at my recent post on how to put this in the tuple header?

--Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 31.05.2013 06:02, Robert Haas wrote: > On Thu, May 30, 2013 at 2:39 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> Random thought: Could you compute the reference XID based on the page >> LSN? That would eliminate the storage overhead. > > After mulling this over a bit, I think this is definitely possible. > We begin a new "half-epoch" every 2 billion transactions. We remember > the LSN at which the current half-epoch began and the LSN at which the > previous half-epoch began. When a new half-epoch begins, the first > backend that wants to stamp a tuple with an XID from the new > half-epoch must first emit a "new half-epoch" WAL record, which > becomes the starting LSN for the new half-epoch. Clever! Pages in unlogged tables need some extra treatment, as they don't normally have a valid LSN, but that shouldn't be too hard. > We define a new page-level bit, something like PD_RECENTLY_FROZEN. > When this bit is set, it means there are no unfrozen tuples on the > page with XIDs that predate the current half-epoch. Whenever we know > this to be true, we set the bit. If the page LSN crosses more than > one half-epoch boundary at a time, we freeze the page and set the bit. > If the page LSN crosses exactly one half-epoch boundary, then (1) if > the bit is set, we clear it and (2) if the bit is not set, we freeze > the page and set the bit. Yep, I think that would work. Want to write the patch, or should I? ;-) - Heikki
On 1 June 2013 19:48, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 31.05.2013 06:02, Robert Haas wrote: >> >> On Thu, May 30, 2013 at 2:39 PM, Robert Haas<robertmhaas@gmail.com> >> wrote: >>> >>> Random thought: Could you compute the reference XID based on the page >>> LSN? That would eliminate the storage overhead. >> >> >> After mulling this over a bit, I think this is definitely possible. >> We begin a new "half-epoch" every 2 billion transactions. We remember >> the LSN at which the current half-epoch began and the LSN at which the >> previous half-epoch began. When a new half-epoch begins, the first >> backend that wants to stamp a tuple with an XID from the new >> half-epoch must first emit a "new half-epoch" WAL record, which >> becomes the starting LSN for the new half-epoch. > > > Clever! Pages in unlogged tables need some extra treatment, as they don't > normally have a valid LSN, but that shouldn't be too hard. I like the idea of using the LSN to indicate the epoch. It saves any other work we might consider, such as setting page or tuple level epochs. >> We define a new page-level bit, something like PD_RECENTLY_FROZEN. >> When this bit is set, it means there are no unfrozen tuples on the >> page with XIDs that predate the current half-epoch. Whenever we know >> this to be true, we set the bit. If the page LSN crosses more than >> one half-epoch boundary at a time, we freeze the page and set the bit. >> If the page LSN crosses exactly one half-epoch boundary, then (1) if >> the bit is set, we clear it and (2) if the bit is not set, we freeze >> the page and set the bit. > > > Yep, I think that would work. Want to write the patch, or should I? ;-) If we set a bit, surely we need to write the page. Isn't that what we were trying to avoid? Why set a bit at all? If we know the LSN of the page and we know the epoch boundaries, then we can work out when the page was last written to and infer that the page is "virtually frozen". As soon as we make a change to a "virtually frozen" page, we can actually freeze it and then make the change. But we still have the problem of knowing which pages have been frozen and which haven't. Can we clear up those points first? Or at least my understanding of them. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> We define a new page-level bit, something like PD_RECENTLY_FROZEN. >> When this bit is set, it means there are no unfrozen tuples on the >> page with XIDs that predate the current half-epoch. Whenever we know >> this to be true, we set the bit. If the page LSN crosses more than >> one half-epoch boundary at a time, we freeze the page and set the bit. >> If the page LSN crosses exactly one half-epoch boundary, then (1) if >> the bit is set, we clear it and (2) if the bit is not set, we freeze >> the page and set the bit. > > Yep, I think that would work. Want to write the patch, or should I? ;-) Have at it. I think the tricky part is going to be figuring out the synchronization around half-epoch boundaries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > If we set a bit, surely we need to write the page. Isn't that what we > were trying to avoid? No, the bit only gets set in situations when we were going to dirty the page for some other reason anyway. Specifically, if a page modification discovers that we've switched epochs (but just once) and the bit isn't already set, we can set it in lieu of scanning the entire page for tuples that need freezing. Under this proposal, pages that don't contain any dead tuples needn't be dirtied for freezing, ever. Smells like awesome. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1 June 2013 21:26, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> If we set a bit, surely we need to write the page. Isn't that what we >> were trying to avoid? > > No, the bit only gets set in situations when we were going to dirty > the page for some other reason anyway. Specifically, if a page > modification discovers that we've switched epochs (but just once) and > the bit isn't already set, we can set it in lieu of scanning the > entire page for tuples that need freezing. > > Under this proposal, pages that don't contain any dead tuples needn't > be dirtied for freezing, ever. Smells like awesome. Agreed, well done both. What I especially like about it is how little logic it will require, and no page format changes. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 31, 2013 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Even at a more modest 10,000 tps, with default > settings, you'll do anti-wraparound vacuums of the entire cluster > about every 8 hours. That's not fun. I've forgotten now. What happens if you have a long-lived transaction still alive from > 2B xid ago? -- greg
On 06.06.2013 15:16, Greg Stark wrote:
> On Fri, May 31, 2013 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Even at a more modest 10,000 tps, with default settings, you'll do anti-wraparound vacuums of the entire cluster about every 8 hours. That's not fun.
>
> I've forgotten now. What happens if you have a long-lived transaction still alive from > 2B xid ago?

That will keep OldestXmin from advancing. Which will keep vacuum from advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings about wrap-around, then stop new XIDs from being generated, and finally force a shutdown.

The forced shutdown will actually happen some time before going beyond 2 billion XIDs. So it is not possible to have a long-lived transaction, older than 2B XIDs, still live in the system. But let's imagine that you somehow bypass the safety mechanism:

After wraparound, old tuples will look like they are in the future, and will become invisible to new transactions. That happens even if there are no old transactions around. I'm not sure what exactly will happen if there is still a transaction alive with an XID and/or snapshots older than 2^31 XIDs. New tuples that are not supposed to be visible to the old snapshot would suddenly become visible, I guess.

- Heikki
On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > That will keep OldestXmin from advancing. Which will keep vacuum from > advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings > about wrap-around, then stops new XIDs from being generated, and finally a > forced shutdown. > > The forced shutdown will actually happen some time before going beyond 2 > billion XIDs. So it is not possible to have a long-lived transaction, older > than 2 B XIDs, still live in the system. But let's imagine that you somehow > bypass the safety mechanism: Ah, so if you do the epoch in the page header thing or Robert's LSN trick that I didn't follow then you'll need a new safety check against this. Since relfrozenxid/datfrozenxid will no longer be necessary. -- greg
On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark <stark@mit.edu> wrote: > On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> That will keep OldestXmin from advancing. Which will keep vacuum from >> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings >> about wrap-around, then stops new XIDs from being generated, and finally a >> forced shutdown. >> >> The forced shutdown will actually happen some time before going beyond 2 >> billion XIDs. So it is not possible to have a long-lived transaction, older >> than 2 B XIDs, still live in the system. But let's imagine that you somehow >> bypass the safety mechanism: > > Ah, so if you do the epoch in the page header thing or Robert's LSN > trick that I didn't follow then you'll need a new safety check against > this. Since relfrozenxid/datfrozenxid will no longer be necessary. Nothing proposed here gets rid of either of those, that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07.06.2013 20:54, Robert Haas wrote: > On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark<stark@mit.edu> wrote: >> On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> That will keep OldestXmin from advancing. Which will keep vacuum from >>> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings >>> about wrap-around, then stops new XIDs from being generated, and finally a >>> forced shutdown. >>> >>> The forced shutdown will actually happen some time before going beyond 2 >>> billion XIDs. So it is not possible to have a long-lived transaction, older >>> than 2 B XIDs, still live in the system. But let's imagine that you somehow >>> bypass the safety mechanism: >> >> Ah, so if you do the epoch in the page header thing or Robert's LSN >> trick that I didn't follow then you'll need a new safety check against >> this. Since relfrozenxid/datfrozenxid will no longer be necessary. > > Nothing proposed here gets rid of either of those, that I can see. Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no longer means that all XIDs older than frozenxid are replaced with FrozenXid. Instead, it will mean that all XIDs older than frozenxid are committed, ie. all dead tuples older than that have been vacuumed away. - Heikki
On 7 June 2013 19:08, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 07.06.2013 20:54, Robert Haas wrote: >> >> On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark<stark@mit.edu> wrote: >>> >>> On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas >>> <hlinnakangas@vmware.com> wrote: >>>> >>>> That will keep OldestXmin from advancing. Which will keep vacuum from >>>> advancing relfrozenxid/datfrozenxid. Which will first trigger the >>>> warnings >>>> about wrap-around, then stops new XIDs from being generated, and finally >>>> a >>>> forced shutdown. >>>> >>>> The forced shutdown will actually happen some time before going beyond 2 >>>> billion XIDs. So it is not possible to have a long-lived transaction, >>>> older >>>> than 2 B XIDs, still live in the system. But let's imagine that you >>>> somehow >>>> bypass the safety mechanism: >>> >>> >>> Ah, so if you do the epoch in the page header thing or Robert's LSN >>> trick that I didn't follow then you'll need a new safety check against >>> this. Since relfrozenxid/datfrozenxid will no longer be necessary. >> >> >> Nothing proposed here gets rid of either of those, that I can see. > > > Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no > longer means that all XIDs older than frozenxid are replaced with FrozenXid. > Instead, it will mean that all XIDs older than frozenxid are committed, ie. > all dead tuples older than that have been vacuumed away. Now that I consider Greg's line of thought, the idea we focused on here was about avoiding freezing. But Greg makes me think that we may also wish to look at allowing queries to run longer than one epoch as well, if the epoch wrap time is likely to come down substantially. To do that I think we'll need to hold epoch for relfrozenxid as well, amongst other things. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 07.06.2013 21:33, Simon Riggs wrote:
> Now that I consider Greg's line of thought, the idea we focused on here was about avoiding freezing. But Greg makes me think that we may also wish to look at allowing queries to run longer than one epoch as well, if the epoch wrap time is likely to come down substantially.
>
> To do that I think we'll need to hold epoch for relfrozenxid as well, amongst other things.

The biggest problem I see with that is that if a snapshot can be older than 2 billion XIDs, it must be possible to store XIDs on the same page that are more than 2 billion XIDs apart. All the discussed schemes where we store the epoch at the page level, either explicitly or derived from the LSN, rely on the fact that it's not currently necessary to do that. Currently, when one XID on a page is older than 2 billion XIDs, that old XID can always be replaced with FrozenXid, because there cannot be a snapshot old enough to not see it.

I agree that it would be nice if you could find a way around that. You had a suggestion on making room in the tuple header for the epoch. I'm not sure I like that particular proposal, but we would need something like that. If we settle for snapshots that can be at most, say, 512 billion transactions old, instead of 2 billion, then we would only need one byte to store an epoch "offset" in the tuple header. I.e. deduce the epoch of tuples on the page from the LSN on the page header, but allow individual tuples to specify an offset from that deduced epoch.

In practice, I think we're still quite far from people running into that 2 billion XID limit on snapshot age. But maybe in a few years, after we've solved all the more pressing vacuum and wraparound issues that people currently run into before reaching that stage...

- Heikki
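A small standalone C sketch of the one-byte "epoch offset" idea, with all names invented (PostgreSQL tuple headers have no such field): the page's LSN implies a base epoch, and each tuple only records how many epochs older than that base its XID is.

    #include <stdint.h>

    typedef struct TupleStub
    {
        uint32_t    xid;            /* 32-bit xmin, as today */
        uint8_t     epoch_offset;   /* hypothetical byte: epochs behind the page's base */
    } TupleStub;

    /*
     * The page's base epoch would be deduced from its LSN (e.g. via the
     * boundary LSNs kept in the control file); the tuple only says how far
     * behind that base it is, allowing snapshots hundreds of epochs old.
     */
    static uint64_t
    tuple_full_xid(uint64_t page_base_epoch, const TupleStub *tup)
    {
        return ((page_base_epoch - tup->epoch_offset) << 32) | tup->xid;
    }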
On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 07.06.2013 21:33, Simon Riggs wrote: >> >> Now that I consider Greg's line of thought, the idea we focused on >> here was about avoiding freezing. But Greg makes me think that we may >> also wish to look at allowing queries to run longer than one epoch as >> well, if the epoch wrap time is likely to come down substantially. >> >> To do that I think we'll need to hold epoch for relfrozenxid as well, >> amongst other things. > The biggest problem I see with that is that if a snapshot can be older than > 2 billion XIDs, it must be possible to store XIDs on the same page that are > more than 2 billion XIDs apart. All the discussed schemes where we store the > epoch at the page level, either explicitly or derived from the LSN, rely on > the fact that it's not currently necessary to do that. Currently, when one > XID on a page is older than 2 billion XIDs, that old XID can always be > replaced with FrozenXid, because there cannot be a snapshot old enough to > not see it. It does seem that there are two problems: avoiding freezing AND long running queries The long running query problem hasn't ever been looked at, it seems, until here and now. > I agree that it would be nice if you could find a way around that. You had a > suggestion on making room on the tuple header for the epoch. I'm not sure I > like that particular proposal, but we would need something like that. If we > settle for snapshots that can be at most, say, 512 billion transactions old, > instead of 2 billion, then we would only need one byte to store an epoch > "offset" in the tuple header. Ie. deduce the epoch of tuples on the page > from the LSN on the page header, but allow individual tuples to specify an > offset from that deduced epoch. I like the modification you propose. And I like it even better because it uses just 1 byte, which is even more easily squeezed into the existing tuple header, whether we go with my proposed squeezing route or not. > In practice, I think we're still quite far from people running into that 2 > billion XID limit on snapshot age. But maybe in a few years, after we've > solved all the more pressing vacuum and wraparound issues that people > currently run into before reaching that stage... Your WALInsert lock patch will fix that. ;-) --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > The long running query problem hasn't ever been looked at, it seems, > until here and now. For what it's worth (and that may not be much), I think most people will die a horrible death due to bloat after holding a transaction open for a tiny fraction of 2B XIDs. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > On 07.06.2013 21:33, Simon Riggs wrote:
> >> Now that I consider Greg's line of thought, the idea we focused on here was about avoiding freezing. But Greg makes me think that we may also wish to look at allowing queries to run longer than one epoch as well, if the epoch wrap time is likely to come down substantially.
> >>
> >> To do that I think we'll need to hold epoch for relfrozenxid as well, amongst other things.
>
> > The biggest problem I see with that is that if a snapshot can be older than 2 billion XIDs, it must be possible to store XIDs on the same page that are more than 2 billion XIDs apart. All the discussed schemes where we store the epoch at the page level, either explicitly or derived from the LSN, rely on the fact that it's not currently necessary to do that. Currently, when one XID on a page is older than 2 billion XIDs, that old XID can always be replaced with FrozenXid, because there cannot be a snapshot old enough to not see it.
>
> It does seem that there are two problems: avoiding freezing AND long running queries
>
> The long running query problem hasn't ever been looked at, it seems, until here and now.

I'd say that's because it's prohibitive to run such long transactions anyway since it causes too much unremovable bloat. 2bio transactions really is quite a bit, I don't think it's a relevant restriction. Yet.

Let's discuss it if we have solved the other problems ;)

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 7 June 2013 20:16, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-07 20:10:55 +0100, Simon Riggs wrote: >> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> > On 07.06.2013 21:33, Simon Riggs wrote: >> >> >> >> Now that I consider Greg's line of thought, the idea we focused on >> >> here was about avoiding freezing. But Greg makes me think that we may >> >> also wish to look at allowing queries to run longer than one epoch as >> >> well, if the epoch wrap time is likely to come down substantially. >> >> >> >> To do that I think we'll need to hold epoch for relfrozenxid as well, >> >> amongst other things. >> >> > The biggest problem I see with that is that if a snapshot can be older than >> > 2 billion XIDs, it must be possible to store XIDs on the same page that are >> > more than 2 billion XIDs apart. All the discussed schemes where we store the >> > epoch at the page level, either explicitly or derived from the LSN, rely on >> > the fact that it's not currently necessary to do that. Currently, when one >> > XID on a page is older than 2 billion XIDs, that old XID can always be >> > replaced with FrozenXid, because there cannot be a snapshot old enough to >> > not see it. >> >> It does seem that there are two problems: avoiding freezing AND long >> running queries >> >> The long running query problem hasn't ever been looked at, it seems, >> until here and now. > > I'd say that's because it's prohibitive to run so long transactions > anyway since it causes too much unremovable bloat. 2bio transactions > really is a quite a bit, I don't think it's a relevant restriction. Yet. > > Let's discuss it if we have solved the other problems ;) Let me say that I think that problem is solvable also. At the moment we allow all visible tuple versions to be linked together, so that the last visible and latest update are linked by a chain. If we break that assumption and say that we will never follow an update chain from a snapshot in the distant past, then we can remove intermediate dead rows. We currently regard those as recently dead, but that just requires some extra thought. We still keep all *visible* tuple versions, we just don't bother to keep all the intermediate ones as well. Perhaps another day, but one day. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Thread: Avoiding bloat in the presence of a long-running transaction (Re: Freezing without write I/O)
On 07.06.2013 22:15, Robert Haas wrote:
> On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The long running query problem hasn't ever been looked at, it seems, until here and now.
>
> For what it's worth (and that may not be much), I think most people will die a horrible death due to bloat after holding a transaction open for a tiny fraction of 2B XIDs. :-(

Yeah, we should fix that too ;-). While we're at it:

I've been thinking that we should try harder to vacuum dead tuples that are no longer visible to any snapshot, even if there's an old snapshot still open. The stereotypical scenario is a table with a single row that's updated very, very frequently. Like a counter. Autovacuum can normally keep it in check, but if you have a long-running transaction, it will bloat horrendously.

But if you only have one long-running transaction with one really old snapshot, and everything else is recent, you'd really only need to keep one old tuple around for the old snapshot to see, and a recent version or two for the rest. At worst, the database needs to bloat to double the size, but not more than that.

To know which tuples are dead at such a fine-grained level, vacuum would need to know in more detail what snapshots the backends have. I'm really excited about Ants Aasma's proposal to use a CSN for snapshots, or more precisely the variant using the commit record's LSN for that. If a snapshot is just a single integer, it becomes easier for backends to share their snapshots, in a limited amount of shared memory.

- Heikki
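To illustrate why a single-integer snapshot makes this kind of fine-grained dead-tuple detection tractable, here is a standalone C sketch with invented names - it is not Ants Aasma's actual proposal nor any PostgreSQL API. With CSN-based snapshots, a committed-and-then-deleted tuple version only needs to be kept if some snapshot was taken between the commit of its inserter and the commit of its deleter.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t CommitSeqNo;

    static bool
    tuple_version_needed(CommitSeqNo xmin_csn,      /* commit of inserting xact */
                         CommitSeqNo xmax_csn,      /* commit of deleting xact  */
                         const CommitSeqNo *snapshots, int nsnapshots)
    {
        for (int i = 0; i < nsnapshots; i++)
        {
            /* snapshot taken after the insert committed, before the delete committed */
            if (snapshots[i] >= xmin_csn && snapshots[i] < xmax_csn)
                return true;
        }
        return false;           /* invisible to every snapshot: removable */
    }

In the counter-table scenario above, only the version straddling the old snapshot's CSN and the latest version or two would ever return true.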
On 06/07/2013 08:56 PM, Heikki Linnakangas wrote: > On 07.06.2013 21:33, Simon Riggs wrote: >> Now that I consider Greg's line of thought, the idea we focused on >> here was about avoiding freezing. But Greg makes me think that we may >> also wish to look at allowing queries to run longer than one epoch as >> well, if the epoch wrap time is likely to come down substantially. >> >> To do that I think we'll need to hold epoch for relfrozenxid as well, >> amongst other things. > > The biggest problem I see with that is that if a snapshot can be older > than 2 billion XIDs, it must be possible to store XIDs on the same > page that are more than 2 billion XIDs apart. It could be possible to recognise the situation and save the new XIDs on *another* page ? -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 06/07/2013 09:16 PM, Andres Freund wrote:
> On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
>> On 7 June 2013 19:56, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 07.06.2013 21:33, Simon Riggs wrote:
>>>> Now that I consider Greg's line of thought, the idea we focused on here was about avoiding freezing. But Greg makes me think that we may also wish to look at allowing queries to run longer than one epoch as well, if the epoch wrap time is likely to come down substantially.
>>>>
>>>> To do that I think we'll need to hold epoch for relfrozenxid as well, amongst other things.
>>> The biggest problem I see with that is that if a snapshot can be older than 2 billion XIDs, it must be possible to store XIDs on the same page that are more than 2 billion XIDs apart. All the discussed schemes where we store the epoch at the page level, either explicitly or derived from the LSN, rely on the fact that it's not currently necessary to do that. Currently, when one XID on a page is older than 2 billion XIDs, that old XID can always be replaced with FrozenXid, because there cannot be a snapshot old enough to not see it.
>> It does seem that there are two problems: avoiding freezing AND long running queries
>>
>> The long running query problem hasn't ever been looked at, it seems, until here and now.
> I'd say that's because it's prohibitive to run such long transactions anyway since it causes too much unremovable bloat. 2bio transactions really is quite a bit, I don't think it's a relevant restriction. Yet.

2bio transactions means at least 2G rows inserted, updated or deleted, which may or may not be "too much".

In my simple key/value update tests I was able to do 40k single-transaction updates per second on a moderately sized server. At this rate, 2G trx takes about 15 hours, which is not really very much.

One may for example have an unfinished 2PC transaction lingering over the weekend, and while of course one should have monitoring in place to avoid this, it still would be nice if this did not cause a database shutdown.

-- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On 01.06.2013 23:21, Robert Haas wrote:
> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN. When this bit is set, it means there are no unfrozen tuples on the page with XIDs that predate the current half-epoch. Whenever we know this to be true, we set the bit. If the page LSN crosses more than one half-epoch boundary at a time, we freeze the page and set the bit. If the page LSN crosses exactly one half-epoch boundary, then (1) if the bit is set, we clear it and (2) if the bit is not set, we freeze the page and set the bit.
>>
>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>
> Have at it.

Here's a first draft. A lot of stuff missing and broken, but "make check" passes :-).

In the patch, instead of working with "half-epochs", there are "XID-LSN ranges", which can be of arbitrary size. An XID-LSN range consists of three values:

minlsn: The point in WAL where this range begins.
minxid - maxxid: The range of XIDs allowed in this range.

Every point in WAL belongs to exactly one range. The minxid-maxxid of the ranges can overlap. For example:

1. XIDs 25000942 - 27000003    LSN 0/3BB9938
2. XIDs 23000742 - 26000003    LSN 0/2AB9288
3. XIDs 22000721 - 25000003    LSN 0/1AB8BE0
4. XIDs 22000002 - 24000003    LSN 0/10B1550

The invariant with the ranges is that a page with a certain LSN is only allowed to contain XIDs that belong to the range specified by that LSN. For example, if a page has LSN 0/3500000, it belongs to the 2nd range, and can only contain XIDs between 23000742 - 26000003. If a backend updates the page, so that its LSN is updated to, say, 0/3D12345, all XIDs on the page older than 25000942 must be frozen first, to avoid violating the rule.

The system keeps track of a small number of these XID-LSN ranges. Where we currently truncate clog, we can also truncate the ranges with maxxid < the clog truncation point. Vacuum removes any dead tuples and updates relfrozenxid as usual, to make sure that there are no dead tuples or aborted XIDs older than the minxid of the oldest tracked XID-LSN range. It no longer needs to freeze old committed XIDs, however - that's the gain from this patch (except to uphold the invariant, if it has to remove some dead tuples on the page and update its LSN).

A new range is created whenever we reach the maxxid on the current one. The new range's minxid is set to the current global oldest xmin value, and maxxid is just the old maxxid plus a fixed number (1 million in the patch, but we probably want a higher value in reality). This ensures that if you modify a page and update its LSN, all the existing XIDs on the page that cannot be frozen yet are greater than the minxid of the latest range. In other words, you can always freeze old XIDs on a page, so that any remaining non-frozen XIDs are within the minxid-maxxid of the latest range.

The HeapTupleSatisfies functions are modified to look at the page's LSN first. If it's old enough, they don't look at the XIDs on the page level at all, and just consider everything on the page to be visible to everyone (I'm calling this state a "mature page").

> I think the tricky part is going to be figuring out the synchronization around half-epoch boundaries.

Yep. I skipped all those difficult parts in this first version. There are two race conditions that need to be fixed:

1. When a page is updated, we check if it needs to be frozen, by checking whether its LSN is greater than the latest range's LSN - IOW, whether we've already modified the page, and thus frozen all older tuples, within the current range. However, it's possible that a new range is created immediately after we've checked that. When we then proceed to do the actual update on the page and WAL-log that, the new LSN falls within the next range, and we should've frozen the page. I'm planning to fix that by adding a "parity bit" to the page header. Each XID-LSN range is assigned a parity bit, 0 or 1. When we check if a page needs to be frozen on update, we make note of the latest range's parity bit, and write it in the page header. Later, when we look at the page's LSN to determine which XID-LSN range it belongs to, we compare the parity. If the parity doesn't match, we know that the race condition happened, so we treat the page as belonging to the previous range, not the one it would normally belong to per the LSN.

2. When we look at a page, and determine that it's not old enough to be "matured", we then check the clog as usual. A page is considered mature if the XID-LSN range (and corresponding clog pages) has already been truncated away. It's possible that between those steps, the XID-LSN range and clog are truncated away, so that the backend tries to access a clog page that doesn't exist anymore. To fix that, the XID-LSN range and clog truncation needs to be done in two steps: first, mark the truncation point in shared memory; then somehow wait until all backends see the new value, and go ahead with actually truncating the clog only after that.

Aside from those two race conditions, there are plenty of scalability issues remaining. Currently, the shared XID-LSN range array is checked every time a page is accessed, so it could quickly become a bottleneck. We need to cache that information in each backend.

Oh, and I didn't implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will get a freezing frenzy right after a new XID-LSN range is created.

I'll keep hacking away on those things, but please let me know if you see some fatal flaw with this plan.

- Heikki
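As an editorial illustration of the bookkeeping described above, here is a self-contained C sketch of the XID-LSN range lookup and the "mature page" test. The structure and names are simplified guesses, not the patch's actual code; in particular, real XID comparisons must be wraparound-aware and the array would live in shared memory.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct XidLSNRange
    {
        uint64_t    minlsn;     /* WAL position where this range begins */
        uint32_t    minxid;     /* oldest XID allowed under this range  */
        uint32_t    maxxid;     /* newest XID allowed under this range  */
    } XidLSNRange;

    /* ranges[] is ordered oldest..newest; ntracked is how many are kept */
    static const XidLSNRange *
    range_for_lsn(const XidLSNRange *ranges, int ntracked, uint64_t page_lsn)
    {
        for (int i = ntracked - 1; i >= 0; i--)
        {
            if (page_lsn >= ranges[i].minlsn)
                return &ranges[i];
        }
        return NULL;    /* older than anything tracked: the page is "mature" */
    }

    /*
     * A page whose LSN predates every tracked range (and the corresponding
     * clog) is treated as all-visible without looking at tuple XIDs at all.
     */
    static bool
    page_is_mature(const XidLSNRange *ranges, int ntracked, uint64_t page_lsn)
    {
        return range_for_lsn(ranges, ntracked, page_lsn) == NULL;
    }

    /*
     * Before an update stamps the page with a new LSN, any tuple XID older
     * than the latest range's minxid must be frozen so the invariant holds.
     */
    static bool
    xid_must_be_frozen(const XidLSNRange *latest, uint32_t tuple_xid)
    {
        return tuple_xid < latest->minxid;  /* ignores wraparound for brevity */
    }

Under this sketch, a visibility check would call page_is_mature() first and skip the clog entirely for pages older than the oldest tracked range, which is where the write-I/O saving comes from.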
On 10 June 2013 19:58, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 01.06.2013 23:21, Robert Haas wrote: >> >> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>>> >>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN. >>>> When this bit is set, it means there are no unfrozen tuples on the >>>> page with XIDs that predate the current half-epoch. Whenever we know >>>> this to be true, we set the bit. If the page LSN crosses more than >>>> one half-epoch boundary at a time, we freeze the page and set the bit. >>>> If the page LSN crosses exactly one half-epoch boundary, then (1) if >>>> the bit is set, we clear it and (2) if the bit is not set, we freeze >>>> the page and set the bit. >>> >>> >>> Yep, I think that would work. Want to write the patch, or should I? ;-) >> >> >> Have at it. > > > Here's a first draft. A lot of stuff missing and broken, but "make check" > passes :-). Well done, looks like good progress. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 10, 2013 at 4:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Well done, looks like good progress. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10.06.2013 21:58, Heikki Linnakangas wrote: > On 01.06.2013 23:21, Robert Haas wrote: >> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN. >>>> When this bit is set, it means there are no unfrozen tuples on the >>>> page with XIDs that predate the current half-epoch. Whenever we know >>>> this to be true, we set the bit. If the page LSN crosses more than >>>> one half-epoch boundary at a time, we freeze the page and set the bit. >>>> If the page LSN crosses exactly one half-epoch boundary, then (1) if >>>> the bit is set, we clear it and (2) if the bit is not set, we freeze >>>> the page and set the bit. >>> >>> Yep, I think that would work. Want to write the patch, or should I? ;-) >> >> Have at it. > > Here's a first draft. A lot of stuff missing and broken, but "make > check" passes :-). > > In the patch, instead of working with "half-epochs", there are "XID-LSN > ranges", which can be of arbitrary size. An XID-LSN range consists of > three values: > > minlsn: The point in WAL where this range begins. > minxid - maxxid: The range of XIDs allowed in this range. > > Every point in WAL belongs to exactly one range. The minxid-maxxid of > the ranges can overlap. For example: > > 1. XIDs 25000942 - 27000003 LSN 0/3BB9938 > 2. XIDs 23000742 - 26000003 LSN 0/2AB9288 > 3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0 > 4. XIDs 22000002 - 24000003 LSN 0/10B1550 > > The invariant with the ranges is that a page with a certain LSN is only > allowed to contain XIDs that belong to the range specified by that LSN. > For example, if a page has LSN 0/3500000, it belongs to the 2nd range, > and can only contain XIDs between 23000742 - 26000003. If a backend > updates the page, so that it's LSN is updated to, say, 0/3D12345, all > XIDs on the page older than 25000942 must be frozen first, to avoid > violating the rule. > > The system keeps track of a small number of these XID-LSN ranges. Where > we currently truncate clog, we can also truncate the ranges with maxxid > < the clog truncation point. Vacuum removes any dead tuples and updates > relfrozenxid as usual, to make sure that there are no dead tuples or > aborted XIDs older than the minxid of the oldest tracked XID-LSN range. > It no longer needs to freeze old committed XIDs, however - that's the > gain from this patch (except to uphold the invariant, if it has to > remove some dead tuples on the page and update its LSN). > > A new range is created whenever we reach the maxxid on the current one. > The new range's minxid is set to the current global oldest xmin value, > and maxxid is just the old maxxid plus a fixed number (1 million in the > patch, but we probably want a higher value in reality). This ensures > that if you modify a page and update its LSN, all the existing XIDs on > the page that cannot be frozen yet are greater than the minxid of the > latest range. In other words, you can always freeze old XIDs on a page, > so that any remaining non-frozen XIDs are within the minxid-maxxid of > the latest range. > > The HeapTupleSatisfies functions are modified to look at the page's LSN > first. If it's old enough, it doesn't look at the XIDs on the page level > at all, and just considers everything on the page is visible to everyone > (I'm calling this state a "mature page"). > >> I think the tricky part is going to be figuring out the >> synchronization around half-epoch boundaries. > > Yep. 
I skipped all those difficult parts in this first version. There > are two race conditions that need to be fixed: > > 1. When a page is updated, we check if it needs to be frozen. If its LSN > is greater than the latest range's LSN. IOW, if we've already modified > the page, and thus frozen all older tuples, within the current range. > However, it's possible that a new range is created immediately after > we've checked that. When we then proceed to do the actual update on the > page and WAL-log that, the new LSN falls within the next range, and we > should've frozen the page. I'm planning to fix that by adding a "parity > bit" on the page header. Each XID-LSN range is assigned a parity bit, 0 > or 1. When we check if a page needs to be frozen on update, we make note > of the latest range's parity bit, and write it in the page header. > Later, when we look at the page's LSN to determine which XID-LSN range > it belongs to, we compare the parity. If the parity doesn't match, we > know that the race condition happened, so we treat the page to belong to > the previous range, not the one it would normally belong to, per the LSN. > > 2. When we look at a page, and determine that it's not old enough to be > "matured", we then check the clog as usual. A page is considered mature, > if the XID-LSN range (and corresponding clog pages) has already been > truncated away. It's possible that between those steps, the XID-LSN > range and clog is truncated away, so that the backend tries to access a > clog page that doesn't exist anymore. To fix that, the XID-LSN range and > clog truncation needs to be done in two steps. First, mark the > truncation point in shared memory. Then somehow wait until all backends > see the new value, and go ahead with actually truncating the clog only > after that. > > Aside from those two race conditions, there are plenty of scalability > issues remaining. Currently, the shared XID-LSN range array is checked > every time a page is accessed, so it could quickly become a bottleneck. > Need to cache that information in each backend. Oh, and I didn't > implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will > get a freezing frenzy right after a new XID-LSN range is created. > > I'll keep hacking away on those things, but please let me know if you > see some fatal flaw with this plan. Here's an updated patch. The race conditions I mentioned above have been fixed. This is still definitely work-in-progress, but overall I think this is quite promising. The patch basically works, although there are a bunch of TODO items like handling unlogged tables. This patch is also available in my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "freeze-by-xid-lsn-ranges". - Heikki
Attachment
On 27.08.2013 18:56, Heikki Linnakangas wrote: > Here's an updated patch. Ah, forgot one thing: Here's a little extension I've been using to test this. It contains two functions; one to simply consume N xids, making it faster to hit the creation of new XID ranges and wraparound. The other, print_xidlsnranges(), prints out the contents of the current XID-LSN range array. Also, just ran into two silly bugs in the patch version I posted, while checking that that xidtest extension hasn't bitrotted. A fixed version has been pushed to the git repository, so make sure you use that version if you want to actually run it. - Heikki
Attachment
Hi Heikki, On 2013-08-27 18:56:15 +0300, Heikki Linnakangas wrote: > Here's an updated patch. The race conditions I mentioned above have been > fixed. Thanks for posting the new version! I have a quick question: The reason I'd asked about the status of the patch was that I was thinking about the state of the "forensic freezing" patch. After a quick look at your proposal, we still need to freeze in some situations (old & new data on the same page basically), so I'd say it still makes sense to apply the forensic freezing patch, right? Differing Opinions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote: > I have a quick question: The reason I'd asked about the status of the > patch was that I was thinking about the state of the "forensic freezing" > patch. After a quick look at your proposal, we still need to freeze in > some situations (old & new data on the same page basically), so I'd say > it still makes sense to apply the forensic freezing patch, right? > > Differing Opinions? The Freeze Forensically patch is nice because (if I understand it correctly) it allows us to freeze at the same time as we mark PD_ALL_VISIBLE, which avoids the potential extra page write. But that's not such a big advantage if we don't ordinarily have to write again for freezing, anyway. However, there are still some cases where we'd be able to preserve the forensic information. If nothing else, that might help debug this patch, if necessary. There might also be cases where we can freeze more eagerly to avoid the case where very old (but unfrozen) and very new tuples mix on the same page. Perhaps Robert has some thoughts here, as well. Regards,Jeff Davis
On Mon, Sep 2, 2013 at 3:16 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2013-08-30 at 20:34 +0200, Andres Freund wrote: >> I have a quick question: The reason I'd asked about the status of the >> patch was that I was thinking about the state of the "forensic freezing" >> patch. After a quick look at your proposal, we still need to freeze in >> some situations (old & new data on the same page basically), so I'd say >> it still makes sense to apply the forensic freezing patch, right? >> >> Differing Opinions? > > The Freeze Forensically patch is nice because (if I understand it > correctly) it allows us to freeze at the same time as we mark > PD_ALL_VISIBLE, which avoids the potential extra page write. The patch itself doesn't actually make that change, but it removes one major objection to such a change. > But that's > not such a big advantage if we don't ordinarily have to write again for > freezing, anyway. That was my thought as well. > However, there are still some cases where we'd be able to preserve the > forensic information. If nothing else, that might help debug this patch, > if necessary. There might also be cases where we can freeze more eagerly > to avoid the case where very old (but unfrozen) and very new tuples mix > on the same page. Perhaps Robert has some thoughts here, as well. I basically agree. I think if we adopt Heikki's patch forensic freezing becomes much less important, but we might find there's still a reason to do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 27.08.2013 19:37, Heikki Linnakangas wrote: > On 27.08.2013 18:56, Heikki Linnakangas wrote: >> Here's an updated patch. > > Ah, forgot one thing: > > Here's a little extension I've been using to test this. It contains two > functions; one to simply consume N xids, making it faster to hit the > creation of new XID ranges and wraparound. The other, > print_xidlsnranges(), prints out the contents of the current XID-LSN > range array. > > Also, just ran into two silly bugs in the patch version I posted, while > checking that that xidtest extension hasn't bitrotted. A fixed version > has been pushed to the git repository, so make sure you use that version > if you want to actually run it. Here's a rebased version of the patch, including the above-mentioned fixes. Nothing else new. - Heikki -- - Heikki
Attachment
On 9/16/13 9:59 AM, Heikki Linnakangas wrote: > Here's a rebased version of the patch, including the above-mentioned > fixes. Nothing else new. It still fails to apply. You might need to rebase it again.
On 2013-09-17 09:41:47 -0400, Peter Eisentraut wrote: > On 9/16/13 9:59 AM, Heikki Linnakangas wrote: > > Here's a rebased version of the patch, including the above-mentioned > > fixes. Nothing else new. > > It still fails to apply. You might need to rebase it again. FWIW, I don't think it's too important that this specific patch applies all the time. From the look I had and the discussions onlist, what it needs so far is mostly architecture review and such. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas escribió: > Here's a rebased version of the patch, including the above-mentioned > fixes. Nothing else new. Nice work. I apologize for the conflicts I created yesterday. I would suggest renaming varsup_internal.h to varsup_xlog.h. You added a FIXME comment to HeapTupleHeaderIsOnlyLocked. I think the answer to the question is yes, because checking for locked might incur examining the Xmax of tuples, which cannot be done if the page is mature; perhaps the check for maturity needs to happen only after the infomask has been checked. The new stuff in GetNewTransactionId() is said to be "very expensive" (it might involve grabbing the ProcArrayLock and scanning the whole procarray, for example.) There's a comment about having this be done only in other processes, and I think that's a good idea, otherwise we risk adding a lot of latency, randomly, to client-connected processes which might be latency sensitive. I think autovacuum is not a good choice however (it might not even be running). How about using the bgwriter or walwriter for this? As far as I can tell, there's no need for this process to actually be able to run transactions or scan catalogs; the ability to lock and scan ProcArray appears sufficient. Using a RTC (instead of the Xid counter % 128) to determine when to run this doesn't look like a problem to me. Something that sleeps for too long might be, so we would need to ensure that it happens timely. Not sure what's an appropriate time for this, however. Another option is have backends check the Xid counter, and every 128 ticks set a flag in shmem so that the background process knows to execute the switch. heap_freeze_tuple() receives one Xid and one MultiXactId; they can be passed as Invalid to mean forced freezing. Do we really need to offer the possibility of freezing xids but not multis, and multis but not xids? From a conceptual PoV, it seems to make more sense to me to pass a boolean "force" meaning freeze everything, and ignore the numerical values. The greatest risk in this patch is the possibility that some heap_freeze_page() call is forgotten. I think you covered all cases in heapam.c. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
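As a rough sketch of that last suggestion (all names below are invented for illustration; none of this is from the patch), backends would only do a cheap flag write on every 128th XID, and the expensive work would happen in a background process:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define SWITCH_CHECK_INTERVAL 128

    /* Lives in shared memory; volatile so the flag accesses aren't optimized away. */
    typedef struct RangeSwitchShared
    {
        volatile bool switchRequested;
    } RangeSwitchShared;

    static RangeSwitchShared *rangeSwitchShared;

    /* Called from GetNewTransactionId() right after assigning 'xid': cheap. */
    static inline void
    MaybeRequestRangeSwitch(TransactionId xid)
    {
        if (xid % SWITCH_CHECK_INTERVAL == 0)
            rangeSwitchShared->switchRequested = true;
    }

    /* Called from the background process's main loop, outside any backend. */
    static void
    ProcessRangeSwitchRequest(void)
    {
        if (rangeSwitchShared->switchRequested)
        {
            rangeSwitchShared->switchRequested = false;
            /*
             * The expensive part would run here: take ProcArrayLock, compute
             * the current oldest xmin, and perform the XID-LSN range switch.
             */
        }
    }

The cost in the XID-assignment path is then a single shared-memory store, while the ProcArray scan and the range switch itself run outside any latency-sensitive backend.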
On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote: > Here's a rebased version of the patch, including the above-mentioned fixes. > Nothing else new. * We need some higher-level description of the algorithm somewhere in the source. I don't think I've understood the concept from the patch alone without having read the thread previously. * why do we need to do the PageUpdateNeedsFreezing() dance in heap_page_prune? No xids should change during it. * Why can we do a GetOldestXmin(allDbs = false) in BeginXidLSNRangeSwitch()? * Is there any concrete reasoning behind the current values for XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling? * the lsn ranges file can possibly become bigger than 512 bytes (the size we assume to be written atomically) and you write it in place. If we fail halfway through writing, we seem to be able to recover by using the pageMatureLSN from the last checkpoint, but it seems better to do the fsync(),rename(),fsync() dance either way. * Should we preemptively freeze tuples on a page in lazy_scan_heap if we already have dirtied the page? That would make future modifications cheaper. * lazy_scan_heap now blocks acquiring a cleanup lock on every buffer that contains dead tuples. Shouldn't we use some kind of cutoff xid there? That might block progress too heavily. Also the comment above it still refers to the old logic. * There's no way to force a full table vacuum anymore, that seems problematic to me. * I wonder if CheckPointVarsup() doesn't need to update minRecoveryPoint. StartupVarsup() should be ok, because we should only read one from the future during a basebackup? * xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why can't we just reset dirty before the WriteXidLSNRangesFile() call? There's only one process doing the writeout. Just because the checkpointing process could be killed? * I think we should either not require consuming a multixactid or use a function that doesn't need MultiXactIdSetOldestMember(). If the transaction doing so lives for long it will unnecessarily prevent truncation of mxacts. * switchFinishXmin and nextSwitchXid should probably be either volatile or have a compiler barrier between accessing shared memory and checking them. The compiler very well could optimize them away and access shmem all the time which could lead to weird results. * I wonder whether the fact that we're doing the range switches after acquiring an xid could be problematic if we're preventing xid allocation due to the checks earlier in that function? * I think heap_lock_tuple() needs to unset all-visible, otherwise we won't vacuum that page again which can lead to problems since we don't do full-table vacuums again? So, I think that's enough for a first look. Will think about general issues a bit more. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Here's a rebased version of the patch, including the above-mentioned fixes. Nothing else new.
I've applied this to 0892ecbc015930d, the last commit to which it applies cleanly.
When I test this by repeatedly incrementing a counter in a randomly chosen row, then querying the whole table and comparing the results to what my driver knows they should be, I get discrepancies.
No crash/recovery needs to be done to get the behavior.
The number of rows is correct, so one version of every row is visible, but it is the wrong version.
The discrepancy arises shortly after the first time this type of message appears:
6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG: started new XID range, XIDs 1000033-, MultiXIDs 1-, tentative LSN 0/FA517F8
6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT: update foo set count=count+1 where index=$1
6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG: closed old XID range at 1000193 (LSN 0/FA58A08)
6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT: update foo set count=count+1 where index=$1
I'll work on getting the driver to shutdown the database the first time it finds a problem so that autovac doesn't destroy evidence.
Cheers,
Jeff
On Wed, Sep 18, 2013 at 12:55 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: Here's a rebased version of the patch, including the above-mentioned fixes. Nothing else new.
I've applied this to 0892ecbc015930d, the last commit to which it applies cleanly.
When I test this by repeatedly incrementing a counter in a randomly chosen row, then querying the whole table and comparing the results to what my driver knows they should be, I get discrepancies.
No crash/recovery needs to be done to get the behavior.
The number of rows is correct, so one version of every row is visible, but it is sometimes the wrong version.
The discrepancy arises shortly after the first time this type of message appears:
6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG: started new XID range, XIDs 1000033-, MultiXIDs 1-, tentative LSN 0/FA517F8
6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT: update foo set count=count+1 where index=$1
6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG: closed old XID range at 1000193 (LSN 0/FA58A08)
6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT: update foo set count=count+1 where index=$1
I'll work on getting the driver to shutdown the database the first time it finds a problem so that autovac doesn't destroy evidence.
I have uploaded the script to reproduce, and a tarball of the data directory (when started, it will go through recovery. table "foo" is in the jjanes database and role.)
The row with index=8499 should have count of 8, but really has count of 4, and is only findable by seq scan, there is no such row by index scan.
select ctid,* from foo where index=8499;
select ctid,* from foo where index+0=8499;
select * from heap_page_items(get_raw_page('foo',37)) where lp=248 \x\g\x
Expanded display is on.
-[ RECORD 1 ]---------
lp | 248
lp_off | 8160
lp_flags | 1
lp_len | 32
t_xmin | 2
t_xmax | 0
t_field3 | 0
t_ctid | (37,248)
t_infomask2 | 32770
t_infomask | 10496
t_hoff | 24
t_bits |
t_oid |
So the xmax is 0 when it really should not be.
What I really want to do is find the not-visible ctids which would have 8499 for index, but I don't know how to do that.
Cheers,
Jeff
On 18.09.2013 16:22, Andres Freund wrote: > On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote: >> Here's a rebased version of the patch, including the above-mentioned fixes. >> Nothing else new. > > * We need some higherlevel description of the algorithm somewhere in the > source. I don't think I've understood the concept from the patch alone > without having read the thread previously. > * why do we need to do the PageUpdateNeedsFreezing() dance in > heap_page_prune? No xids should change during it. Ah, but its LSN does change. Moving the LSN forward might move the LSN from one xid-lsn range to another, requiring any XIDs on the page that fall outside the xid range of the new xid-lsn range to be frozen. > * Why can we do a GetOldestXmin(allDbs = false) in > BeginXidLSNRangeSwitch()? Looks like a bug. I think I got the arguments backwards, was supposed to be allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be ignored, but 'false' is the safe option. > * Is there any concrete reasoning behind the current values for > XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling? The values in the patch are: #define NUM_XID_LSN_RANGES 100 #define XID_LSN_RANGE_INTERVAL 1000000 The interval is probably way too small for production, but a small value is handy for testing, as you can get through ranges faster. Something like 100 million would probably be a more suitable value for production. With a smaller interval, you need to freeze more often, while with a large interval, you can't truncate the clog as early. NUM_XID_LSN_RANGES is also quite arbitrary. I don't have a good feeling on what the appropriate sizing for that would be. Something like 5 or 10 would probably be enough. Although at the moment, the patch doesn't handle the situation that you run out of slots very well. You could merge some old ranges to make room, but ATM, it just stops creating new ones. > * the lsn ranges file can possibly become bigger than 512bytes (the size > we assume to be written atomically) and you write it inplace. If we > fail halfway through writing, we seem to be able to recover by using > the pageMatureLSN from the last checkpoint, but it seems better to > do the fsync(),rename(),fsync() dance either way. The lsn-range array is also written to the xlog in toto whenever it changes, so on recovery, the ranges will be read from the WAL, and the ranges file will be recreated at the end-of-recovery checkpoint. > * Should we preemptively freeze tuples on a page in lazy_scan_heap if we > already have dirtied the page? That would make future modifcations > cheaper. Hmm, dunno. Any future modification will also dirty the page, so you're not actually saving any I/O by freezing it earlier. You're just choosing to do the CPU work and WAL insertion earlier than you have to. And if the page is not modified later, it is a waste of cycles. That said, maybe it would indeed be better to do it in vacuum when you have a chance to reduce latency in the critical path, even if it might be a waste. > * lazy_scan_heap now blocks acquiring a cleanup lock on every buffer > that contains dead tuples. Shouldn't we use some kind of cutoff xid > there? That might block progress too heavily. Also the comment above > it still refers to the old logic. Hmm, you mean like skip the page even if there are some dead tuples on it, as long as the dead tuples are not older than X xids? I guess we could do that. Or the other thing mentioned in the comments, ie. remember the page and come back to it later. 
For now though, I think it's good enough as it is. > * There's no way to force a full table vacuum anymore, that seems > problematic to me. Yeah, it probably would be good to have that. An "ignore visibility map" option. > * I wonder if CheckPointVarsup() doesn't need to update > minRecoveryPoint. Hmm, I guess it should. > StartupVarsup() should be ok, because we should only > read one from the future during a basebackup? You might read one from the future during recovery, whether it's crash or archive recovery, but as soon as you've replayed up to the min recovery point, or the end of WAL on crash recovery, the XID ranges array in memory should be consistent with the rest of the system, because changes to the array are WAL logged. > * xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why > can't we just reset dirty before the WriteXidLSNRangesFile() call? > There's only one process doing the writeout. Just because the > checkpointing process could be killed? Right, if the write fails, you need to retry on the next checkpoint. > * I think we should either not require consuming an multixactid or use a > function that doesn't need MultiXactIdSetOldestMember(). If the > transaction doing so lives for long it will unnecessarily prevent > truncation of mxacts. Agreed, that was just a quick kludge to get it working. > * switchFinishXmin and nextSwitchXid should probably be either volatile > or have a compiler barrier between accessing shared memory and > checking them. The compiler very well could optimize them away and > access shmem all the time which could lead to weird results. Hmm, that would be a weird "optimization". Is that really legal for the compiler to do? Better safe than sorry, I guess. > * I wonder whether the fact that we're doing the range switches after > acquiring an xid could be problematic if we're preventing xid > allocation due to the checks earlier in that function? You mean that you might get into a situation where you cannot assign a new XID because you've reached xidStopLimit, and you cannot advance xidStopLimit because you can't switch to a new xid-lsn range, because no new XIDs are being assigned? Yeah, that would be nasty. The range management stuff should really be moved out of GetNewTransaction(), maybe do them in walwriter or bgwriter as Alvaro suggested. > * I think heap_lock_tuple() needs to unset all-visible, otherwise we > won't vacuum that page again which can lead to problems since we > don't do full-table vacuums again? It's OK if the page is never vacuumed again, the whole point of the patch is that old XIDs can be just left lingering in the table. The next time the page is updated, perhaps to lock a tuple again, the page will be frozen and the old xmax will be cleared. > So, I think that's enough for a first look. Will think about general > issues a bit more. Thanks! - Heikki
Hi, On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote: > On 18.09.2013 16:22, Andres Freund wrote: > >* Why can we do a GetOldestXmin(allDbs = false) in > > BeginXidLSNRangeSwitch()? > > Looks like a bug. I think I got the arguments backwards, was supposed to be > allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be > ignored, but 'false' is the safe option. Not sure either... > >* the lsn ranges file can possibly become bigger than 512bytes (the size > > we assume to be written atomically) and you write it inplace. If we > > fail halfway through writing, we seem to be able to recover by using > > the pageMatureLSN from the last checkpoint, but it seems better to > > do the fsync(),rename(),fsync() dance either way. > > The lsn-range array is also written to the xlog in toto whenever it changes, > so on recovery, the ranges will be read from the WAL, and the ranges file > will be recreated at the end-of-recovery checkpoint. But we will read the file before we read the WAL record covering it, won't we? > >* Should we preemptively freeze tuples on a page in lazy_scan_heap if we > > already have dirtied the page? That would make future modifcations > > cheaper. > Hmm, dunno. Any future modification will also dirty the page, so you're not > actually saving any I/O by freezing it earlier. You're just choosing to do > the CPU work and WAL insertion earlier than you have to. And if the page is > not modified later, it is a waste of cycles. That said, maybe it would > indeed be better to do it in vacuum when you have a chance to reduce latency > in the critical path, even if it might be a waste. Well, if we already dirtied the page (as in vacuum itself, to remove dead tuples) we already have to do WAL. True it's a separate record, but that should probably be fixed. > >* lazy_scan_heap now blocks acquiring a cleanup lock on every buffer > > that contains dead tuples. Shouldn't we use some kind of cutoff xid > > there? That might block progress too heavily. Also the comment above > > it still refers to the old logic. > > Hmm, you mean like skip the page even if there are some dead tuples on it, > as long as the dead tuples are not older than X xids? I guess we could do > that. Or the other thing mentioned in the comments, ie. remember the page > and come back to it later. > > For now though, I think it's good enough as it is. I have seen vacuum being slowed down considerably because we couldn't acquire a cleanup lock. It's not that infrequent to have pages that are pinned most of the time. And that was when we only acquired the cleanup lock when necessary for freezing. Not processing the page will not mark it as all-visible which basically is remembering it... > >* switchFinishXmin and nextSwitchXid should probably be either volatile > > or have a compiler barrier between accessing shared memory and > > checking them. The compiler very well could optimize them away and > > access shmem all the time which could lead to weird results. > > Hmm, that would be a weird "optimization". Is that really legal for the > compiler to do? Better safe than sorry, I guess. I think it is. The compiler doesn't know anything about shared memory or threads, so it doesn't see that those parameters can change. It will be forced to load them as soon as an non-inlined function is called which actually might make it safe in this case - unless you compile in LTO mode where it recognizes that TransactionIdFollowsOrEquals() won't modify thigns... 
> >* I wonder whether the fact that we're doing the range switches after > > acquiring an xid could be problematic if we're preventing xid > > allocation due to the checks earlier in that function? > > You mean that you might get into a situation where you cannot assign a new > XID because you've reached xidStopLimit, and you cannot advance xidStopLimit > because you can't switch to a new xid-lsn range, because no new XIDs are > being assigned? Yeah, that would be nasty. The range management stuff should > really be moved out of GetNewTransaction(), maybe do them in walwriter or > bgwriter as Alvaro suggested. Yes, I think we should do that. I am not sure yet where to put it best though. Doing it in the wal writer doesn't seem to be a good idea, doing more there will increase latency for normal backends. > >* I think heap_lock_tuple() needs to unset all-visible, otherwise we > > won't vacuum that page again which can lead to problems since we > > don't do full-table vacuums again? > > It's OK if the page is never vacuumed again, the whole point of the patch is > that old XIDs can be just left lingering in the table. The next time the > page is updated, perhaps to lock a tuple again, the page will be frozen and > the old xmax will be cleared. Yes, you're right, it should be possible to make it work that way. But currently, we look at xmax and infomask of a tuple in heap_lock_tuple() *before* the PageUpdateNeedsFreezing() call. Currently we would thus create a new multixact with long dead xids as members and such. Or am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-19 14:40:35 +0200, Andres Freund wrote: > > >* I think heap_lock_tuple() needs to unset all-visible, otherwise we > > > won't vacuum that page again which can lead to problems since we > > > don't do full-table vacuums again? > > > > It's OK if the page is never vacuumed again, the whole point of the patch is > > that old XIDs can be just left lingering in the table. The next time the > > page is updated, perhaps to lock a tuple again, the page will be frozen and > > the old xmax will be cleared. > > Yes, you're right, it should be possible to make it work that way. But > currently, we look at xmax and infomask of a tuple in heap_lock_tuple() > *before* the PageUpdateNeedsFreezing() call. Currently we would thus > create a new multixact with long dead xids as members and such. That reminds me of something: There are lots of checks sprayed around that unconditionally look at xmin, xmax or infomask. Things I noticed in a quick look after thinking about the previous statment: * AlterEnum() looks at xmin * The PLs look at xmin * So does afair VACUUM FREEZE, COPY and such. * Several HeapTupleSatisfiesVacuum() callers look at xmin and stuff without checking for maturity or freezing the page first.* lazy_scan_heap() itself if the page isn't already marked all_visible * heap_page_is_all_visible() * rewrite_heap_tuple() does an unconditional heap_freeze_tuple() without considering maturity/passing in Invalid* * heap_page_prune_opt() shouldn't do anything on a mature (or even all visible) page. * heap_page_prune() marks a buffer dirty, writes a new xid without changing the lsn. * heap_prune_chain() looks at tuples without considering page maturity, but the current implementation actually looks safe. * CheckForSerializableConflictOut() looks at xmins. * pgrowlocks() looks at xmax unconditionally. * heap_update() computes, just like heap_lock_tuple(), a new xmax/infomask before freezing. * Same for heap_lock_updated_tuple(). Not sure if that's an actual concern, but it might if PageMatureLSN advances or such. * heap_getsysattr() should probably return different things for mature pages. There's probably more. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 19, 2013 at 2:42 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> * switchFinishXmin and nextSwitchXid should probably be either volatile >> or have a compiler barrier between accessing shared memory and >> checking them. The compiler very well could optimize them away and >> access shmem all the time which could lead to weird results. > > Hmm, that would be a weird "optimization". Is that really legal for the > compiler to do? Better safe than sorry, I guess. C doesn't really have a multi-threaded memory model before C11, so the compiler makers have just made one up that suits them best. For unlocked memory reads the compiler is free to assume that no one else is accessing the variable, so yes that "optimization" would be legal according to their rules. I'm tackling similar issues in my patch. What I'm thinking is that we should change our existing supposedly atomic accesses to be more explicit and make the API compatible with C11 atomics[1]. For now I think the changes should be something like this: * Have separate typedefs for atomic variants of datatypes (I don't think we have a whole lot of them). With C11 atomics support, this would change to typedef _Atomic TransactionId AtomicTransactionId; * Require that pg_atomic_init(type, var, val) be used to init atomic variables. Initially it would just pass through to assignment, as all supported platforms can do 32bit atomic ops. C11 compatible compilers will delegate to atomic_init(). * Create atomic read and write macros that look something like: #define pg_atomic_load(type, var) (*((volatile type*) &(var))) and #define pg_atomic_store(type, var, val) do { \ *((volatile type*) &(var)) = (val); \ } while(0) C11 would pass through to the compiler's implementation with relaxed memory ordering. * Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes a convincing argument why loosening up the barrier semantics is the sane choice here. [2] C11 support can then pass through to atomic_thread_fence(). This way we have a relatively future proof baseline for lockfree programming, with options to expand with other primitives. We could also only do the volatile access macros part, at least it would make the intention more clear than the current approach of sprinkling around volatile pointers. Regards, Ants Aasma [1] http://en.cppreference.com/w/c/atomic [2] (long video about atomics) http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
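To show how the proposed API would be used, here is a small sketch built only on the pg_atomic_load/pg_atomic_store fallbacks quoted above. The shared struct, variable and function names are invented, and wraparound-aware XID comparison is elided.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* The proposed fallback macros, as given above. */
    #define pg_atomic_load(type, var) (*((volatile type *) &(var)))
    #define pg_atomic_store(type, var, val) do { \
            *((volatile type *) &(var)) = (val); \
        } while (0)

    /* Invented example of state shared between backends. */
    typedef struct SwitchStateShared
    {
        TransactionId switchFinishXmin;
        TransactionId nextSwitchXid;
    } SwitchStateShared;

    static SwitchStateShared *switchState;

    /* Writer side: publish new switch points. */
    static void
    PublishSwitchState(TransactionId finishXmin, TransactionId nextXid)
    {
        pg_atomic_store(TransactionId, switchState->switchFinishXmin, finishXmin);
        pg_atomic_store(TransactionId, switchState->nextSwitchXid, nextXid);
    }

    /* Reader side: the volatile cast forces a real load from shared memory. */
    static bool
    XidPastSwitchPoint(TransactionId xid)
    {
        TransactionId next = pg_atomic_load(TransactionId, switchState->nextSwitchXid);

        return xid >= next;     /* real code would use TransactionIdFollowsOrEquals() */
    }

The point of the macros is only to make the unlocked shared-memory accesses explicit; memory-ordering guarantees would still come from the separate fence primitives.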
On Thu, Sep 19, 2013 at 6:27 PM, Ants Aasma <ants@cybertec.at> wrote: > I'm tackling similar issues in my patch. What I'm thinking is that we should > change our existing supposedly atomic accesses to be more explicit and make > the API compatible with C11 atomics[1]. For now I think the changes should be > something like this: > > * Have separate typedefs for atomic variants of datatypes (I don't think we > have a whole lot of them). With C11 atomics support, this would change to > > typedef _Atomic TransactionId AtomicTransactionId; What's the point of this? > * Require that pg_atomic_init(type, var, val) be used to init atomic > variables. Initially it would just pass through to assignment, as all > supported platforms can do 32bit atomic ops. C11 compatible compilers will > delegate to atomic_init(). I don't think this is a bad idea for decoration, but again I'm not sure how much fundamental value it adds. If it makes it easier for people to write code that works in C11 and fails on C89, we lose. > * Create atomic read and write macros that look something like: > > #define pg_atomic_load(type, var) (*((volatile type*) &(var))) > > and > > #define pg_atomic_store(type, var, val) do { \ > *((volatile type*) &(var)) = (val); \ > } while(0) > > C11 would pass through to the compilers implementation with relaxed memory > ordering. Same comment. > * Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to > pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes > a convincing argument why loosening up the barrier semantics is the sane > choice here. [2] C11 support can then pass though to atomic_thread_fence(). I am entirely unconvinced that we need this. Some people use acquire + release fences, some people use read + write fences, and there are all combinations (read-acquire, read-release, read-acquire-release, write-acquire, write-release, write-acquire-release, both-acquire, both-release, both-acquire-release). I think we're going to have enough trouble deciding between the primitives we already have, let alone with a whole mess of new ones. Mistakes will only manifest themselves on certain platforms and in many cases the races are so tight that actual failures are very unlikely to be reserved in regression testing. Personally, I think the biggest change that would help here is to mandate that spinlock operations serve as compiler fences. That would eliminate the need for scads of volatile references all over the place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I agree with most of what you said - I think that's a little bit too much change for too little benefit. On 2013-09-20 08:32:29 -0400, Robert Haas wrote: > Personally, I think the biggest change that would help here is to > mandate that spinlock operations serve as compiler fences. That would > eliminate the need for scads of volatile references all over the > place. They effectively already do, don't they? It's an external, non-inlineable function call (s_lock, not the actual TAS). And even if it were to get inlined via LTO optimization, the inline assembler we're using is doing the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler barrier. The non-asm implementations call to OS/compiler primitives that are also fences. In the case I brought up here there is no spinlock or something similar. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-09-20 08:32:29 -0400, Robert Haas wrote: >> Personally, I think the biggest change that would help here is to >> mandate that spinlock operations serve as compiler fences. That would >> eliminate the need for scads of volatile references all over the >> place. > > They effectively already do, don't they? It's an external, non-inlineable > function call (s_lock, not the actual TAS). And even if it were to get > inlined via LTO optimization, the inline assembler we're using is doing > the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler > barrier. > The non-asm implementations call to OS/compiler primitives that are also > fences. > > In the case I brought up here there is no spinlock or something similar. Well, that doesn't match my previous discussions with Tom Lane, or this comment: * Another caution for users of these macros is that it is the caller's * responsibility to ensure that the compiler doesn't re-order accesses * to shared memory to precede the actual lock acquisition, or follow the * lock release. Typically we handle this by using volatile-qualified * pointers to refer to both the spinlock itself and the shared data * structure being accessed within the spinlocked critical section. * That fixes it because compilers are not allowed to re-order accesses * to volatile objects relative to other such accesses. That's not to disagree with your point. If TAS is a compiler barrier, then we oughta be OK. Unless something can migrate into the spot between a failed TAS and the call to s_lock? But I'm pretty sure that we've repeatedly had to change code to keep things from falling over in this area, see e.g. commits fa72121594534dda6cc010f0bf6b3e8d22987526, 07eeb9d109c057854b20707562ce517efa591761, d3fc362ec2ce1cf095950dddf24061915f836c22, and 584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live bug). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-09-20 08:54:26 -0400, Robert Haas wrote: > On Fri, Sep 20, 2013 at 8:40 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-09-20 08:32:29 -0400, Robert Haas wrote: > >> Personally, I think the biggest change that would help here is to > >> mandate that spinlock operations serve as compiler fences. That would > >> eliminate the need for scads of volatile references all over the > >> place. > > > > They effectively already do, don't they? It's an external, non-inlineable > > function call (s_lock, not the actual TAS). And even if it were to get > > inlined via LTO optimization, the inline assembler we're using is doing > > the __asm__ __volatile__ ("...", "memory") dance. That's a full compiler > > barrier. > > The non-asm implementations call to OS/compiler primitives that are also > > fences. > > > > In the case I brought up here there is no spinlock or something similar. > > Well, that doesn't match my previous discussions with Tom Lane, or this comment: > > * Another caution for users of these macros is that it is the caller's > * responsibility to ensure that the compiler doesn't re-order accesses > * to shared memory to precede the actual lock acquisition, or follow the > * lock release. Typically we handle this by using volatile-qualified > * pointers to refer to both the spinlock itself and the shared data > * structure being accessed within the spinlocked critical section. > * That fixes it because compilers are not allowed to re-order accesses > * to volatile objects relative to other such accesses. To me that doesn't really make much sense to be honest. Note that the next paragraph tells us that * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and * S_UNLOCK() macros must further include hardware-level memory fence * instructions to prevent similar re-ordering at the hardware level. * TAS() and TAS_SPIN() must guarantee that loads and stores issued after * the macro are not executed until the lock has been obtained. Conversely, * S_UNLOCK() must guarantee that loads and stores issued before the macro * have been executed before the lock is released. so, TAS has to work as a memory barrier if the architecture doesn't have strong cache coherency guarantees itself. But memory barriers have to be compiler barriers? > That's not to disagree with your point. If TAS is a compiler barrier, > then we oughta be OK. Unless something can migrate into the spot > between a failed TAS and the call to s_lock? We're talking compiler barriers, right? In that case failure or success do not play a role, or am I missing something? > But I'm pretty sure that > we've repeatedly had to change code to keep things from falling over > in this area, see e.g. commits > 584f818bef68450d23d1b75afbaf19febe38fd37 (the last apparently a live > bug). > d3fc362ec2ce1cf095950dddf24061915f836c22, and I've quickly checked those out, and things looked mighty different back then. And incidentally several of the inline assembly implementations back then didn't specify that they clobber memory (which is what makes it a compiler barrier). > fa72121594534dda6cc010f0bf6b3e8d22987526, Not sure here. Several of the inline assembly bits were changed to clobber memory, but not all. > 07eeb9d109c057854b20707562ce517efa591761, Hm. Those mostly look to be overly cautious to me. I think we should go through the various implementations and make sure they are actual compiler barriers and then change the documented policy.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund escribió: > Hi, > > On 2013-09-19 14:42:19 +0300, Heikki Linnakangas wrote: > > On 18.09.2013 16:22, Andres Freund wrote: > > >* Why can we do a GetOldestXmin(allDbs = false) in > > > BeginXidLSNRangeSwitch()? > > > > Looks like a bug. I think I got the arguments backwards, was supposed to be > > allDbs = true and ignoreVacuum = false. I'm not sure if vacuums could be > > ignored, but 'false' is the safe option. > > Not sure either... The "ignoreVacuum" flag specifies to ignore backends running non-full VACUUM, that is, processes that are known never to generate new Xids, never obtain Xmin, and never to insert tuples anywhere. With all these restrictions, I think it's safe to use ignoreVacuum=true here. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-20 16:47:24 +0200, Andres Freund wrote: > I think we should go through the various implementations and make sure > they are actual compiler barriers and then change the documented policy. From a quick look * S_UNLOCK for PPC isn't a compiler barrier * S_UNLOCK for MIPS isn't a compiler barrier * I don't know enough about unixware (do we still support that as a platform even) to judge * True64 Alpha I have no clue about * PA-RISCs tas() might not be a compiler barrier for !GCC * PA-RISCs S_UNLOCK might not be a compiler barrier * HP-UX !GCC might not * IRIX 5 seems to be a compiler barrier * SINIX - I don't care * AIX PPC - compiler barrier * Sun - TAS is implemented in external assembly, normal function call, compiler barrier * Win(32|64) - compiler barrier * Generic S_UNLOCK *NOT* necessarily a compiler barrier. Ok, so I might have been a bit too optimistic... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi
On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
> I think we should go through the various implementations and make sure
> they are actual compiler barriers and then change the documented policy.
From a quick look
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a
platform even) to judge
* True64 Alpha I have no clue about
* PA-RISCs tas() might not be a compiler barrier for !GCC
* PA-RISCs S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call,
compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier.
Ok, so I might have been a bit too optimistic...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
IMO it's a bug if S_UNLOCK is not a compiler barrier. https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables
regards
Didier
On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-09-20 16:47:24 +0200, Andres Freund wrote: >> I think we should go through the various implementations and make sure >> they are actual compiler barriers and then change the documented policy. > > From a quick look > * S_UNLOCK for PPC isn't a compiler barrier > * S_UNLOCK for MIPS isn't a compiler barrier > * I don't know enough about unixware (do we still support that as a > platform even) to judge > * True64 Alpha I have no clue about > * PA-RISCs tas() might not be a compiler barrier for !GCC > * PA-RISCs S_UNLOCK might not be a compiler barrier > * HP-UX !GCC might not > * IRIX 5 seems to be a compiler barrier > * SINIX - I don't care > * AIX PPC - compiler barrier > * Sun - TAS is implemented in external assembly, normal function call, > compiler barrier > * Win(32|64) - compiler barrier > * Generic S_UNLOCK *NOT* necessarily a compiler barrier. > > Ok, so I might have been a bit too optimistic... Yeah, it seems worth fixing, but it's not going to be a 5-minute project, I fear. But why do you think that this is not a compiler barrier (PPC): __asm__ __volatile__ (" sync \n"); \ *((volatile slock_t *) (lock)) = 0; \ Surely, the __asm__ __volatile__ must be a compiler barrier, but the compiler could presumably allow the store to the lock itself to move forward past other non-volatilized stores during optimization? Is that what you're concerned about? If so, that's easily fixed by sticking __asm__ __volatile__("":::"memory") in there. But some of the other cases are less clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-09-23 11:50:05 -0400, Robert Haas wrote: > On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-09-20 16:47:24 +0200, Andres Freund wrote: > >> I think we should go through the various implementations and make sure > >> they are actual compiler barriers and then change the documented policy. > > > > From a quick look > > * S_UNLOCK for PPC isn't a compiler barrier > > * S_UNLOCK for MIPS isn't a compiler barrier > > * I don't know enough about unixware (do we still support that as a > > platform even) to judge > > * True64 Alpha I have no clue about > > * PA-RISCs tas() might not be a compiler barrier for !GCC > > * PA-RISCs S_UNLOCK might not be a compiler barrier > > * HP-UX !GCC might not > > * IRIX 5 seems to be a compiler barrier > > * SINIX - I don't care > > * AIX PPC - compiler barrier > > * Sun - TAS is implemented in external assembly, normal function call, > > compiler barrier > > * Win(32|64) - compiler barrier > > * Generic S_UNLOCK *NOT* necessarily a compiler barrier. > > > > Ok, so I might have been a bit too optimistic... > > Yeah, it seems worth fixing, but it's not going to be a 5-minute > project, I fear. Yea :(. I think we should start by trimming the above list by removing some platforms: * SINIX - doesn't actually seem to be supported * Tru64 - not even a zombie anymore * IRIX - ... The others probably can't be removed? > But why do you think that this is not a compiler barrier (PPC): > > __asm__ __volatile__ (" sync \n"); \ > *((volatile slock_t *) (lock)) = 0; \ > > Surely, the __asm__ __volatile__ must be a compiler barrier, but the > compiler could presumably allow the store to the lock itself to move > forward past other non-volatilized stores during optimization? Is > that what you're concerned about? If so, that's easily fixed by > sticking __asm__ __volatile__("":::"memory") in there. Yes, the memory clobber is missing for PPC and MIPS. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
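For illustration, this is roughly what adding the missing clobber to the PPC S_UNLOCK quoted above could look like. It is only a sketch of the fix under discussion, not an actual committed change; slock_t is stubbed here and the real macro lives in s_lock.h.

    typedef unsigned int slock_t;   /* stub; the real type is platform-specific */

    /*
     * Adding the "memory" clobber makes the asm a compiler barrier in addition
     * to the CPU-level fence that "sync" already provides, so neither the
     * compiler nor the hardware can move protected stores past the release.
     */
    #define S_UNLOCK(lock) \
        do \
        { \
            __asm__ __volatile__ (" sync \n" ::: "memory"); \
            *((volatile slock_t *) (lock)) = 0; \
        } while (0)

Alternatively, the separate empty asm with a "memory" clobber that Robert suggests, placed next to the unlock store, would serve the same purpose as a pure compiler barrier without touching the fence instruction.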
Just some notes, before I forget them again. On 2013-09-23 11:50:05 -0400, Robert Haas wrote: > On Fri, Sep 20, 2013 at 11:11 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-09-20 16:47:24 +0200, Andres Freund wrote: > >> I think we should go through the various implementations and make sure > >> they are actual compiler barriers and then change the documented policy. > > > > From a quick look > > * S_UNLOCK for PPC isn't a compiler barrier > > * S_UNLOCK for MIPS isn't a compiler barrier Needs memory clobber looks easy enough given only gcc seems to be supported. > > * I don't know enough about unixware (do we still support that as a > > platform even) to judge Looks like another thing to remove? > > * PA-RISCs tas() might not be a compiler barrier for !GCC > > * PA-RISCs S_UNLOCK might not be a compiler barrier http://h21007.www2.hp.com/portal/download/files/unprot/ddk/mem_ordering_pa_ia.pdf _MF(); seems to work on both PA-RISC and itanic. > > * HP-UX !GCC might not The document linked from the source code explains how to implement it: http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf #define COMP_FENCE _Asm_fence(_UP_MEM_FENCE|_DOWN_MEM_FENCE) > > * Generic S_UNLOCK *NOT* necessarily a compiler barrier. Not sure how to handle that one. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 19.09.2013 16:24, Andres Freund wrote: > On 2013-09-19 14:40:35 +0200, Andres Freund wrote: >>>> * I think heap_lock_tuple() needs to unset all-visible, otherwise we >>>> won't vacuum that page again which can lead to problems since we >>>> don't do full-table vacuums again? >>> >>> It's OK if the page is never vacuumed again, the whole point of the patch is >>> that old XIDs can be just left lingering in the table. The next time the >>> page is updated, perhaps to lock a tuple again, the page will be frozen and >>> the old xmax will be cleared. >> >> Yes, you're right, it should be possible to make it work that way. But >> currently, we look at xmax and infomask of a tuple in heap_lock_tuple() >> *before* the PageUpdateNeedsFreezing() call. Currently we would thus >> create a new multixact with long dead xids as members and such. Ah, I see. > That reminds me of something: > > There are lots of checks sprayed around that unconditionally look at > xmin, xmax or infomask. > > Things I noticed in a quick look after thinking about the previous > statment: > * AlterEnum() looks at xmin > * The PLs look at xmin > * So does afair VACUUM FREEZE, COPY and such. > * Several HeapTupleSatisfiesVacuum() callers look at xmin and stuff > without checking for maturity or freezing the page first. > * lazy_scan_heap() itself if the page isn't already marked all_visible > * heap_page_is_all_visible() > * rewrite_heap_tuple() does an unconditional heap_freeze_tuple() without > considering maturity/passing in Invalid* > * heap_page_prune_opt() shouldn't do anything on a mature (or even all > visible) page. > * heap_page_prune() marks a buffer dirty, writes a new xid without > changing the lsn. > * heap_prune_chain() looks at tuples without considering page maturity, > but the current implementation actually looks safe. > * CheckForSerializableConflictOut() looks at xmins. > * pgrowlocks() looks at xmax unconditionally. > * heap_update() computes, just like heap_lock_tuple(), a new > xmax/infomask before freezing. > * Same for heap_lock_updated_tuple(). Not sure if that's an actual > concern, but it might if PageMatureLSN advances or such. > * heap_getsysattr() should probably return different things for mature > pages. > > There's probably more. Hmm, some of those are trivial, but others, rewrite_heap_tuple() are currently only passed the HeapTuple, with no reference to the page where the tuple came from. Instead of changing code to always pass that along with a tuple, I think we should add a boolean to HeapTupleData, to indicate if the tuple came from a mature page. If the flag is set, the tuple should be considered visible to everyone, without looking at the xmin/xmax. This might affect extensions, although usually external C code that have to deal with HeapTuples will use functions like heap_form_tuple to do so, and won't need to deal with low-level stuff or visibility themselves. Attached is a new version, which adds that field to HeapTupleData. Most of the issues on you listed above have been fixed, plus a bunch of other bugs I found myself. The bug that Jeff ran into with his count.pl script has also been fixed. - Heikki
On 18.09.2013 22:55, Jeff Janes wrote:
> On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> Here's a rebased version of the patch, including the above-mentioned
>> fixes. Nothing else new.
>
> I've applied this to 0892ecbc015930d, the last commit to which it applies
> cleanly.
>
> When I test this by repeatedly incrementing a counter in a randomly chosen
> row, then querying the whole table and comparing the results to what my
> driver knows they should be, I get discrepancies.

Ok, I found the bug. The problem was that when a HOT chain begins with a
dead tuple, when the page was frozen, the dead tuple was not removed, but
the xmin of the live tuple in the chain was replaced with FrozenXid. That
breaks the HOT-chain following code, which checks that the xmin of the
next tuple in the chain matches the xmax of the previous tuple.

I fixed that by simply not freezing a page which contains any dead
tuples. That's OK because the page will be visited by vacuum before it
becomes old enough to be "mature". However, it means that the invariant
that a page can only contain XIDs within one XID-LSN range, determined by
the LSN, is no longer true. AFAICS everything still works, but I would
feel more comfortable if we could uphold that invariant, for debugging
reasons if nothing else. Will have to give that some more thought...

Thanks for the testing! I just posted an updated version of the patch
elsewhere in this thread.

- Heikki
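[For readers unfamiliar with the chain-following code being referenced: the check is essentially the one below, paraphrased from the HOT-chain traversal in heapam.c (variable names approximate).]

/*
 * Each chain member's xmin must match the xmax of the tuple that points
 * to it; otherwise the chain is considered broken and traversal stops.
 */
if (TransactionIdIsValid(prev_xmax) &&
    !TransactionIdEquals(prev_xmax,
                         HeapTupleHeaderGetXmin(heapTuple->t_data)))
    break;                      /* chain broken */

/*
 * If freezing overwrites the live member's xmin with FrozenTransactionId
 * while the dead root tuple still carries the original xmax, this test
 * fails and the live tuple becomes unreachable through the chain.
 */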
On 9/25/13 5:31 AM, Heikki Linnakangas wrote:
> Attached is a new version, which adds that field to HeapTupleData. Most
> of the issues you listed above have been fixed, plus a bunch of other
> bugs I found myself. The bug that Jeff ran into with his count.pl script
> has also been fixed.

This patch doesn't apply at all:

Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 39 hunks FAILED -- saving rejects to file src/backend/access/heap/heapam.c.rej
5 out of 9 hunks FAILED -- saving rejects to file src/backend/access/heap/rewriteheap.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
10 out of 12 hunks FAILED -- saving rejects to file src/backend/access/transam/xlog.c.rej
13 out of 18 hunks FAILED -- saving rejects to file src/backend/commands/cluster.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
15 out of 17 hunks FAILED -- saving rejects to file src/backend/commands/vacuum.c.rej
5 out of 24 hunks FAILED -- saving rejects to file src/backend/commands/vacuumlazy.c.rej
3 out of 9 hunks FAILED -- saving rejects to file src/backend/postmaster/autovacuum.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 4 hunks FAILED -- saving rejects to file src/backend/utils/misc/guc.c.rej
Reversed (or previously applied) patch detected!  Assuming -R.
Reversed (or previously applied) patch detected!  Assuming -R.
1 out of 2 hunks FAILED -- saving rejects to file src/include/commands/cluster.h.rej
1 out of 1 hunk FAILED -- saving rejects to file src/include/commands/vacuum.h.rej
Reversed (or previously applied) patch detected!  Assuming -R.
On 25.09.2013 15:48, Peter Eisentraut wrote:
> On 9/25/13 5:31 AM, Heikki Linnakangas wrote:
>> Attached is a new version, which adds that field to HeapTupleData. Most
>> of the issues you listed above have been fixed, plus a bunch of other
>> bugs I found myself. The bug that Jeff ran into with his count.pl script
>> has also been fixed.
>
> This patch doesn't apply at all:

Huh, that's strange. Merged with latest changes from master, here's a new
version that applies.

- Heikki
Just found this in my drafts folder. Sorry for the late response.

On Fri, Sep 20, 2013 at 3:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I am entirely unconvinced that we need this. Some people use acquire
> + release fences, some people use read + write fences, and there are
> all combinations (read-acquire, read-release, read-acquire-release,
> write-acquire, write-release, write-acquire-release, both-acquire,
> both-release, both-acquire-release). I think we're going to have
> enough trouble deciding between the primitives we already have, let
> alone with a whole mess of new ones. Mistakes will only manifest
> themselves on certain platforms and in many cases the races are so
> tight that actual failures are very unlikely to be observed in
> regression testing.

I have to retract my proposal to try to emulate C11 atomics in C89. I
guess this goes to show why one shouldn't try to conjure up atomic APIs
at 2AM. I forgot the fact that while acquire-release semantics are enough
to ensure sequentially consistent behavior, the actual barriers need to
be paired with specific atomic accesses to achieve that. It's not
possible to implement this with freestanding acquire/release barriers,
nor would it be possible to include barriers in the atomic accesses
themselves without causing significant pessimization.

Sequential consistency (along with the causal transitivity and total
store ordering that come with it) should be regarded as a goal. I'm not
able to reason about concurrent programs without those guarantees, and I
suspect no one else is either. Sequential consistency is guaranteed if
atomic variables (including locks) are accessed with appropriate acquire
and release semantics. We just need to use a hodge-podge of
read/write/full barriers and volatile memory accesses to actually
implement the semantics until some distant future date when we can start
relying on compilers getting it right.

I still think we should have a macro for the volatile memory accesses. As
a rule, each one of those needs a memory barrier, and if we consolidate
them, or optimize them out, the considerations why this is safe should be
explained in a comment. Race-prone memory accesses should stick out like
a sore thumb.

> Personally, I think the biggest change that would help here is to
> mandate that spinlock operations serve as compiler fences. That would
> eliminate the need for scads of volatile references all over the
> place.

+1. The commits that you showed fixing issues in this area show quite
well why this is a good idea.

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 2013-10-01 04:47:42 +0300, Ants Aasma wrote:
> I still think we should have a macro for the volatile memory accesses.
> As a rule, each one of those needs a memory barrier, and if we
> consolidate them, or optimize them out, the considerations why this is
> safe should be explained in a comment. Race-prone memory accesses
> should stick out like a sore thumb.

Agreed. The "wait free LW_SHARED" thing [1] I posted recently had a simple

#define pg_atomic_read(atomic) (*(volatile uint32 *)&(atomic))

That should be sufficient and easily greppable, right?

I generally think we need to provide some primitives for doing atomic
stuff. There's lots of stuff that's not easy to accelerate further
without them.

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 1, 2013 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Agreed. The "wait free LW_SHARED" thing [1] I posted recently had a simple
>
> #define pg_atomic_read(atomic) (*(volatile uint32 *)&(atomic))
>
> That should be sufficient and easily greppable, right?

Looks good enough for me. I would consider using a naming scheme that
accounts for possible future uint64 atomics.

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
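[A sketch of what a width-aware naming scheme along those lines could look like; these exact names are invented here for illustration, not taken from any posted patch.]

/* Hypothetical width-suffixed variants of the read macro. */
#define pg_atomic_read_u32(atomic) (*(volatile uint32 *) &(atomic))
#define pg_atomic_read_u64(atomic) (*(volatile uint64 *) &(atomic))

/*
 * Note: on 32-bit platforms a plain volatile 64-bit load is not
 * necessarily a single atomic access, so the u64 variant would need a
 * real platform-specific implementation rather than this simple cast.
 */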
On 2013-09-25 12:31:20 +0300, Heikki Linnakangas wrote:
> Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are
> currently only passed the HeapTuple, with no reference to the page where the
> tuple came from. Instead of changing code to always pass that along with a
> tuple, I think we should add a boolean to HeapTupleData, to indicate if the
> tuple came from a mature page. If the flag is set, the tuple should be
> considered visible to everyone, without looking at the xmin/xmax. This might
> affect extensions, although usually external C code that has to deal with
> HeapTuples will use functions like heap_form_tuple to do so, and won't need
> to deal with low-level stuff or visibility themselves.
>
> Attached is a new version, which adds that field to HeapTupleData. Most of
> the issues you listed above have been fixed, plus a bunch of other bugs I
> found myself. The bug that Jeff ran into with his count.pl script has also
> been fixed.

This seems a bit hacky to me. Changing a widely used struct because a few
functions don't get passed enough information? Visibility checks are
properly done with a buffer passed along; that some places have cut
corners around that doesn't mean we have to continue to do so. The
pullups around rs_pagemature are imo indicative of this (there's even a
bug because a page can be all-visible before it's mature, right? That
could then cause an assert somewhere down the line when we check that
page and tuple are coherent).

Ok, making another scan through this:

* Why can we do PageSetLSN(page, RangeSwitchLSN) in heap_* when the
  action doesn't need WAL? And why is that correct? I can see doing that
  in vacuum itself, but doing it when we write *new* data to the page?
* heap_freeze_page(): The PageSetLSN(page, RangeSwitchLSN) if there's
  nothing to log is not WAL logged. Which means that if we crash it won't
  necessarily be set, so the VM and the heap LSN might get out of sync.
  That probably doesn't have any bad effects, but imo deserves a comment.
* heap_freeze_page(): PageUpdateNeedsFreezing() can now be true before
  and after. That's a bit confusing :/
* GetSafeClogTruncationPoint truncates the xid-lsn ranges, but there's
  also an, uncommented, TruncateXidLSNRanges. At least how they work
  together should be described better.
* There are quite some FIXMEs around.
* Let's move the xid-lsn ranges code from GetNewTransactionId() into its
  own function.
* PageMatureLSN and RangeSwitchLSN should be documented somewhere. They
  are implicitly, but they are used widely enough that that doesn't seem
  sufficient.
* pg_controldata should output pageMatureLSN.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, 2013-09-25 at 12:31 +0300, Heikki Linnakangas wrote:
> On 19.09.2013 16:24, Andres Freund wrote:
> ...
> > There's probably more.

I think _bt_check_unique is also a problem.

> Hmm, some of those are trivial, but others, like rewrite_heap_tuple(), are
> currently only passed the HeapTuple, with no reference to the page where
> the tuple came from. Instead of changing code to always pass that along
> with a tuple, I think we should add a boolean to HeapTupleData, to
> indicate if the tuple came from a mature page. If the flag is set, the
> tuple should be considered visible to everyone, without looking at the
> xmin/xmax. This might affect extensions, although usually external C
> code that has to deal with HeapTuples will use functions like
> heap_form_tuple to do so, and won't need to deal with low-level stuff or
> visibility themselves.

How bad would the code be to just pass along the buffer when it's needed?
After looking around, it doesn't seem necessarily worse than adding the
struct field (and I agree with Andres that it seems more proper to pass
the buffer along).

I also have a separate question about this patch: it seems like this gets
us closer to 64-bit transaction IDs. Can we get some help from the
compiler to distinguish the two cases in a less error-prone way? Is it
worth thinking about making 64-bit the norm, and 32-bit an optimization
in certain places (e.g. the heap and maybe the xip array of a snapshot)?

Regards,
Jeff Davis
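[For concreteness, the buffer-passing alternative could look roughly like this for rewrite_heap_tuple(). The first declaration approximates the existing one (parameter names from memory); the second, with the extra Buffer argument, is purely hypothetical.]

/* Roughly the existing declaration (src/include/access/rewriteheap.h): */
extern void rewrite_heap_tuple(RewriteState state,
                               HeapTuple oldTuple, HeapTuple newTuple);

/*
 * Hypothetical variant that passes the originating buffer along, so the
 * function can check page maturity itself instead of relying on a flag
 * stashed in HeapTupleData:
 */
extern void rewrite_heap_tuple(RewriteState state, Buffer oldBuffer,
                               HeapTuple oldTuple, HeapTuple newTuple);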
Shouldn't this patch be in the January commitfest?

--
Peter Geoghegan
On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
> Shouldn't this patch be in the January commitfest?

I think we previously concluded that there wasn't much chance to get this
into 9.4, and there's significant work to be done on the patch before new
reviews are required, so not submitting it imo makes sense.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>> Shouldn't this patch be in the January commitfest?
>
> I think we previously concluded that there wasn't much chance to get
> this into 9.4 and there's significant work to be done on the patch
> before new reviews are required, so not submitting it imo makes sense.

I think we should make this a priority feature for 9.5.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>>> Shouldn't this patch be in the January commitfest?
>>
>> I think we previously concluded that there wasn't much chance to get
>> this into 9.4 and there's significant work to be done on the patch
>> before new reviews are required, so not submitting it imo makes sense.
>
> I think we should make this a priority feature for 9.5.

+1. I can't think of many things we might do that would be more important.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, January 27, 2014, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
>>>> Shouldn't this patch be in the January commitfest?
>>>
>>> I think we previously concluded that there wasn't much chance to get
>>> this into 9.4 and there's significant work to be done on the patch
>>> before new reviews are required, so not submitting it imo makes sense.
>>
>> I think we should make this a priority feature for 9.5.
>
> +1. I can't think of many things we might do that would be more important.
Can anyone guess how likely this approach is to make it into 9.5? I've been pondering some incremental improvements over what we have now, but if this revolutionary approach has a high chance of landing then any work on incremental improvements would be pointless.
Thanks,
Jeff
On Wed, May 14, 2014 at 05:46:49PM -0700, Jeff Janes wrote:
> On Monday, January 27, 2014, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Jan 27, 2014 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > > On 26 January 2014 12:58, Andres Freund <andres@2ndquadrant.com> wrote:
> > >> On 2014-01-25 20:26:16 -0800, Peter Geoghegan wrote:
> > >>> Shouldn't this patch be in the January commitfest?
> > >>
> > >> I think we previously concluded that there wasn't much chance to get
> > >> this into 9.4 and there's significant work to be done on the patch
> > >> before new reviews are required, so not submitting it imo makes sense.
> > >
> > > I think we should make this a priority feature for 9.5
> >
> > +1. I can't think of many things we might do that would be more important.
>
> Can anyone guess how likely this approach is to make it into 9.5?
> I've been pondering some incremental improvements over what we have
> now, but if this revolutionary approach has a high chance of landing
> then any work on incremental improvements would be pointless.

How much work do you anticipate for the description and PoC of whatever
it is you've been pondering?

Cheers,
David.

--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter  Skype: davidfetter
XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, May 14, 2014 at 8:46 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> +1. I can't think of many things we might do that would be more
>> important.
>
> Can anyone guess how likely this approach is to make it into 9.5? I've been
> pondering some incremental improvements over what we have now, but if this
> revolutionary approach has a high chance of landing then any work on
> incremental improvements would be pointless.

Well, Heikki was saying on another thread that he had kind of gotten cold
feet about this, so I gather he's not planning to pursue it. Not sure if I
understood that correctly. If so, I guess it depends on whether someone
else can pick it up, but we might first want to establish why he got cold
feet and how worrying those problems seem to other people.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company