Thread: How much do the hint bits help?
I've been playing around with PostgreSQL hint bits in order to teach myself more about the internals of the MVCC system. I noticed that the hint bit system has been around forever (Vadim era) and predates several backend improvements that might affect its usefulness. So I started playing around, trying to quantify the benefit hint bits provide, with an eye toward optimizing clog lookups if that turned out to be necessary, say by mmap-ing a big transaction status file just to see if that helped.

Attached is an incomplete patch disabling hint bits based on a compile switch. It's not complete; for example, it's not reconciling some assumptions in heapam.c that hint bits have been set in various routines. However, it mostly passes regression and I deemed it good enough to run some preliminary benchmarks and fool around. Obviously, hint bits are an annoying impediment to a couple of other cool pending features, and it certainly would be nice to operate without them. Also, for particular workloads, the extra I/O that hint bits cause can be a fair amount of pain.

So far, at least doing pgbench runs and another test designed to exercise clog lookups, the performance loss of always doing the full lookup hasn't materialized. Note that in these cases the clog LRU cache is pretty effective, and it's pretty likely I may have blown it in some other way, so take the results with a grain of salt. But here are my questions/points:

*) Relative to when the hint bits were implemented, the number of transactions to map has shrunk, while hardware has improved by a couple of orders of magnitude. Also, the postgres architecture has changed considerably. Are they still necessary?
*) What's a good way to stress the clog severely? I'd like to pick a degenerate case to get a better idea of the way things stand without them.
*) Is there community interest in a full patch that fills in the missing details not implemented here?

merlin
Merlin Moncure <mmoncure@gmail.com> wrote: > *) what's a good way to stress the clog severely? I'd like to pick > a degenerate case to get a better idea of the way things stand > without them. The worst I can think of is a large database with a 90/10 mix of reads to writes -- all short transactions. Maybe someone else can do better. In particular, I'm not sure how savepoints might play into a degenerate case. Since we're always talking about how to do better with hint bits during an unlogged bulk load, it would be interesting to benchmark one of those followed by a `select count(*) from newtable;` with and without the patch, on a data set too big to fit in RAM. > *) is there community interest in a full patch that fills in the > missing details not implemented here? I'm certainly curious to see real numbers. -Kevin
On 22/12/10 11:42, Merlin Moncure wrote:
> Attached is an incomplete patch disabling hint bits based on compile
> switch. It's not complete, for example it's not reconciling some
> assumptions in heapam.c that hint bits have been set in various
> routines. However, it mostly passes regression and I deemed it good
> enough to run some preliminary benchmarks and fool around. Obviously,
> hint bits are an annoying impediment to a couple of other cool pending
> features, and it certainly would be nice to operate without them.
> Also, for particular workloads, the extra i/o hint bits can cause a
> fair amount of pain.

Looks like a great idea to test, however I don't seem to be able to compile with it applied (set #define DISABLE_HINT_BITS 1 at the end of src/include/pg_config_manual.h):

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I../../../../src/include -D_GNU_SOURCE -c -o heapam.o heapam.c
heapam.c: In function ‘HeapTupleHeaderAdvanceLatestRemovedXid’:
heapam.c:3867: error: ‘HEAP_XMIN_COMMITTED’ undeclared (first use in this function)
heapam.c:3867: error: (Each undeclared identifier is reported only once
heapam.c:3867: error: for each function it appears in.)
heapam.c:3869: error: ‘HEAP_XMIN_INVALID’ undeclared (first use in this function)
make[4]: *** [heapam.o] Error 1
On 22/12/10 13:05, Mark Kirkwood wrote:
> Looks like a great idea to test, however I don't seem to be able to
> compile with it applied (set #define DISABLE_HINT_BITS 1 at the end of
> src/include/pg_config_manual.h)
>
> heapam.c: In function ‘HeapTupleHeaderAdvanceLatestRemovedXid’:
> heapam.c:3867: error: ‘HEAP_XMIN_COMMITTED’ undeclared (first use in
> this function)
> heapam.c:3869: error: ‘HEAP_XMIN_INVALID’ undeclared (first use in
> this function)
> make[4]: *** [heapam.o] Error 1

Arrg, sorry - against git head on Ubuntu 10.03 (gcc 4.4.3)
On Tue, Dec 21, 2010 at 7:06 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>> heapam.c: In function ‘HeapTupleHeaderAdvanceLatestRemovedXid’:
>> heapam.c:3867: error: ‘HEAP_XMIN_COMMITTED’ undeclared (first use in this
>> function)
>> heapam.c:3869: error: ‘HEAP_XMIN_INVALID’ undeclared (first use in this
>> function)
>> make[4]: *** [heapam.o] Error 1
>
> Arrg, sorry - against git head on Ubuntu 10.03 (gcc 4.4.3)

did you check to see if the patch applied clean? btw I was working against postgresql-9.0.1...

it looks like you are missing at least some of the changes to htup.h:

../postgresql-9.0.1_hb2/src/include/access/htup.h

    #ifndef DISABLE_HINT_BITS
    #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
    #define HEAP_XMIN_INVALID   0x0200 /* t_xmin invalid/aborted */
    #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
    #define HEAP_XMAX_INVALID   0x0800 /* t_xmax invalid/aborted */
    #endif

merlin
On Tue, Dec 21, 2010 at 7:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Dec 21, 2010 at 7:06 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> Arrg, sorry - against git head on Ubuntu 10.03 (gcc 4.4.3)
>
> did you check to see if the patch applied clean? btw I was working
> against postgresql-9.0.1...

ah, this is the problem (9.0.1 vs head). to work vs head it probably needs a few more tweaks. you can also try removing it yourself -- most of the changes follow a similar pattern.

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > Attached is an incomplete patch disabling hint bits based on compile > switch. ... > So far, at least doing pgbench runs and another test designed to > exercise clog lookups, the performance loss of always doing full > lookup hasn't materialized. The standard pgbench test would be just about 100% useless for stressing this, because its net database activity is only about one row touched/updated per query. You need a test case that hits lots of rows per query, else you're just measuring parse+plan+network overhead. regards, tom lane
On Tue, Dec 21, 2010 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Attached is an incomplete patch disabling hint bits based on compile
>> switch. ...
>> So far, at least doing pgbench runs and another test designed to
>> exercise clog lookups, the performance loss of always doing full
>> lookup hasn't materialized.
>
> The standard pgbench test would be just about 100% useless for stressing
> this, because its net database activity is only about one row
> touched/updated per query. You need a test case that hits lots of rows
> per query, else you're just measuring parse+plan+network overhead.

right -- see the attached clog_stress.sql above. It creates a script that inserts records in blocks of 10000, deletes half of them, and vacuums. Neither the execution of the script nor a seq scan following its execution showed an interesting performance difference (which I am arbitrarily calling 5% in either direction). Like I said though, I don't trust the patch or the results yet.

@Mark: apparently the cvs server is behind git and there are some recent changes to heapam.c that need more attention. I need to get git going on my box, but try changing this:

    if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
        (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
         !(tuple->t_infomask & HEAP_XMIN_INVALID) &&
         TransactionIdDidCommit(xmin)))

to this:

    if (TransactionIdDidCommit(xmin))

also, isn't the extra check vs HEAP_XMIN_COMMITTED redundant, and if you do have to look up clog, why not set the hint bit?

merlin
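On the redundancy question raised in the message above, a small exhaustive check may help. The sketch below abstracts the two infomask tests and the clog lookup into plain booleans (the function and parameter names are made up for illustration and are not PostgreSQL code) and compares the condition as quoted with the same condition minus the inner !HEAP_XMIN_COMMITTED test.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as quoted from heapam.c, with each term reduced to a boolean.
 * hint_committed ~ (t_infomask & HEAP_XMIN_COMMITTED)
 * hint_invalid   ~ (t_infomask & HEAP_XMIN_INVALID)
 * clog_committed ~ TransactionIdDidCommit(xmin)
 * These parameter names are illustrative only. */
static bool xmin_committed_original(bool hint_committed, bool hint_invalid,
                                    bool clog_committed)
{
    return hint_committed ||
           (!hint_committed && !hint_invalid && clog_committed);
}

/* Same condition with the inner !hint_committed test dropped. */
static bool xmin_committed_simplified(bool hint_committed, bool hint_invalid,
                                      bool clog_committed)
{
    return hint_committed || (!hint_invalid && clog_committed);
}

/* Walks all 8 input combinations and counts where the two forms differ. */
static int count_disagreements(void)
{
    int disagreements = 0;

    for (int bits = 0; bits < 8; bits++)
    {
        bool hc = (bits & 1) != 0;
        bool hi = (bits & 2) != 0;
        bool cc = (bits & 4) != 0;

        assert(bits >= 0 && bits < 8);
        if (xmin_committed_original(hc, hi, cc) !=
            xmin_committed_simplified(hc, hi, cc))
            disagreements++;
    }
    return disagreements;
}
```

Calling count_disagreements() returns 0, so the inner test is logically redundant: once the left side of the || is false, !hint_committed is already known to be true. Note that this only covers the boolean structure; replacing the whole condition with the clog lookup alone is a different change, since it ignores both hint bits.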
On 22/12/10 13:56, Merlin Moncure wrote:
> @Mark: apparently the cvs server is behind git and there are some
> recent changes to heapam.c that need more attention. I need to get
> git going on my box, but try changing this:
>
>     if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
>         (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
>          !(tuple->t_infomask & HEAP_XMIN_INVALID) &&
>          TransactionIdDidCommit(xmin)))
>
> to this:
>
>     if (TransactionIdDidCommit(xmin))
>
> also, isn't the extra check vs HEAP_XMIN_COMMITTED redundant, and if
> you do have to look up clog, why not set the hint bit?

That gets it compiling.
On 22.12.2010 02:56, Merlin Moncure wrote: > On Tue, Dec 21, 2010 at 7:45 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Merlin Moncure<mmoncure@gmail.com> writes: >>> Attached is an incomplete patch disabling hint bits based on compile >>> switch. ... >>> So far, at least doing pgbench runs and another test designed to >>> exercise clog lookups, the performance loss of always doing full >>> lookup hasn't materialized. >> >> The standard pgbench test would be just about 100% useless for stressing >> this, because its net database activity is only about one row >> touched/updated per query. You need a test case that hits lots of rows >> per query, else you're just measuring parse+plan+network overhead. > > right -- see the attached clog_stress.sql above. It creates a script > that inserts records in blocks of 10000, deletes half of them, and > vacuums. Neither the execution of the script nor a seq scan following > its execution showed an interesting performance difference (which I am > arbitrarily calling 5% in either direction). Like I said though, I > don't trust the patch or the results yet. Make sure you have a good mix of different xids in the table, TransactionLogFetch has a one-item cache so repeatedly checking the same xid is much faster than the general case. Perhaps run pgbench for a while, and then do "SELECT COUNT(*)" on the resulting tables. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
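To illustrate why the mix of xids matters for such a benchmark, here is a toy model of a one-item status cache in front of an expensive lookup. It is a sketch only: Xid, slow_status_lookup, and xid_did_commit are hypothetical stand-ins, not the real TransactionLogFetch code, and the "even xids committed" rule is invented so the example has something to return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;           /* hypothetical stand-in for a transaction id */

static int slow_lookups = 0;    /* counts simulated trips to the status log */

/* Stand-in for the expensive path; pretends that even xids committed. */
static bool slow_status_lookup(Xid xid)
{
    slow_lookups++;
    return (xid % 2) == 0;
}

static Xid  cached_xid = 0;     /* 0 means "nothing cached" in this toy */
static bool cached_status = false;

/* One-item cache: only the most recently looked-up xid is remembered. */
static bool xid_did_commit(Xid xid)
{
    assert(xid != 0);           /* 0 is reserved as the empty-cache marker */
    if (xid == cached_xid)
        return cached_status;
    cached_status = slow_status_lookup(xid);
    cached_xid = xid;
    return cached_status;
}

/* Checks every xid in the array; returns how many slow lookups it cost. */
static int scan(const Xid *xids, int n)
{
    int before = slow_lookups;

    for (int i = 0; i < n; i++)
        (void) xid_did_commit(xids[i]);
    return slow_lookups - before;
}
```

In this model, scanning six tuples that all carry the same xid costs one slow lookup, while six tuples alternating between two xids cost six. A table loaded by a handful of large transactions therefore says little about the general case.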
On Tue, 2010-12-21 at 17:42 -0500, Merlin Moncure wrote:
> *) is there community interest in a full patch that fills in the
> missing details not implemented here?

Your thinking seems sound to me. We now have all-visible flags, fewer xids, much better clog concurrency. Avoiding hint bits would also noticeably reduce the number of dirty writes, especially at checkpoint.

Hot Standby already ignores hint bits and I've not heard a single complaint, so we are already doing this in the code.

I don't see any reason to believe that there is not an equally effective optimisation that we can apply to bring performance back up, if it is shown to drop in particular use cases.

I would vote to put this into 9.1 as a non-default option at restart, opening the door to other features which hint bits are frustrating. People can then choose between those features and the "power of hint bits". I think many people would choose db block checksums.

If you need support, or direct help with the code, just ask. Am happy to be your committer also.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 22.12.2010 15:21, Simon Riggs wrote:
> On Tue, 2010-12-21 at 17:42 -0500, Merlin Moncure wrote:
>
>> *) is there community interest in a full patch that fills in the
>> missing details not implemented here?
>
> Your thinking seems sound to me. We now have all-visible flags, fewer
> xids, much better clog concurrency. Avoiding hint bits would also
> noticeably reduce number of dirty writes, especially at checkpoint.

Yep.

> Hot Standby already ignores hint bits and I've not heard a single
> complaint, so we are already doing this in the code.

No, the XMIN/XMAX committed/invalid hint bits on each heap tuple are used during hot standby just like during normal operation. We ignore the index tuples marked as dead during hot standby, but that's a different issue.

> I would vote to put this into 9.1 as a non-default option at restart,
> opening the door to other features which hint bits are frustrating.
> People can then choose between those features and the "power of hint
> bits". I think many people would choose db block checksums.

Making it optional would add some ifs in the critical paths, possibly making it slower.

My gut feeling is that a reasonable compromise is to set hint bits like we do today, but don't mark the page as dirty when only hint bits are set. That way you get the benefit of hint bits for tuples that are frequently accessed and stay in buffer cache. But you don't spend any extra I/O to set them. I'd really like to see a worst-case scenario benchmark of a patch that does that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
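The compromise proposed in the message above (set the hint bit in memory, but do not dirty the page for it) can be pictured with a toy buffer. Everything here is a simplified assumption for illustration: ToyBuffer, its fields, and the three functions are invented names, and a real buffer manager has far more state. Only the 0x0100 flag value is taken from the htup.h excerpt earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HINT_XMIN_COMMITTED 0x0100  /* same value as HEAP_XMIN_COMMITTED above */

/* Toy one-tuple buffer; field names are hypothetical. */
typedef struct
{
    uint16_t infomask;  /* tuple flag bits, including hint bits */
    bool     dirty;     /* must this page be written at checkpoint? */
    int      writes;    /* how many times the page went to disk */
} ToyBuffer;

/* Proposed behaviour: remember the hint in memory, leave dirty untouched. */
static void set_hint_no_dirty(ToyBuffer *buf)
{
    assert(buf != 0);
    buf->infomask |= HINT_XMIN_COMMITTED;
}

/* A real (WAL-logged) modification still dirties the page as usual. */
static void modify_tuple(ToyBuffer *buf)
{
    assert(buf != 0);
    buf->dirty = true;
}

/* Checkpoint writes only dirty pages; hint bits ride along if present. */
static void checkpoint(ToyBuffer *buf)
{
    assert(buf != 0);
    if (buf->dirty)
    {
        buf->writes++;
        buf->dirty = false;
    }
}
```

In this model a page that only ever received hint bits is never written, so the hint is lost if the page is evicted and has to be recomputed from clog on the next read. That is the trade the proposal makes: no extra I/O, at the price of hints surviving only while the page stays cached or is written for some other reason.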
On Wed, 2010-12-22 at 15:30 +0200, Heikki Linnakangas wrote:
> > I would vote to put this into 9.1 as a non-default option at restart,
> > opening the door to other features which hint bits are frustrating.
> > People can then choose between those features and the "power of hint
> > bits". I think many people would choose db block checksums.
>
> Making it optional would add some ifs in the critical paths, possibly
> making it slower.

Hardly. A server-start parameter is going to be constant during execution and branch prediction will just snuff that away to nothing.

> My gut feeling is that a reasonable compromise is to set hint bits like
> we do today, but don't mark the page as dirty when only hint bits are
> set. That way you get the benefit of hint bits for tuples that are
> frequently accessed and stay in buffer cache. But you don't spend any
> extra I/O to set them. I'd really like to see a worst-case scenario
> benchmark of a patch that does that.

That sounds great, but still prevents block checksums and that is a very valuable feature for robustness. This isn't a discussion about hint bits, it's a discussion about opening the way for other features. ISTM there are other ways of optimising any clog issues that may remain, so clutching to this ancient optimisation has no further benefit for me.

Merlin's idea seems to me to be original, useful *and* reasonable.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 22.12.2010 15:59, Simon Riggs wrote: > On Wed, 2010-12-22 at 15:30 +0200, Heikki Linnakangas wrote: >> My gut feeling is that a reasonable compromise is to set hint bits like >> we do today, but don't mark the page as dirty when only hint bits are >> set. That way you get the benefit of hint bits for tuples that are >> frequently accessed and stay in buffer cache. But you don't spend any >> extra I/O to set them. I'd really like to see a worst-case scenario >> benchmark of a patch that does that. > > That sounds great, but still prevents block checksums and that is a very > valuable feature for robustness. It does? The problem with block checksums is that if you modify a page and don't have a corresponding WAL record for it, like a hint bit update, you can have a torn page so that the checksum doesn't match. Refraining from dirtying the page when a hint bit is updated avoids the problem. With that change, we only ever write pages to disk that have a WAL record associated with it, with full-page images as necessary to avoid torn pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-12-22 at 16:22 +0200, Heikki Linnakangas wrote:
> On 22.12.2010 15:59, Simon Riggs wrote:
> > That sounds great, but still prevents block checksums and that is a very
> > valuable feature for robustness.
>
> It does? The problem with block checksums is that if you modify a page
> and don't have a corresponding WAL record for it, like a hint bit
> update, you can have a torn page so that the checksum doesn't match.
> Refraining from dirtying the page when a hint bit is updated avoids the
> problem. With that change, we only ever write pages to disk that have a
> WAL record associated with it, with full-page images as necessary to
> avoid torn pages.

Which then leads to a block CRC not matching the block in memory.

Sure, we can avoid CRC checking the hint bits, but that requires a much more expensive and complex CRC check.

So what you suggest works only if we restrict CRC checking to blocks incoming to the buffer cache, but leaves us unable to do CRC checks on blocks once in the buffer cache. Since many blocks stay in cache almost constantly, we're left with the situation that the most heavily used parts of the database seldom get CRC checked.

Postgres needs CRC checking more than it needs hint bits.

I think we should allow this as an option, and if it proves to be an issue during beta then we can remove it before we go live, assuming we cannot get a reasonable alternate optimisation.

I think it's important for Postgres to implement this in the same release as sync rep. They complement each other: confirmed robustness. Exactly the features we need to prove to the rest of the world to trust us with their data.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Dec 22, 2010 at 9:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I think its important for Postgres to implement this in the same release > as sync rep. i.e. never, at the rate sync rep has been progressing for the last few months? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22.12.2010 16:52, Simon Riggs wrote: > On Wed, 2010-12-22 at 16:22 +0200, Heikki Linnakangas wrote: >> On 22.12.2010 15:59, Simon Riggs wrote: >>> On Wed, 2010-12-22 at 15:30 +0200, Heikki Linnakangas wrote: >>>> My gut feeling is that a reasonable compromise is to set hint bits like >>>> we do today, but don't mark the page as dirty when only hint bits are >>>> set. That way you get the benefit of hint bits for tuples that are >>>> frequently accessed and stay in buffer cache. But you don't spend any >>>> extra I/O to set them. I'd really like to see a worst-case scenario >>>> benchmark of a patch that does that. >>> >>> That sounds great, but still prevents block checksums and that is a very >>> valuable feature for robustness. >> >> It does? The problem with block checksums is that if you modify a page >> and don't have a corresponding WAL record for it, like a hint bit >> update, you can have a torn page so that the checksum doesn't match. >> Refraining from dirtying the page when a hint bit is updated avoids the >> problem. With that change, we only ever write pages to disk that have a >> WAL record associated with it, with full-page images as necessary to >> avoid torn pages. > > Which then leads to a block CRC not matching the block in memory. What do you mean? Do you envision that the CRC is calculated at every update, or only when a page is written out from the buffer cache? If the former, you could recalculate the CRC at a hint bit update too. If the latter, the hint bits are included in the page image that you checksum just like any other data. > So what you suggest works only if we restrict CRC checking to blocks > incoming to the buffer cache, but leaves us unable to do CRC checks on > blocks once in the buffer cache. Since many blocks stay in cache almost > constantly, we're left with the situation that the most heavily used > parts of the database seldom get CRC checked. 
There's plenty of stuff in memory that's not covered by an application-level CRC. That's what ECC RAM is for. Updating the CRC at every update to a page seems really expensive, but it's an orthogonal issue to hint bits. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 22, 2010 at 9:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> So what you suggest works only if we restrict CRC checking to blocks
> incoming to the buffer cache, but leaves us unable to do CRC checks on
> blocks once in the buffer cache. Since many blocks stay in cache almost
> constantly, we're left with the situation that the most heavily used
> parts of the database seldom get CRC checked.

With this statement, you just moved the goal posts on the checksumming ideas. In fact, you didn't just move the goal posts, you picked the ball up and teleported it to another stadium.

I believe that most of the people talking about and wanting checksums so far have been wanting them to verify I/O, not to verify that PG has no bugs, that RAM is staying charged correctly, that no stray bits have been flipped, and that nobody else happens to be scribbling over our shared buffers.

Being able to arbitrarily (i.e., at any point in time) prove that the shared buffers contents are exactly what they should be may be a worthy goal, but that's many orders of magnitude more difficult than verifying that the bytes we read from disk are the ones we wrote to disk.

a.

--
Aidan Van Dyk <aidan@highrise.ca> http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
On Wed, 2010-12-22 at 17:01 +0200, Heikki Linnakangas wrote:
> On 22.12.2010 16:52, Simon Riggs wrote:
> > Which then leads to a block CRC not matching the block in memory.
>
> Do you envision that the CRC is calculated at every update, or only when
> a page is written out from the buffer cache?

At every update, so there is a clear assertion that the CRC matches the block.

> If the former, you could
> recalculate the CRC at a hint bit update too. If the latter, the hint
> bits are included in the page image that you checksum just like any
> other data.

If we didn't have hint bits, we wouldn't need to recalculate the CRC each time one was updated...

> > So what you suggest works only if we restrict CRC checking to blocks
> > incoming to the buffer cache, but leaves us unable to do CRC checks on
> > blocks once in the buffer cache. Since many blocks stay in cache almost
> > constantly, we're left with the situation that the most heavily used
> > parts of the database seldom get CRC checked.
>
> There's plenty of stuff in memory that's not covered by an
> application-level CRC. That's what ECC RAM is for.

http://www.google.com/research/pubs/archive/35162.pdf

Google research shows that each DIMM has an 8% chance per annum of uncorrectable memory errors, even on ECC. If you have large RAM, like everybody now does, your incidence of this type of error will be much higher than it was in previous years, so our perception of what is necessary now to protect databases is out of date. We have data under our care, and will be much more likely to receive this kind of error because of the amount of RAM we use.

> Updating the CRC at
> every update to a page seems really expensive, but it's an orthogonal
> issue to hint bits.

Clearly, the frequency with which we set hint bits affects the frequency with which we can sensibly update CRCs.

It shouldn't be up to us to decide how much protection a user wants to give their data. There might be two or three settings that make sense, but clearly we need to be able to limit hint-bit setting to allow us to have a usable CRC check. So there is a very strong connection between turning this optimisation off and gaining CRC checking as a feature.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 22.12.2010 17:31, Simon Riggs wrote:
> On Wed, 2010-12-22 at 17:01 +0200, Heikki Linnakangas wrote:
>> Do you envision that the CRC is calculated at every update, or only when
>> a page is written out from the buffer cache?
>
> At every update, so there is a clear assertion that the CRC matches the
> block.

Umm, when do you check the CRC? Every time the page is locked? Every time it's updated? If you don't verify the CRC, what is it good for?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 22.12.2010 17:31, Simon Riggs wrote:
> On Wed, 2010-12-22 at 17:01 +0200, Heikki Linnakangas wrote:
>> There's plenty of stuff in memory that's not covered by an
>> application-level CRC. That's what ECC RAM is for.
>
> http://www.google.com/research/pubs/archive/35162.pdf
>
> Google research shows that each DIMM has an 8% chance per annum of
> uncorrectable memory errors, even on ECC.

You misread that paper. From summary:

> About a third of machines and over 8% of DIMMs in
> our fleet saw at least one *correctable* error per year.

Emphasis mine.

> Our per-DIMM rates of correctable errors translate to an average
> of 25,000–75,000 FIT (failures in time per billion hours
> of operation) per Mbit and a median FIT range of 778 –
> 25,000 per Mbit (median for DIMMs with errors), while previous
> studies report 200-5,000 FIT per Mbit. The number of
> correctable errors per DIMM is highly variable, with some
> DIMMs experiencing a huge number of errors, compared to
> others. The annual incidence of uncorrectable errors was
> 1.3% per machine and 0.22% per DIMM.

So the real figure of uncorrectable errors is 0.22% per DIMM.

Anyway, unreliable RAM calls for more ECC bits in DIMMs, not invasive architectural changes to every single application in the system.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
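To put the two per-DIMM figures under dispute (8% versus 0.22%) on a per-machine footing, a rough calculation can help. It rests on a simplifying assumption that is not from the paper: that each DIMM fails independently with the same annual probability p. The quoted summary itself says error counts vary widely between DIMMs, so treat this as an order-of-magnitude aid only. The DIMM count n = 8 is likewise a made-up example value.

```latex
% Probability that a machine with n DIMMs sees at least one
% uncorrectable error in a year, assuming independent DIMMs
% each with annual probability p (an assumption, see above).
\[
  P(\text{at least one}) = 1 - (1 - p)^{n}
\]
% Example with the 0.22% per-DIMM figure and a hypothetical n = 8:
\[
  1 - (1 - 0.0022)^{8} \approx 0.0175 \quad (\text{about } 1.7\%)
\]
```

For comparison, the summary quoted above reports a measured 1.3% per machine per year, which is of the same order as this estimate.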
Aidan Van Dyk <aidan@highrise.ca> writes: > With this statement, you just moved the goal posts on the checksumming > ideas. In fact, you didn't just move the goal posts, you picked the > ball up and teleported it to another stadium. What he said. I can't imagine that anyone will be interested in any case other than "set the CRC immediately before writing, and check it upon first reading the page in". Maintaining it continuously while the page is in shared memory is completely insane from a cost-versus-benefit perspective. regards, tom lane
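The "set the CRC immediately before writing, and check it upon first reading the page in" scheme, together with the torn-page hazard discussed earlier in the thread, can be sketched as follows. This is an illustration under stated assumptions, not PostgreSQL code: DiskPage, write_page, read_page_ok, and torn_write are invented names, the checksum is stored outside the page data for simplicity, and FNV-1a is used as a stand-in hash rather than a real CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192

/* FNV-1a, 32-bit: a simple stand-in for a real page CRC. */
static uint32_t page_checksum(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Toy on-disk page: checksum kept next to the data. */
typedef struct
{
    uint32_t checksum;
    uint8_t  data[TOY_PAGE_SIZE];
} DiskPage;

/* Checksum is computed only at write-out time. */
static void write_page(DiskPage *disk, const uint8_t *mem)
{
    disk->checksum = page_checksum(mem, TOY_PAGE_SIZE);
    memcpy(disk->data, mem, TOY_PAGE_SIZE);
}

/* Checksum is verified only when the page is read back in. */
static bool read_page_ok(const DiskPage *disk)
{
    return page_checksum(disk->data, TOY_PAGE_SIZE) == disk->checksum;
}

/* Simulated crash mid-write: the new checksum and the first
 * bytes_written bytes reach disk, the rest keeps its old content. */
static void torn_write(DiskPage *disk, const uint8_t *mem, size_t bytes_written)
{
    assert(bytes_written <= TOY_PAGE_SIZE);
    disk->checksum = page_checksum(mem, TOY_PAGE_SIZE);
    memcpy(disk->data, mem, bytes_written);
}
```

If two hint bits are flipped in memory (one in each half of the page) and only the first half reaches disk, read_page_ok reports a mismatch even though nothing is actually corrupt in any harmful sense. A change protected by a WAL full-page image could be repaired during recovery; an unlogged hint-bit write cannot, which is why the thread ties checksums to how hint-bit-only changes are written.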
On Wed, 2010-12-22 at 10:45 -0500, Tom Lane wrote: > Aidan Van Dyk <aidan@highrise.ca> writes: > > With this statement, you just moved the goal posts on the checksumming > > ideas. In fact, you didn't just move the goal posts, you picked the > > ball up and teleported it to another stadium. > > What he said. I can't imagine that anyone will be interested in any > case other than "set the CRC immediately before writing, and check it > upon first reading the page in". Maintaining it continuously while the > page is in shared memory is completely insane from a cost-versus-benefit > perspective. If you insist on setting hint-bits, then that is probably true. Many people experience almost no I/O these days, and there's a strong correlation between people caring about their data and also being willing to spend big $s on cache. We need to protect our users, however much money they spent on cache; I would argue the more money they spent on cache the harder we should be trying to protect them. I'm sure it will take a little while for everybody to understand why a full CRC implementation is both necessary and now possible. Paradigm shifts of thought do seem like teleports, but they can be beneficial. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Dec 22, 2010 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm sure it will take a little while for everybody to understand why a > full CRC implementation is both necessary and now possible. Paradigm > shifts of thought do seem like teleports, but they can be beneficial. But please don't deny the rest of us airbags while you keep working on teleportation ;-) a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > My gut feeling is that a reasonable compromise is to set hint bits like > we do today, but don't mark the page as dirty when only hint bits are > set. That way you get the benefit of hint bits for tuples that are > frequently accessed and stay in buffer cache. But you don't spend any > extra I/O to set them. I think it's far more likely that that could be acceptable than the radical method of removing hint bits altogether. I have not looked into what's wrong with Merlin's test case, but my thinking about it goes like this: we know that contention for buffer lookup is significant at high loads, despite the facts that the accesses are distributed across a lot of independently-usable buffers and we've done much work to partition the lookup locks. If we remove hint bits and thereby force an access to clog for every tuple touch, we can expect that the contention for clog access will be comparable to the worst case for buffer access contention ... except that in many cases, it will be distributed across far fewer pages and so the actual interference rate will be far higher. This will make our past experiences with "context swap storms" look like a day at the beach. regards, tom lane
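[Editor's sketch] The compromise quoted above (set hint bits as today, but do not mark the page dirty when only hint bits change) can be modelled in a few lines. This is a toy model under stated assumptions: the classes HeapTuple, Buffer and Clog and the dirty_on_hint switch are hypothetical names for illustration, and only the "xmin committed" hint is modelled.

```python
from dataclasses import dataclass


@dataclass
class HeapTuple:
    xmin: int
    hint_committed: bool = False   # stand-in for an "xmin committed" hint bit


@dataclass
class Buffer:
    tuples: list
    dirty: bool = False            # dirty buffers must be written back


class Clog:
    """Toy commit log that counts how often it is consulted."""

    def __init__(self, committed_xids):
        self.committed = set(committed_xids)
        self.lookups = 0

    def is_committed(self, xid):
        self.lookups += 1
        return xid in self.committed


def xmin_committed(buf, tup, clog, dirty_on_hint):
    """Use the hint if present; otherwise ask clog and remember the answer."""
    if tup.hint_committed:
        return True
    committed = clog.is_committed(tup.xmin)
    if committed:
        tup.hint_committed = True
        if dirty_on_hint:          # the behaviour the compromise would drop
            buf.dirty = True
    return committed


def scan(buf, clog, dirty_on_hint):
    return sum(xmin_committed(buf, t, clog, dirty_on_hint) for t in buf.tuples)


clog = Clog(committed_xids=range(100, 110))
buf = Buffer(tuples=[HeapTuple(xmin=x) for x in range(100, 110)])
first = scan(buf, clog, dirty_on_hint=False)    # pays one clog lookup per tuple
second = scan(buf, clog, dirty_on_hint=False)   # served entirely from hints
```

In this model the second scan costs no clog lookups and the buffer never becomes dirty, so no extra write is scheduled. The flip side is that the hints live only as long as the buffer stays cached: if it is evicted without being written, the next reader pays the clog lookups again.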
On Wed, 2010-12-22 at 17:42 +0200, Heikki Linnakangas wrote: > On 22.12.2010 17:31, Simon Riggs wrote: > > On Wed, 2010-12-22 at 17:01 +0200, Heikki Linnakangas wrote: > >> There's plenty of stuff in memory that's not covered by an > >> application-level CRC. That's what ECC RAM is for. > > > > http://www.google.com/research/pubs/archive/35162.pdf > > > > Google research shows that each DIMM has an 8% chance per annum of > > uncorrectable memory errors, even on ECC. > > You misread that paper. From summary: I read the paper in detail before I posted. If you think that finding an error in my quote disproves anything, you should read the whole paper. I see this: Conclusion 1 "... Nonetheless, the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms." What you are arguing for is a protection system that will reduce in effectiveness as we add more cache. What I am arguing in favour of is an option to allow people to protect their data, whatever the size of their cache. I'm not forcing you or anyone to use it, but I think it's an important option to be offering to our users. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Dec 22, 2010 at 10:55 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: > On Wed, Dec 22, 2010 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> I'm sure it will take a little while for everybody to understand why a >> full CRC implementation is both necessary and now possible. Paradigm >> shifts of thought do seem like teleports, but they can be beneficial. > > But please don't deny the rest of us airbags while you keep working on > teleportation ;-) well, simon's point that hint bits complicate checksum may or may not be the case, but no hint bits = less i/o = less checksumming (unless you checksum around the hint bits). This lowers the expense of doing it, which is nice. Maybe that doesn't matter in the end, we'll see. merlin
Merlin Moncure <mmoncure@gmail.com> writes: > well, simon's point that hint bits complicate checksum may nor may not > be the case, but no hint bits = less i/o = less checksumming (unless > you checksum around the hint bits). I think you're optimistically assuming the extra clog accesses don't cost any I/O. regards, tom lane
On Wed, Dec 22, 2010 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> My gut feeling is that a reasonable compromise is to set hint bits like >> we do today, but don't mark the page as dirty when only hint bits are >> set. That way you get the benefit of hint bits for tuples that are >> frequently accessed and stay in buffer cache. But you don't spend any >> extra I/O to set them. > > I think it's far more likely that that could be acceptable than the > radical method of removing hint bits altogether. > > I have not looked into what's wrong with Merlin's test case, but my > thinking about it goes like this: we know that contention for buffer > lookup is significant at high loads, despite the facts that the accesses > are distributed across a lot of independently-usable buffers and we've > done much work to partition the lookup locks. If we remove hint bits > and thereby force an access to clog for every tuple touch, we can expect > that the contention for clog access will be comparable to the worst case > for buffer access contention ... except that in many cases, it will be > distributed across far fewer pages and so the actual interference rate > will be far higher. This will make our past experiences with "context > swap storms" look like a day at the beach. right. note I'm not suggesting that they should actually be removed, at least not yet. I was just playing around and noticed that the cost of not having them is not immediately obvious in highly synthetic tests. The cost of clog access in best case scenario appears to be near zero, which I thought was interesting enough to point out. What I'm after here is the worst case scenario, how likely it is to happen, and looking into possible remedies (if any). I'm going to do lots more testing over the holidays. I'm fishing for ideas on good ways to flesh things out more. merlin
On Wed, Dec 22, 2010 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> well, simon's point that hint bits complicate checksum may nor may not >> be the case, but no hint bits = less i/o = less checksumming (unless >> you checksum around the hint bits). > > I think you're optimistically assuming the extra clog accesses don't > cost any I/O. right, but clog is much more highly packed which is both a good and a bad thing. my conjecture here is that jamming the clog files is actually good, because that keeps them 'hot' and more than compensates the extra heap i/o. the extra lock of course is scary. here's the thing, compared to the 90's when they were put in, the transaction space has shrunk by half and we put gigabytes, not megabytes of memory into servers. what does this mean for the clog? that's what i'm after. merlin
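[Editor's sketch] To make "clog is much more highly packed" concrete, the arithmetic below assumes 2 status bits per transaction and 8 kB pages, and takes the usable transaction space as 2^31 XIDs (half of the 32-bit range, matching the "shrunk by half" remark). The constant and function names are illustrative, not taken from the PostgreSQL source.

```python
# Rough clog sizing, assuming 2 status bits per transaction and 8 kB pages.
BLCKSZ = 8192                               # default PostgreSQL block size
BITS_PER_XACT = 2                           # commit status bits per transaction
XACTS_PER_BYTE = 8 // BITS_PER_XACT         # 4 transactions per byte
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE    # 32768 transactions per page


def clog_bytes(n_xacts: int) -> int:
    """Bytes of clog needed to hold status for n_xacts transactions."""
    return -(-n_xacts // XACTS_PER_BYTE)    # ceiling division


def clog_page_of(xid: int) -> int:
    """Which clog page holds the status bits for this XID."""
    return xid // XACTS_PER_PAGE


usable_xids = 2 ** 31                       # half of the 32-bit XID space
total_bytes = clog_bytes(usable_xids)
total_pages = total_bytes // BLCKSZ

print(f"transactions per clog page: {XACTS_PER_PAGE}")
print(f"clog for 2^31 xids: {total_bytes // 2**20} MiB in {total_pages} pages")
```

Under these assumptions the entire status map for the usable XID space is 512 MiB, which is the scale behind the question of whether it could simply be kept hot in memory on a server with gigabytes of RAM. It says nothing about lock contention, which is the other concern raised in this thread.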
On Wed, Dec 22, 2010 at 11:12 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Wed, Dec 22, 2010 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Merlin Moncure <mmoncure@gmail.com> writes: >>> well, simon's point that hint bits complicate checksum may nor may not >>> be the case, but no hint bits = less i/o = less checksumming (unless >>> you checksum around the hint bits). >> >> I think you're optimistically assuming the extra clog accesses don't >> cost any I/O. > > right, but clog is much more highly packed which is both a good and a > bad thing. my conjecture here is that jamming the clog files is > actually good, because that keeps them 'hot' and more than compensates > the extra heap i/o. the extra lock of course is scary. er, should have said, plus less heap i/o compensates the extra clog i/o. merlin
On Wed, 2010-12-22 at 10:59 -0500, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > My gut feeling is that a reasonable compromise is to set hint bits like > > we do today, but don't mark the page as dirty when only hint bits are > > set. That way you get the benefit of hint bits for tuples that are > > frequently accessed and stay in buffer cache. But you don't spend any > > extra I/O to set them. > > I think it's far more likely that that could be acceptable than the > radical method of removing hint bits altogether. I haven't argued to remove them, just have an option to not set them. > I have not looked into what's wrong with Merlin's test case, but my > thinking about it goes like this: we know that contention for buffer > lookup is significant at high loads, despite the facts that the accesses > are distributed across a lot of independently-usable buffers and we've > done much work to partition the lookup locks. If we remove hint bits > and thereby force an access to clog for every tuple touch, we can expect > that the contention for clog access will be comparable to the worst case > for buffer access contention ... except that in many cases, it will be > distributed across far fewer pages and so the actual interference rate > will be far higher. This will make our past experiences with "context > swap storms" look like a day at the beach. I think you're right, but I also think there are other ways we could optimise that other than hint bits. For example, the single item cache might be changed, or we might buffer/batch clog updates, or we might use a hash table of known aborted transactions etc. As Merlin points out, we don't have much evidence for their value or lack of value, so we need a parameter to allow wide scale testing. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Merlin Moncure <mmoncure@gmail.com> writes: > I'm going to do lots more testing over the holidays. I'm fishing for > ideas on good ways to flesh things out more. Based on the analogy to past bufmgr contention problems, I'd suggest going back through the archives to look for the test cases associated with context swap storm discussions. The cases themselves might not be quite right for this, but they'd at least show a structure for stressing things at the tuple-access level. regards, tom lane
On Wed, Dec 22, 2010 at 04:00:30PM +0000, Simon Riggs wrote: > On Wed, 2010-12-22 at 17:42 +0200, Heikki Linnakangas wrote: > > On 22.12.2010 17:31, Simon Riggs wrote: > > > On Wed, 2010-12-22 at 17:01 +0200, Heikki Linnakangas wrote: > > >> There's plenty of stuff in memory that's not covered by an > > >> application-level CRC. That's what ECC RAM is for. > > > > > > http://www.google.com/research/pubs/archive/35162.pdf > > > > > > Google research shows that each DIMM has an 8% chance per annum of > > > uncorrectable memory errors, even on ECC. > > > > You misread that paper. From summary: > > I read the paper in detail before I posted. If you think that finding an > error in my quote disproves anything, you should read the whole paper. I > see this: > > Conclusion 1 > "... Nonetheless, the remaining incidence of 0.22% per DIMM > per year makes a crash-tolerant application layer indispens- > able for large-scale server farms." > > What you are arguing for is a protection system that will reduce in > effectiveness as we add more cache. > > What I am arguing in favour of is an option to allow people to protect > their data, whatever the size of their cache. I'm not forcing you or > anyone to use it, but I think its an important option to be offering to > our users. For what version of PostgreSQL are you proposing that we provide this protection? Let's assume that it's before 10.0 so we can get some idea of how this will arise :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 22.12.2010 18:12, Merlin Moncure wrote: > On Wed, Dec 22, 2010 at 11:06 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Merlin Moncure<mmoncure@gmail.com> writes: >>> well, simon's point that hint bits complicate checksum may nor may not >>> be the case, but no hint bits = less i/o = less checksumming (unless >>> you checksum around the hint bits). >> >> I think you're optimistically assuming the extra clog accesses don't >> cost any I/O. > > right, but clog is much more highly packed which is both a good and a > bad thing. As a sidenote: note that the clog is not currently CRC'd. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> right -- see the attached clog_stress.sql above. It creates a script > that inserts records in blocks of 10000, deletes half of them, and > vacuums. Neither the execution of the script nor a seq scan following > its execution showed an interesting performance difference (which I am > arbitrarily calling 5% in either direction). Like I said though, I > don't trust the patch or the results yet. Given that DBT2 stressed the bufmgr contention pretty well, it seems like it'd be worth trying this for hint bits in the test servers. We should see if Mark Wong can do this in the new year. I might be able to test on some client workloads. We'll see; currently I lack the harness to simulate a high level of client contention. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 23/12/10 05:06, Merlin Moncure wrote: > On Wed, Dec 22, 2010 at 10:59 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >>> My gut feeling is that a reasonable compromise is to set hint bits like >>> we do today, but don't mark the page as dirty when only hint bits are >>> set. That way you get the benefit of hint bits for tuples that are >>> frequently accessed and stay in buffer cache. But you don't spend any >>> extra I/O to set them. >> I think it's far more likely that that could be acceptable than the >> radical method of removing hint bits altogether. >> >> I have not looked into what's wrong with Merlin's test case, but my >> thinking about it goes like this: we know that contention for buffer >> lookup is significant at high loads, despite the facts that the accesses >> are distributed across a lot of independently-usable buffers and we've >> done much work to partition the lookup locks. If we remove hint bits >> and thereby force an access to clog for every tuple touch, we can expect >> that the contention for clog access will be comparable to the worst case >> for buffer access contention ... except that in many cases, it will be >> distributed across far fewer pages and so the actual interference rate >> will be far higher. This will make our past experiences with "context >> swap storms" look like a day at the beach. > right. note I'm not suggesting they they should actually be removed, > at least not yet. I was just playing around and noticed that the cost > of not having them is not immediately obvious in highly synthetic > tests. The cost of clog access in best case scenario appears to be > near zero, which I thought was interesting enough to point out. What > I'm after here is the worst case scenario, how likely it is to happen, > and looking into possible remedies (if any). > > I'm going to do lots more testing over the holidays. I'm fishing for > ideas on good ways to flesh things out more. > > Certainly having a choice about configuring them would be a good addition in itself, e.g. for data warehousing use the hint bits can be a considerable impediment, so the *ability* to not have them would be a huge advantage. If I have time over the early new year I'll do some testing too. Cheers Mark
> I believe that most of the people talking about and wanting checksums > so far have been wanting them to verify I/O, not to verify that PG has > no bugs, that RAM is staying charged correctly, and that no stray bits > have been flipped, and that nobody else happens to be scribbling over > our shared buffers. I agree that this should be our first goal. Yes, we want to protect users against memory errors as well. However, that's a much tougher feature to implement; I've spent some time hashing this out with engineers on other DBMSes and nobody has good answers right now. The overhead of what Simon proposes would be enormous, and few users would be interested in paying that cost. Doing a CRC check-on-write, as well as checking for format corruption before write, would catch a majority of real-world problems. Please don't hold that up in pursuit of the bit-flipping problem, which *nobody* has solved. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
> Certainly having a choice about configuring them would be a good > addition in itself, e.g for data warehousing use the hint bits can be a > considerable impediment so the *ability* to not have them would be a > huge advantage. Would need to be a restart option, no? Regarding the contention which Tom expects: the extra load on the CLOG would be 100% reads, no? If it's *all* reads, why would we have any more contention than we have now? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > Regarding the contention which Tom expects: the extra load on the CLOG > would be 100% reads, no? If it's *all* reads, why would we have any > more contention than we have now? Read involves sharelock which still causes contention. Those bufmgr contention storms we saw before were completely independent of whether the pages were accessed for read or for write. Another thing to keep in mind is that the current clog access code is designed on the assumption that there's considerable locality of access to pg_clog, ie, you usually only need to consult it for recent XIDs because older ones have been hinted. Turn off hint bits, that behavior goes out the window. regards, tom lane
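[Editor's sketch] The locality argument above can be illustrated with a toy LRU cache of clog pages. Everything here is an assumption made for illustration: 8 cache slots (64 kB of clog buffers at 8 kB per page, the figure mentioned elsewhere in this thread), 32768 transactions per page, and two synthetic access patterns. It models page-cache misses only, not the share-lock contention that is also raised above.

```python
import random
from collections import OrderedDict

XACTS_PER_PAGE = 32768   # assumes 2 bits per xact on an 8 kB page


def hit_rate(xids, n_slots=8):
    """Fraction of lookups served from an LRU cache of n_slots clog pages."""
    cache = OrderedDict()
    hits = total = 0
    for xid in xids:
        page = xid // XACTS_PER_PAGE
        total += 1
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > n_slots:
                cache.popitem(last=False)  # evict least recently used page
    return hits / total


rng = random.Random(42)                    # fixed seed for repeatability
newest_xid = 50_000_000                    # hypothetical current XID

# With hint bits: only recent XIDs still need a clog lookup.
recent = [newest_xid - rng.randrange(100_000) for _ in range(20_000)]
# Without hint bits: any XID still present in the table may be looked up.
anywhere = [rng.randrange(newest_xid) for _ in range(20_000)]

recent_rate = hit_rate(recent)
anywhere_rate = hit_rate(anywhere)
print(f"recent XIDs only: {recent_rate:.1%} hits")
print(f"XIDs from anywhere: {anywhere_rate:.1%} hits")
```

In this model the recent-only pattern touches a handful of pages and almost never misses, while lookups spread across the whole XID range almost always miss an 8-slot cache. A real workload would sit somewhere between the two, which is what the proposed benchmarks would need to establish.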
Josh Berkus <josh@agliodbs.com> writes: > I might be able to test on some client workloads. We'll see; currently > I lack the harness to simulate a high level of client contention. We're pretty successful in doing that with Tsung, even against large clusters of plproxy nodes. http://tsung.erlang-projects.org/ http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 23/12/10 10:54, Tom Lane wrote: > Josh Berkus<josh@agliodbs.com> writes: >> Regarding the contention which Tom expects: the extra load on the CLOG >> would be 100% reads, no? If it's *all* reads, why would we have any >> more contention than we have now? > Read involves sharelock which still causes contention. Those bufmgr > contention storms we saw before were completely independent of whether > the pages were accessed for read or for write. > > Another thing to keep in mind is that the current clog access code is > designed on the assumption that there's considerable locality of access > to pg_clog, ie, you usually only need to consult it for recent XIDs > because older ones have been hinted. Turn off hint bits, that behavior > goes out the window. Would a larger (or configurable) clog cache help with this tho? Cheers Mark
On Wed, 2010-12-22 at 22:08 +0200, Heikki Linnakangas wrote: > On 22.12.2010 18:12, Merlin Moncure wrote: > > On Wed, Dec 22, 2010 at 11:06 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > >> Merlin Moncure<mmoncure@gmail.com> writes: > >>> well, simon's point that hint bits complicate checksum may nor may not > >>> be the case, but no hint bits = less i/o = less checksumming (unless > >>> you checksum around the hint bits). > >> > >> I think you're optimistically assuming the extra clog accesses don't > >> cost any I/O. > > > > right, but clog is much more highly packed which is both a good and a > > bad thing. > > As a sidenote: note that the clog is not currently CRC'd. Good point, thanks for mentioning it. We have 64kB of clog buffers and potentially 8 GB of shared_buffers, which is about 10^5 times more RAM for shared_buffers. So a protection mechanism for shared_buffers will trap about 99.999% of RAM errors. We might say that an error in clog could have a serious effect, and I would agree. I don't see a way around that though, except for a CRC check when we write to disk. My understanding is that the context switch storms were because of the I/O involved with thrashing the clog buffers. (Well, actually, I think it was subtrans, but same difference). To solve that, we could just swap them out to shared_buffers with usage = 5 rather than evict them. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
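[Editor's sketch] The 10^5 and 99.999% figures in the message above follow from simple arithmetic on the two buffer sizes. The sketch assumes that RAM errors land uniformly across both areas, which is a simplification made here for illustration.

```python
# Share of buffer RAM that a shared_buffers-only protection scheme covers,
# ASSUMING errors are spread uniformly over clog buffers and shared_buffers.
KIB = 1024
GIB = 1024 ** 3

clog_buffers = 64 * KIB        # 64kB of clog buffers, as stated above
shared_buffers = 8 * GIB       # "potentially 8 GB of shared_buffers"

ratio = shared_buffers // clog_buffers
covered = shared_buffers / (shared_buffers + clog_buffers)

print(f"shared_buffers is {ratio} times larger than the clog buffers")
print(f"fraction covered by protecting shared_buffers only: {covered:.5%}")
```

The ratio works out to 131072, on the order of 10^5, and the covered fraction is a little above 99.999%, consistent with the estimate in the message.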
On Wed, Dec 22, 2010 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> Regarding the contention which Tom expects: the extra load on the CLOG >> would be 100% reads, no? If it's *all* reads, why would we have any >> more contention than we have now? > > Read involves sharelock which still causes contention. Those bufmgr > contention storms we saw before were completely independent of whether > the pages were accessed for read or for write. > > Another thing to keep in mind is that the current clog access code is > designed on the assumption that there's considerable locality of access > to pg_clog, ie, you usually only need to consult it for recent XIDs > because older ones have been hinted. Turn off hint bits, that behavior > goes out the window. That's not always going to be the case though. In olap-ish environments you will see cases of scans over many records that come from a single transaction. This is also the case where hint bits can really drill you -- you insert a bunch of records, log the bits, delete, log the bits, and vacuum eventually. I started investigating this on behalf of a friend who is experiencing basically the worst case with regularity. merlin