Thread: WAL replay bugs
I've been playing with a little hack that records a before and after image of every page modification that is WAL-logged, and writes the images to a file along with the LSN of the corresponding WAL record. I set up a master-standby replication with that hack in place in both servers, and ran the regression suite. Then I compared the after images after every WAL record, as written on master, and as replayed by the standby. The idea is that the page content in the standby after replaying a WAL record should be identical to the page in the master, when the WAL record was generated. There are some known cases where that doesn't hold, but it's a useful sanity check. To reduce noise, I've been focusing on one access method at a time, filtering out others. I did that for GIN first, and indeed found a bug in my new incomplete-split code, see commit 594bac42. After fixing that, and zeroing some padding bytes (38a2b95c), I'm now getting a clean run with that. Next, I took on GiST, and lo-and-behold found a bug there pretty quickly as well. This one has been there ever since we got Hot Standby: the redo of a page update (e.g an insertion) resets the right-link of the page. If there is a concurrent scan, in a hot standby server, that scan might still need the rightlink, and will hence miss some tuples. This can be reproduced like this: 1. in master, create test table. CREATE TABLE gisttest (id int4); CREATE INDEX gisttest_idx ON gisttest USING gist (id); INSERT INTO gisttest SELECT g * 1000 from generate_series(1, 100000) g; -- Test function. Starts a scan, fetches one row from it, then waits 10 seconds until fetching the rest of the rows. -- Returns the number of rows scanned. Should be 100000 if you follow -- these test instructions. CREATE OR REPLACE FUNCTION gisttestfunc() RETURNS int AS $$ declare i int4; t text; cur CURSOR FOR SELECT 'foo' FROM gisttest WHERE id >= 0; begin set enable_seqscan=off; set enable_bitmapscan=off; i = 0; OPEN cur; FETCH cur INTO t; perform pg_sleep(10); LOOP EXIT WHEN NOT FOUND; -- this is bogus on first iterationi = i + 1; FETCH cur INTO t; END LOOP; CLOSE cur; RETURN i; END; $$ LANGUAGE plpgsql; 2. in standby SELECT gisttestfunc(); <blocks> 3. Quickly, before the scan in standby continues, cause some page splits: INSERT INTO gisttest SELECT g * 1000+1 from generate_series(1, 100000) g; 4. The scan in standby finishes. It should return 100000, but will return a lower number if you hit the bug. At a quick glance, I think fixing that is just a matter of not resetting the right-link. I'll take a closer look tomorrow, but for now I just wanted to report what I've been doing. I'll post the scripts I've been using later too - nag me if I don't. - Heikki
On 04/07/2014 02:16 PM, Heikki Linnakangas wrote: > I've been playing with a little hack that records a before and after > image of every page modification that is WAL-logged, and writes the > images to a file along with the LSN of the corresponding WAL record. I > set up a master-standby replication with that hack in place in both > servers, and ran the regression suite. Then I compared the after images > after every WAL record, as written on master, and as replayed by the > standby. This is awesome ... thank you for doing this. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > > I've been playing with a little hack that records a before and after image > of every page modification that is WAL-logged, and writes the images to a > file along with the LSN of the corresponding WAL record. I set up a > master-standby replication with that hack in place in both servers, and ran > the regression suite. Then I compared the after images after every WAL > record, as written on master, and as replayed by the standby. Assuming that adding some dedicated hooks in the core able to do actions before and after a page modification occur is not *that* costly (well I imagine that it is not acceptable in terms of performance), could it be possible to get that in the shape of a extension that could be used to test WAL record consistency? This may be an idea to think about... -- Michael
I executed given steps many times to produce this bug. But still I unable to hit this bug. I used attached scripts to produce this bug. Can I get scripts to produce this bug? wal_replay_bug.sh <http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh> ----- Thanks and Regards, Sachin Kotwal NTT-DATA-OSS Center (Pune) -- View this message in context: http://postgresql.1045698.n5.nabble.com/WAL-replay-bugs-tp5799053p5799512.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On 04/10/2014 10:52 AM, sachin kotwal wrote: > > I executed given steps many times to produce this bug. > But still I unable to hit this bug. > I used attached scripts to produce this bug. > > Can I get scripts to produce this bug? > > > wal_replay_bug.sh > <http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh> Oh, I can't reproduce it using that script either. I must've used some variation of it, and posted wrong script. The attached seems to do the trick. I changed the INSERT statements slightly, so that all the new rows have the same key. Thanks for verifying this! - Heikki
Attachment
<div dir="ltr"><div class="gmail_extra"><br /><br /><div class="gmail_quote">On Thu, Apr 10, 2014 at 6:21 PM, Heikki Linnakangas<span dir="ltr"><<a href="mailto:hlinnakangas@vmware.com" target="_blank">hlinnakangas@vmware.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex">On 04/10/2014 10:52 AM, sachin kotwal wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br /> I executed given steps manytimes to produce this bug.<br /> But still I unable to hit this bug.<br /> I used attached scripts to produce this bug.<br/><br /> Can I get scripts to produce this bug?<br /><br /><br /></blockquote><br /> Oh, I can't reproduce it usingthat script either. I must've used some variation of it, and posted wrong script.<br /><br /> The attached seems todo the trick. I changed the INSERT statements slightly, so that all the new rows have the same key.<br /><br /> Thanksfor verifying this!<span class="HOEnZb"><font color="#888888"><br /></font></span></blockquote></div><br /></div><divclass="gmail_extra">Thanks to explain the case to produce this bug.</div><div class="gmail_extra">I am able toproduce this bug by using latest scripts from last mail.</div><div class="gmail_extra">I applied patch submitted for thisbug and re-run the scripts.</div><div class="gmail_extra">Now it is giving correct result.</div><div class="gmail_extra"><br/></div><div class="gmail_extra"><br /></div><div class="gmail_extra">Thanks and Regards,</div><divclass="gmail_extra"><br /></div><div class="gmail_extra">Sachin Kotwal</div></div>
On 04/08/2014 06:41 AM, Michael Paquier wrote: > On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> >> I've been playing with a little hack that records a before and after image >> of every page modification that is WAL-logged, and writes the images to a >> file along with the LSN of the corresponding WAL record. I set up a >> master-standby replication with that hack in place in both servers, and ran >> the regression suite. Then I compared the after images after every WAL >> record, as written on master, and as replayed by the standby. > Assuming that adding some dedicated hooks in the core able to do > actions before and after a page modification occur is not *that* > costly (well I imagine that it is not acceptable in terms of > performance), could it be possible to get that in the shape of a > extension that could be used to test WAL record consistency? This may > be an idea to think about... Yeah, working on it. It can live as a patch set if nothing else. This has been very fruitful, I just committed another fix for a bug I found with this earlier today. There are quite a few things that cause differences between master and standby. We have hint bits in many places, unused space that isn't zeroed etc. Two things that are not bugs, but I'd like to change just to make this tool easier to maintain, and to generally clean things up: 1. When creating a sequence, we first use simple_heap_insert() to insert the sequence tuple, which creates a WAL record. Then we write a new sequence RM WAL record about the same thing. The reason is that the WAL record written by regular heap_insert is bogus for a sequence tuple. After replaying just the heap insertion, but not the other record, the page doesn't have the magic value indicating that it's a sequence, i.e. it's broken as a sequence page. That's OK because we only do this when creating a new sequence, so if we crash between those two records, the whole relation is not visible to anyone. Nevertheless, I'd like to fix that by using PageAddItem directly to insert the tuple, instead of simple_heap_insert. We have to override the xmin field of the tuple anyway, and we don't need any of the other services like finding the insert location, toasting, visibility map or freespace map updates, that simple_heap_insert() provides. 2. _bt_restore_page, when restoring a B-tree page split record. It adds tuples to the page in reverse order compared to how it's done in master. There is a comment noting that, and it asks "Is it worth changing just on general principles?". Yes, I think it is. Any objections to changing those two? - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Two things that are not bugs, but I'd like to change just to make this > tool easier to maintain, and to generally clean things up: > 1. When creating a sequence, we first use simple_heap_insert() to insert > the sequence tuple, which creates a WAL record. Then we write a new > sequence RM WAL record about the same thing. The reason is that the WAL > record written by regular heap_insert is bogus for a sequence tuple. > After replaying just the heap insertion, but not the other record, the > page doesn't have the magic value indicating that it's a sequence, i.e. > it's broken as a sequence page. That's OK because we only do this when > creating a new sequence, so if we crash between those two records, the > whole relation is not visible to anyone. Nevertheless, I'd like to fix > that by using PageAddItem directly to insert the tuple, instead of > simple_heap_insert. We have to override the xmin field of the tuple > anyway, and we don't need any of the other services like finding the > insert location, toasting, visibility map or freespace map updates, that > simple_heap_insert() provides. > 2. _bt_restore_page, when restoring a B-tree page split record. It adds > tuples to the page in reverse order compared to how it's done in master. > There is a comment noting that, and it asks "Is it worth changing just > on general principles?". Yes, I think it is. > Any objections to changing those two? Not here. I've always suspected #2 was going to bite us someday anyway. regards, tom lane
On Thu, Apr 17, 2014 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Any objections to changing those two? > > Not here. I've always suspected #2 was going to bite us someday anyway. +1 -- Peter Geoghegan
On 04/17/2014 07:59 PM, Heikki Linnakangas wrote: > On 04/08/2014 06:41 AM, Michael Paquier wrote: >> On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> >>> I've been playing with a little hack that records a before and after image >>> of every page modification that is WAL-logged, and writes the images to a >>> file along with the LSN of the corresponding WAL record. I set up a >>> master-standby replication with that hack in place in both servers, and ran >>> the regression suite. Then I compared the after images after every WAL >>> record, as written on master, and as replayed by the standby. >> Assuming that adding some dedicated hooks in the core able to do >> actions before and after a page modification occur is not *that* >> costly (well I imagine that it is not acceptable in terms of >> performance), could it be possible to get that in the shape of a >> extension that could be used to test WAL record consistency? This may >> be an idea to think about... > > Yeah, working on it. It can live as a patch set if nothing else. > > This has been very fruitful, I just committed another fix for a bug I > found with this earlier today. > > There are quite a few things that cause differences between master and > standby. We have hint bits in many places, unused space that isn't > zeroed etc. [a few more fixed bugs later] Ok, I'm now getting clean output when running the regression suite with this tool. And here is the tool itself. It consists of two parts: 1. Modifications to the backend to write the page images 2. A post-processing tool to compare the logged images between master and standby. The attached diff contains both parts. The postprocessing tool is in contrib/page_image_logging. See contrib/page_image_logging/README for instructions. Let me know if you have any questions or need further help running the tool. I've also pushed this to my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "page_image_logging". I intend to keep it up-to-date with current master. This is a pretty ugly hack, so I'm not proposing to commit this in the current state. But perhaps this could be done more cleanly, by adding some hooks in the backend as Michael suggested. - Heikki
Attachment
On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > And here is the tool itself. It consists of two parts: > > 1. Modifications to the backend to write the page images > 2. A post-processing tool to compare the logged images between master and > standby. Having that into Postgres at the disposition of developers would be great, and I believe that it would greatly reduce the occurrence of bugs caused by WAL replay during recovery. So, with the permission of the author, I have been looking at this facility for a cleaner integration into Postgres. Roughly, this utility is made of three parts: 1) A set of masking functions that can be used on page images to normalize them. This is used to put magic numbers or enforce flag values to make page content consistent across nodes. This is for example the case of the free space between pd_lower and pd_upper, pd_flags, etc. Of course this depends on the type of page (btree, heap, etc.). 2) Facility to memorize, analyze if they have been modified, and flush page images to a dedicated file. This interacts with the buffer manager mainly. 3) Facility to reorder page images within the same WAL record as master/standby may not write them in the same order on a standby or a master due to for example lock released in different order. This is part of the binary analyzing the diffs between master and standby. As of now, 2) is integrated in the backend, 1) and 3) are part of the contrib module. However I am thinking that 1) and 2) should be done in core using an ifdef similar to CLOBBER_FREED_MEMORY, to mask the page images and write them in a dedicated file (in global/ ?), while 3) would be fine as a separate binary in contrib/. An essential thing to add would be to have a set of regression tests that developers and buildfarm machines could directly use. Perhaps there are parts of what is proposed here that could be made more generalized, like the masking functions. So do not hesitate if you have any opinion on the matter. Regards, -- Michael
On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > Perhaps there are parts of what is proposed here that could be made > more generalized, like the masking functions. So do not hesitate if > you have any opinion on the matter. OK, attached is the result of this hacking: Buffer capture facility: check WAL replay consistency It is a tool aimed to be used by developers and buildfarm machines that can be used to check for consistency at page level when replaying WAL files among several nodes of a cluster (generally master and standby node). This facility is made of two parts: - A server part, where all the changes happening at page level are captured and inserted in a file called buffer_captures located at the root of PGDATA. Each buffer entry is masked to make the comparison across node consistent (flags like hint bits for example) and then each buffer is captured is with the following format as a single line of the output file: LSN: %08X/%08X page: PAGE_IN_HEXA Hexadecimal format makes it easier to detect differences between pages, and format is chosen to facilitate comparison between buffer entries. - A client part, located in contrib/buffer_capture_cmp, that can be used to compare buffer captures between nodes. The footprint on core code is minimal and is controlled by a symbol called BUFFER_CAPTURE that needs to be set at build time to enable the buffer capture at server level. If this symbol is not enabled, both server and client parts are idle and generate nothing. Note that this facility can generate a lot of output (11G when running regression tests, counting double when using both master and standby). contrib/buffer_capture_cmp contains a regression test facility easing testing with buffer captures. The user just needs to run "make check" in this folder... There is a default set of tests saved in test-default.sh but user is free to set up custom tests by creating a file called test-custom.sh that can be kicked by the test facility if this file is present instead of the defaults. Patch will be added to the first commit fest as well. Note that the footprint on core code is limited, so even if there is more than 1k lines of codes, review is simpler than it looks. A couple of things to note though: 1) In order to detect if a page is used for a sequence, SEQ_MAGIC needs to be exposed in sequence.h. This is included in the patch attached but perhaps this should be changed as a separate patch 2) Regression test facility uses some useful parts taken from pg_upgrade. I think that we should gather those parts in a common place (contrib/common?). This can facilitate the integration of other modules using regression based on bash scripts. 3) While hacking this facility, I noticed that some ItemId entries in btree pages could be inconsistent between master and standby. Those items are masked in the current patch, but it looks like a bug of Postgres itself. Documentation is added in the code itself, I didn't feel any need to expose this facility the lambda users in doc/src/sgml... Regards, -- Michael
Attachment
On 06/13/2014 10:14 AM, Michael Paquier wrote: > On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >> Perhaps there are parts of what is proposed here that could be made >> more generalized, like the masking functions. So do not hesitate if >> you have any opinion on the matter. > OK, attached is the result of this hacking: > > Buffer capture facility: check WAL replay consistency > > It is a tool aimed to be used by developers and buildfarm machines > that can be used to check for consistency at page level when replaying > WAL files among several nodes of a cluster (generally master and > standby node). > > This facility is made of two parts: > - A server part, where all the changes happening at page level are > captured and inserted in a file called buffer_captures located at the > root of PGDATA. Each buffer entry is masked to make the comparison > across node consistent (flags like hint bits for example) and then > each buffer is captured is with the following format as a single line > of the output file: > LSN: %08X/%08X page: PAGE_IN_HEXA > Hexadecimal format makes it easier to detect differences between > pages, and format is chosen to facilitate comparison between buffer > entries. > - A client part, located in contrib/buffer_capture_cmp, that can be > used to compare buffer captures between nodes. Oh, you moved the masking code from the client tool to the backend. Why? When debugging, it's useful to have the genuine, non-masked page image available. - Heikki
On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 06/13/2014 10:14 AM, Michael Paquier wrote: >> >> On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> >>> On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas >>> <hlinnakangas@vmware.com> wrote: >>> Perhaps there are parts of what is proposed here that could be made >>> more generalized, like the masking functions. So do not hesitate if >>> you have any opinion on the matter. >> >> OK, attached is the result of this hacking: >> >> Buffer capture facility: check WAL replay consistency >> >> It is a tool aimed to be used by developers and buildfarm machines >> that can be used to check for consistency at page level when replaying >> WAL files among several nodes of a cluster (generally master and >> standby node). >> >> This facility is made of two parts: >> - A server part, where all the changes happening at page level are >> captured and inserted in a file called buffer_captures located at the >> root of PGDATA. Each buffer entry is masked to make the comparison >> across node consistent (flags like hint bits for example) and then >> each buffer is captured is with the following format as a single line >> of the output file: >> LSN: %08X/%08X page: PAGE_IN_HEXA >> Hexadecimal format makes it easier to detect differences between >> pages, and format is chosen to facilitate comparison between buffer >> entries. >> - A client part, located in contrib/buffer_capture_cmp, that can be >> used to compare buffer captures between nodes. > > > Oh, you moved the masking code from the client tool to the backend. Why? > When debugging, it's useful to have the genuine, non-masked page image > available. My thought is to share the CPU effort of masking between backends... That's not a big deal to move them back to the client tool though. -- Michael
On Fri, Jun 13, 2014 at 4:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> On 06/13/2014 10:14 AM, Michael Paquier wrote: >>> >>> On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> >>>> On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas >>>> <hlinnakangas@vmware.com> wrote: >>>> Perhaps there are parts of what is proposed here that could be made >>>> more generalized, like the masking functions. So do not hesitate if >>>> you have any opinion on the matter. >>> >>> OK, attached is the result of this hacking: >>> >>> Buffer capture facility: check WAL replay consistency >>> >>> It is a tool aimed to be used by developers and buildfarm machines >>> that can be used to check for consistency at page level when replaying >>> WAL files among several nodes of a cluster (generally master and >>> standby node). >>> >>> This facility is made of two parts: >>> - A server part, where all the changes happening at page level are >>> captured and inserted in a file called buffer_captures located at the >>> root of PGDATA. Each buffer entry is masked to make the comparison >>> across node consistent (flags like hint bits for example) and then >>> each buffer is captured is with the following format as a single line >>> of the output file: >>> LSN: %08X/%08X page: PAGE_IN_HEXA >>> Hexadecimal format makes it easier to detect differences between >>> pages, and format is chosen to facilitate comparison between buffer >>> entries. >>> - A client part, located in contrib/buffer_capture_cmp, that can be >>> used to compare buffer captures between nodes. >> >> >> Oh, you moved the masking code from the client tool to the backend. Why? >> When debugging, it's useful to have the genuine, non-masked page image >> available. > My thought is to share the CPU effort of masking between backends... And that having a set of API to do page masking on the server side would be useful for extensions as well. Now that I recall this was one of the first things that came to my mind when looking at this facility, thinking that it would be useful to have them in a separate file, with a dedicated header. -- Michael
On Fri, Jun 13, 2014 at 4:14 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > A couple of things to note though: > 1) In order to detect if a page is used for a sequence, SEQ_MAGIC > needs to be exposed in sequence.h. This is included in the patch > attached but perhaps this should be changed as a separate patch > 2) Regression test facility uses some useful parts taken from > pg_upgrade. I think that we should gather those parts in a common > place (contrib/common?). This can facilitate the integration of other > modules using regression based on bash scripts. > 3) While hacking this facility, I noticed that some ItemId entries in > btree pages could be inconsistent between master and standby. Those > items are masked in the current patch, but it looks like a bug of > Postgres itself. Attached are 3 patches doing exactly this separation for lisibility. Regards, -- Michael
Attachment
On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> And here is the tool itself. It consists of two parts: >> >> 1. Modifications to the backend to write the page images >> 2. A post-processing tool to compare the logged images between master and >> standby. > Having that into Postgres at the disposition of developers would be > great, and I believe that it would greatly reduce the occurrence of > bugs caused by WAL replay during recovery. So, with the permission of > the author, I have been looking at this facility for a cleaner > integration into Postgres. I'm not sure if this is reasonably possible, but one thing that would make this tool a whole lot easier to use would be if you could make all the magic happen in a single server. For example, suppose you had a background process that somehow got access to the pre and post images for every buffer change, and the associated WAL record, and tried applying the WAL record to the pre-image to see whether it got the corresponding post-image. Then you could run 'make check' or so and afterwards do something like psql -c 'SELECT * FROM wal_replay_problems()' and hopefully get no rows back. Don't get me wrong, having this tool at all sounds great. But I think to really get the full benefit out of it we need to be able to run it in the buildfarm, so that if people break stuff it gets noticed quickly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > I'm not sure if this is reasonably possible, but one thing that would > make this tool a whole lot easier to use would be if you could make > all the magic happen in a single server. For example, suppose you had > a background process that somehow got access to the pre and post > images for every buffer change, and the associated WAL record, and > tried applying the WAL record to the pre-image to see whether it got > the corresponding post-image. Then you could run 'make check' or so > and afterwards do something like psql -c 'SELECT * FROM > wal_replay_problems()' and hopefully get no rows back. So your point is to have a 3rd independent server in the process that would compare images taken from a master and its standby? Seems to complicate the machinery. > Don't get me wrong, having this tool at all sounds great. But I think > to really get the full benefit out of it we need to be able to run it > in the buildfarm, so that if people break stuff it gets noticed > quickly. The patch I sent has included a regression test suite making the tests rather facilitated: that's only a matter of running actually "make check" in the contrib repository containing the binary able to compare buffer captures between a master and a standby. Thanks, -- Michael
On Tue, Jun 17, 2014 at 5:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> I'm not sure if this is reasonably possible, but one thing that would >> make this tool a whole lot easier to use would be if you could make >> all the magic happen in a single server. For example, suppose you had >> a background process that somehow got access to the pre and post >> images for every buffer change, and the associated WAL record, and >> tried applying the WAL record to the pre-image to see whether it got >> the corresponding post-image. Then you could run 'make check' or so >> and afterwards do something like psql -c 'SELECT * FROM >> wal_replay_problems()' and hopefully get no rows back. > So your point is to have a 3rd independent server in the process that > would compare images taken from a master and its standby? Seems to > complicate the machinery. No, I was trying to get it down form 2 servers to 1, not 2 servers up to 3. >> Don't get me wrong, having this tool at all sounds great. But I think >> to really get the full benefit out of it we need to be able to run it >> in the buildfarm, so that if people break stuff it gets noticed >> quickly. > The patch I sent has included a regression test suite making the tests > rather facilitated: that's only a matter of running actually "make > check" in the contrib repository containing the binary able to compare > buffer captures between a master and a standby. Cool! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 16, 2014 at 12:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
Here are rebased patches, their was a conflict with a recent commit in contrib/pg_upgrade.On Fri, Jun 13, 2014 at 4:14 PM, Michael PaquierAttached are 3 patches doing exactly this separation for lisibility.
<michael.paquier@gmail.com> wrote:
> A couple of things to note though:
> 1) In order to detect if a page is used for a sequence, SEQ_MAGIC
> needs to be exposed in sequence.h. This is included in the patch
> attached but perhaps this should be changed as a separate patch
> 2) Regression test facility uses some useful parts taken from
> pg_upgrade. I think that we should gather those parts in a common
> place (contrib/common?). This can facilitate the integration of other
> modules using regression based on bash scripts.
> 3) While hacking this facility, I noticed that some ItemId entries in
> btree pages could be inconsistent between master and standby. Those
> items are masked in the current patch, but it looks like a bug of
> Postgres itself.
--
Michael
Attachment
On Fri, Jun 27, 2014 at 2:57 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
I am resending patch 2 as it contained a rebase conflict not correctly resolved (Thanks Alvaro).
Here are rebased patches, their was a conflict with a recent commit in contrib/pg_upgrade.
Regards,
--
Michael
Michael
Attachment
Hello, I had a look on this patch. Let me show you some comments about the README, Makefile and buffer_capture_cmp of the second part for the present. A continuation of this comment would be seen later.. - contrib/buffer_capture_cmp/README - 'contains' seems duplicate in the first paragraph. - The second pragraph says that 'This script can use the node number of the master node available as the first argumentof the script when it is run within the test suite.' But test.sh seems not giving such a parameter. - contrib/buffer_capture_cmp/Makefile "make check" does nothing when BUFFER_CAPTURE is not defined, asdescribed in itself. But I trapped by that after build theserverby 'make CFLAGS="-DBUFFER_CAPTURE"':( It would be betterthat 'make check' without defining it prints some message. - buffer_capture_cmp.c This source generates void executable when BUFFER_CAPTURE is not defined. The restriction seems to be put only to use BUFFER_CAPTURE_FILEin bufcapt.h. If so, changing the parameter of the executable as described in the following comment formain() would blow off the necessity for the restriction. - buffer_capture_cmp.c/main() The parameters for this command are the parent directories for each capture file. This is a bit incovenient for separateuse. For example, when I want to gather the capture files from multiple servers then compare them, I should unwillinglymake their own directories for each capture file. If no particular reason exists for the design, I suppose itwould be more convenient that the parameters are the names of the capture files themselves. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jul 1, 2014 at 7:25 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
-- Hello, I had a look on this patch.
Thanks for your comments. Looking forward to seeing some more input.
- contrib/buffer_capture_cmp/README
- 'contains' seems duplicate in the first paragraph.
- The second paragraph says that 'This script can use the node
number of the master node available as the first argument of
the script when it is run within the test suite.' But test.sh
seems not giving such a parameter.
Yeah right... This was a rest of some previous hacking on this feature. Paragraph was rather unclear so I rewrote it, mentioning that the custom script can use PGPORT to connect to the node where tests can be run.
- contrib/buffer_capture_cmp/Makefile
"make check" does nothing when BUFFER_CAPTURE is not defined, as
described in itself. But I trapped by that after build the
server by 'make CFLAGS="-DBUFFER_CAPTURE"':( It would be better
that 'make check' without defining it prints some message.
Sure, I added such a message in the makefile.
- buffer_capture_cmp.c
This source generates void executable when BUFFER_CAPTURE is
not defined. The restriction seems to be put only to use
BUFFER_CAPTURE_FILE in bufcapt.h. If so, changing the parameter
of the executable as described in the following comment for
main() would blow off the necessity for the restriction.
Done. The compilation of this utility is now independent on BUFFER_CAPTURE. At the same time I made test.sh a bit smarter to have it grab the value of BUFFER_CAPTURE_FILE directly from bufcapt.h.
- buffer_capture_cmp.c/main()
The parameters for this command are the parent directories for
each capture file. This is a bit inconvenient for separate
use. For example, when I want to gather the capture files from
multiple servers then compare them, I should unwillingly make
their own directories for each capture file. If no particular
reason exists for the design, I suppose it would be more
convenient that the parameters are the names of the capture
files themselves.
Fixed. I changed back the utility to directly file names instead of data folders as arguments.
Updated patches addressing those comments are attached.
Regards,
Michael
Attachment
Hello, > Thanks for your comments. Looking forward to seeing some more input. Thank you. bufcapt.c was a poser. bufcapt.c: 326 memcpy(&tail, &page[BLCKSZ - 2], 2); This seems duzzling.. Isn't "*(uint16*)(&page[BLCKSZ - 2])" applicable? bufcapt.c: 331 else if (PageGetSpecial.... Generally saying, the code to identify the type of page is too heuristic and seems fragile. Pehaps this is showing that no tidy or generalized way to tell what a page is used for. Many of the modules which have theirown page format has a magic value and the values seem to be selected carefully. But no one-for-all method to retrievethat. Each type of page can be confirmed by the following way *if* its type is previously *hinted* except for gin. btree : 32bit magic at pd->opaque gin : No magic gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_idspgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id hash : 16-bit magic at ((HashPageOpaque*)pd->paque)->hasho_page_id sequence : 16-bit magic at pd->opaque, the last 2 bytesof the page. # Is it comprehensive? and correct? The majority is 16-bit magic at the TAIL of opaque struct. If we can unify them into , say, 16-bit magic at *(uint16*)(pd->opaque)the sizeof() labyrinth become stable and simple and other tools which should identify the type of pageswill be happy. Do you think we can do that? # Sorry, time's up for today. > > - contrib/buffer_capture_cmp/README .. > Yeah right... This was a rest of some previous hacking on this feature. > Paragraph was rather unclear so I rewrote it, mentioning that the custom > script can use PGPORT to connect to the node where tests can be run. > > - contrib/buffer_capture_cmp/Makefile > > > > "make check" does nothing when BUFFER_CAPTURE is not defined, as ... > Sure, I added such a message in the makefile. > > - buffer_capture_cmp.c > > > > This source generates void executable when BUFFER_CAPTURE is .. > Done. The compilation of this utility is now independent on BUFFER_CAPTURE. > At the same time I made test.sh a bit smarter to have it grab the value of > BUFFER_CAPTURE_FILE directly from bufcapt.h. > > - buffer_capture_cmp.c/main() > > > > The parameters for this command are the parent directories for ... > Fixed. I changed back the utility to directly file names instead of data > folders as arguments. > > Updated patches addressing those comments are attached. > Regards, Thank you I'll look into them later. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
TODO On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > bufcapt.c: 326 memcpy(&tail, &page[BLCKSZ - 2], 2); > > This seems duzzling.. Isn't "*(uint16*)(&page[BLCKSZ - 2])" applicable? Well yes it is. > Pehaps this is showing that no tidy or generalized way to tell > what a page is used for. Many of the modules which have their > own page format has a magic value and the values seem to be > selected carefully. But no one-for-all method to retrieve that. You have a point here. > Each type of page can be confirmed by the following way *if* > its type is previously *hinted* except for gin. > > btree : 32bit magic at pd->opaque > gin : No magic > gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_id > spgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id > hash : 16-bit magic at ((HashPageOpaque*)pd->paque)->hasho_page_id > sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page. > > # Is it comprehensive? and correct? Sequence pages use the last 4 bytes. Have a look at sequence_magic in sequence.c. For btree pages we can use the last 2 bytes and a check on MAX_BT_CYCLE_ID. For gin, I'll investigate if it is possible to add a identifier like GIN_PAGE_ID, it would make the page analysis more consistent with the others. I am not sure for what the 8 bytes allocated for the special area are used now for though. > The majority is 16-bit magic at the TAIL of opaque struct. If > we can unify them into , say, 16-bit magic at > *(uint16*)(pd->opaque) the sizeof() labyrinth become stable and > simple and other tools which should identify the type of pages > will be happy. Do you think we can do that? Yes I think so. I'll raise a different thread though as this is a different problem that what this patch is targeting. I would even imagine a macro in bufpage.c able to handle that well. Regards, -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Pehaps this is showing that no tidy or generalized way to tell >> what a page is used for. Many of the modules which have their >> own page format has a magic value and the values seem to be >> selected carefully. But no one-for-all method to retrieve that. > You have a point here. Yeah, it's a bit messy, but I believe it's currently always possible to tell which access method a PG page belongs to. Look at pg_filedump. The last couple times we added index access methods, we took pains to make sure pg_filedump could figure out what their pages were. (IIRC, it's a combination of the special-space size and contents, but I'm too tired to go check the details right now.) > For gin, I'll investigate if it is possible to add a identifier like > GIN_PAGE_ID, it would make the page analysis more consistent with the > others. I am not sure for what the 8 bytes allocated for the special > area are used now for though. There is exactly zero chance that anyone will accept an on-disk format change just to make this prettier. regards, tom lane
On Thu, Jul 3, 2014 at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> Pehaps this is showing that no tidy or generalized way to tell >>> what a page is used for. Many of the modules which have their >>> own page format has a magic value and the values seem to be >>> selected carefully. But no one-for-all method to retrieve that. > >> You have a point here. > > Yeah, it's a bit messy, but I believe it's currently always possible to > tell which access method a PG page belongs to. Look at pg_filedump. > The last couple times we added index access methods, we took pains to > make sure pg_filedump could figure out what their pages were. (IIRC, > it's a combination of the special-space size and contents, but I'm too > tired to go check the details right now.) Yes, that's what the current code does btw, in this *messy* way. >> For gin, I'll investigate if it is possible to add a identifier like >> GIN_PAGE_ID, it would make the page analysis more consistent with the >> others. I am not sure for what the 8 bytes allocated for the special >> area are used now for though. > > There is exactly zero chance that anyone will accept an on-disk format > change just to make this prettier. Yeah thought so. -- Michael
Hello, This is the additional comments for other part. I haven't see significant defect in the code so far. ===== bufcapt.c: - buffer_capture_remember() or so. Pages for unlogged tables are avoided to be written takingadvantage that the lsn for such pages stays 0/0. I'd like to seeacomment mentioning for, say, buffer_capture_is_changed? orbuffer_capture_forget or somewhere. - buffer_capture_forget() However this error is likely not to occur, in the error message"could not find image...", the buffer id seems to bring noinformation.LSN, or quadraplet of LSN, rnode, forknum andblockno seems to be more informative. - buffer_capture_is_changed() The name for the function semes to be misleading. What thisfunction does is comparing LSNs between stored page image andcurrentpage. lsn_is_changed(BufferImage) or something likewould be more clearly. ====== bufmgr.c: - ConditionalLockBuffer() Sorry for a trivial comment, but the variable 'res' concealesthe meaning. "acquired" seems to be more preferable, isn't it? - LockBuffer() There is a 'XXX' comment. The discussion written there seems tobe right, for me. If you mind that peeking into there isa badbehavior, adding a macro into lwlock.h would help:p lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0) # I don't see this usable so much..bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock)) If there isn't any particular concern, 'XXX:' should be removed. ===== bufpage.c: - #include "storage/bufcapt.h" The include seems to be needless. ===== bufcapt.h: - header comment The file name is misspelled as 'bufcaptr.h'. Copyright notice of UC is needed? (Other files are the same) - The includes in this header except for buf.h seem not to be necessary. ===== init_env.sh: - pg_get_test_port() It determines server port using PG_VERSION_NUM, but is it necessary? Addition to it, the theoretical maximum initial PGPORTsemes to be 65535 but this script search for free port up to the number larger by 16 from the start, which would bebeyond the edge. - pg_get_test_port() It stops with error after 16 times of failure, but the message complains only about the final attempt. If you want to mentionthe port numbers, it might should be 'port $PGPORTSTART-$PGPORT ..' or would be 'All 16 ports attempted failed' orsomething.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
OK, I have been working more on this patch, improving on-the-fly the following things on top of what Horiguchi-san has reported: - Moved sequence page opaque data to sequence.h, renaming it at the same time. - Improvement of page type identification, particularly for sequences using a correct opaque data structure. For gin the process is not that cool, but I guess that there is nothing much to do as it has no proper page identifier :( On Thu, Jul 3, 2014 at 7:34 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > ===== bufcapt.c: > > - buffer_capture_remember() or so. > > Pages for unlogged tables are avoided to be written taking > advantage that the lsn for such pages stays 0/0. I'd like to see > a comment mentioning for, say, buffer_capture_is_changed? or > buffer_capture_forget or somewhere. Yes, it is worth mentioning and a comment in bufcapt.h seems enough. > - buffer_capture_forget() > > However this error is likely not to occur, in the error message > "could not find image...", the buffer id seems to bring no > information. LSN, or quadraplet of LSN, rnode, forknum and > blockno seems to be more informative. Yesh, this seems informative. > - buffer_capture_is_changed() > > The name for the function seems to be misleading. What this > function does is comparing LSNs between stored page image and > current page. lsn_is_changed(BufferImage) or something like > would be clearer. Hm, yes. This name looks better fine as it remains static within bufcapt.c. > ====== bufmgr.c: > > - ConditionalLockBuffer() > > Sorry for a trivial comment, but the variable 'res' conceales > the meaning. "acquired" seems to be more preferable, isn't it? Fixed. > - LockBuffer() > > There is a 'XXX' comment. The discussion written there seems to > be right, for me. If you mind that peeking into there is a bad > behavior, adding a macro into lwlock.h would help:p > > lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0) > lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0) I don't think that there is much to gain with such macros as of now LWLock->exclusive is only used in the code this patch introduces. > # I don't see this usable so much.. > > bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock)) > > If there isn't any particular concern, 'XXX:' should be removed. Well yes. > ===== bufpage.c: > > - #include "storage/bufcapt.h" > > The include seems to be needless. Yep. > ===== bufcapt.h: > > - header comment > > The file name is misspelled as 'bufcaptr.h'. Nicely spotted. > - The includes in this header except for buf.h seem not to be > necessary. Yep. > ===== init_env.sh: > > - pg_get_test_port() > It determines server port using PG_VERSION_NUM, but is it > necessary? Addition to it, the theoretical maximum initial > PGPORT seems to be 65535 but this script search for free port > up to the number larger by 16 from the start, which would be > beyond the edge. Hm, no. As of now, there is still some margin: PG_VERSION_NUM = 90500 PG_VERSION_NUM % 16384 + 49152 = 57732 > - pg_get_test_port() > > It stops with error after 16 times of failure, but the message > complains only about the final attempt. If you want to mention > the port numbers, it might should be 'port $PGPORTSTART-$PGPORT > ..' or would be 'All 16 ports attempted failed' or something.. Hm. You could complain about pg_upgrade as well now for the same thing. But I guess that it doesn't hurt to complain back to caller about the range of ports already in use, so changed this way. Regards, -- Michael
Attachment
Hello, thank you for considering my comments, including somewhat impractical ones. I'll have a look at the latest patch sooner. At Fri, 4 Jul 2014 15:29:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT_fs_3jLNHYWC6nzej4sBL6DGsLFVCg0JBUkgjeP9Tfw@mail.gmail.com> > OK, I have been working more on this patch, improving on-the-fly the > following things on top of what Horiguchi-san has reported: > - Moved sequence page opaque data to sequence.h, renaming it at the same time. > - Improvement of page type identification, particularly for sequences > using a correct opaque data structure. For gin the process is not that > cool, but I guess that there is nothing much to do as it has no proper > page identifier :( Year, there seems to be no choice than that. > On Thu, Jul 3, 2014 at 7:34 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > ===== bufcapt.c: > > > > - buffer_capture_remember() or so. ... > Yes, it is worth mentioning and a comment in bufcapt.h seems enough. > > > - buffer_capture_forget() ... > Yesh, this seems informative. > > > - buffer_capture_is_changed() ... > Hm, yes. This name looks better fine as it remains static within bufcapt.c. > > > ====== bufmgr.c: > > > > - ConditionalLockBuffer() ... > Fixed. > > > - LockBuffer() ... > > lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0) > > lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0) > I don't think that there is much to gain with such macros as of now > LWLock->exclusive is only used in the code this patch introduces. Year, I think so, too:p That's simply for the case if youthought that. > > If there isn't any particular concern, 'XXX:' should be removed. > Well yes. That's great. > > ===== bufpage.c: > > ===== bufcapt.h: > > > > - header comment > > > > The file name is misspelled as 'bufcaptr.h'. > Nicely spotted. Thank you ;) > > - The includes in this header except for buf.h seem not to be > > necessary. > Yep. > > > ===== init_env.sh: > > > > - pg_get_test_port() > > It determines server port using PG_VERSION_NUM, but is it > > necessary? Addition to it, the theoretical maximum initial > > PGPORT seems to be 65535 but this script search for free port > > up to the number larger by 16 from the start, which would be > > beyond the edge. > Hm, no. As of now, there is still some margin: > PG_VERSION_NUM = 90500 > PG_VERSION_NUM % 16384 + 49152 = 57732 Yes, it's practically no problem. I said about the theroretical max value seeing it without any preassumption about the value of PG_VERSION_NUM. There's in reality no problem before the PostgreSQL 9.82.88 comes, and 10.0.0 doesn't cause problem. So I'm not so dissapointed if it is not fixed. > > - pg_get_test_port() > > > > It stops with error after 16 times of failure, but the message > > complains only about the final attempt. If you want to mention > > the port numbers, it might should be 'port $PGPORTSTART-$PGPORT > > ..' or would be 'All 16 ports attempted failed' or something.. > Hm. You could complain about pg_upgrade as well now for the same > thing. But I guess that it doesn't hurt to complain back to caller > about the range of ports already in use, so changed this way. Yes, this comment is also comes from a kind of fastidiousness. I'm satisified not to fixed if you think so. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, At Thu, 3 Jul 2014 14:48:50 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ2y3QkapAsaC6oXXQTWbVkkxCrfTuA0w+DX-j3i-LByQ@mail.gmail.com> > TODO ... > > Each type of page can be confirmed by the following way *if* > > its type is previously *hinted* except for gin. > > > > btree : 32bit magic at pd->opaque > > gin : No magic > > gist : 16-bit magic at ((GISTPageOpaque*)pd->opaque)->gist_page_id > > spgist : 16-bit magic at ((SpGistPageOpaque*)pd->opaque)->spgist_page_id > > hash : 16-bit magic at ((HashPageOpaque*)pd->paque)->hasho_page_id > > sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page. > > > > # Is it comprehensive? and correct? > Sequence pages use the last 4 bytes. Have a look at sequence_magic in > sequence.c. > For btree pages we can use the last 2 bytes and a check on MAX_BT_CYCLE_ID. > For gin, I'll investigate if it is possible to add a identifier like > GIN_PAGE_ID, it would make the page analysis more consistent with the > others. I am not sure for what the 8 bytes allocated for the special > area are used now for though. > > > The majority is 16-bit magic at the TAIL of opaque struct. If > > we can unify them into , say, 16-bit magic at > > *(uint16*)(pd->opaque) the sizeof() labyrinth become stable and > > simple and other tools which should identify the type of pages > > will be happy. Do you think we can do that? > Yes I think so. I'll raise a different thread though as this is a > different problem that what this patch is targeting. I would even > imagine a macro in bufpage.c able to handle that well. Ok, that being the case, this topic should be stashed and I'll look into there regardless of it. Thank you. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, The new patch looks good for me. The usage of this is a bit irregular as a (extension) module but it is the nature of this 'contrib'. The rearranged page type detection logic seems neater and keeps to work as intended. This logic will be changed after the new page type detection scheme becomes ready by the another patch. I have some additional comments, which should be the last ones. All of the comments are about test.sh. - A question mark seems missing from the end of the message "Has build been done with -DBUFFER_CAPTURE included in CFLAGS"in test.sh. - "make check" runs "regress --use-existing" but IMHO the make target which is expected to do that is installcheck. I hadfooled to convince that it should run the postgres which is built dedicatedly:( - postgres processes are left running after test_default(custom).sh has stopped halfway. This can be fixed with the attachedpatch, but, to be honest, this seems too much. I'll follow your decision whether or not to do this. (bufcapt_test_sh_20140710.patch) - test_default.sh is not only an example script which will run while utilize this facility, but the test script for thisfacility itself. So I think it would be better be added some queries so that all possible page types available for the default testing. Whatdo you think about the attached patch? (hash index is unlogged but I dared to put it for clarity.) (bufcapt_test_default_sh_20140710.patch) regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/contrib/buffer_capture_cmp/test.sh b/contrib/buffer_capture_cmp/test.sh index 89740bb..ba5e290 100644 --- a/contrib/buffer_capture_cmp/test.sh +++ b/contrib/buffer_capture_cmp/test.sh @@ -117,16 +117,27 @@ pg_ctl -w -D $TEST_STANDBY start# Check the presence of custom tests and kick them in priority. Ifnot,# fallback to the default tests. Tests need only to be run on the master# node. +if [ -f ./test-custom.sh ]; then - . ./test-custom.sh + TEST_SCRIPT=test-custom.shelse - . ./test-default.sh + TEST_SCRIPT=test-default.shfi +set +e +bash -e $TEST_SCRIPT +EXITSTATUS=$? +set -e +# Time to stop the nodes as tests have runpg_ctl -w -D $TEST_MASTER stoppg_ctl -w -D $TEST_STANDBY stop +if [ $EXITSTATUS != 0 ]; then + echo "$TEST_SCRIPT exited by error" + exit 1; +fi +DIFF_FILE=capture_differences.txt# Check if the capture files exist. If not, build may have not been diff --git a/contrib/buffer_capture_cmp/test-default.sh b/contrib/buffer_capture_cmp/test-default.sh index 5bec503..24091ff 100644 --- a/contrib/buffer_capture_cmp/test-default.sh +++ b/contrib/buffer_capture_cmp/test-default.sh @@ -11,4 +11,16 @@# cd $ROOT_DIR# Create a simple table -psql -c 'CREATE TABLE aa AS SELECT generate_series(1, 10) AS a' +psql -c 'CREATE TABLE tbtree AS SELECT generate_series(1, 10) AS a' +psql -c 'CREATE INDEX i_tbtree ON tbtree USING btree(a)' +psql -c 'CREATE TABLE thash AS SELECT generate_series(1, 10) AS a' +psql -c 'CREATE INDEX i_thash ON thash USING hash(a)' +psql -c 'CREATE TABLE tgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a' +psql -c 'CREATE INDEX i_tgist ON tgist USING gist(p1)' +psql -c 'CREATE TABLE tspgist AS SELECT POINT(a, a) AS p1 FROM generate_series(0, 10) a' +psql -c 'CREATE INDEX i_tspgist ON tspgist USING spgist(p1)' +psql -c 'CREATE TABLE tgin AS SELECT ARRAY[a/10, a%10] as a1 from generate_series(0, 10) a' +psql -c 'CREATE INDEX i_tgin ON tgin USING gin(a1)' +psql -c 'CREATE SEQUENCE sq1' + +
Updated patches attached. On Thu, Jul 10, 2014 at 7:13 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > The usage of this is a bit irregular as a (extension) module but > it is the nature of this 'contrib'. The rearranged page type > detection logic seems neater and keeps to work as intended. This > logic will be changed after the new page type detection scheme > becomes ready by the another patch. No disk format changes will be allowed just to make page detection easier (Tom mentioned that earlier in this thread btw). We'll have to live with what current code offers, especially considering that adding new bytes for page detection for gin pages would double the size of its special area after applying MAXALIGN to it. > - A question mark seems missing from the end of the message "Has > build been done with -DBUFFER_CAPTURE included in CFLAGS" in > test.sh. Fixed. > - "make check" runs "regress --use-existing" but IMHO the make > target which is expected to do that is installcheck. I had > fooled to convince that it should run the postgres which is > built dedicatedly:( Yes, the patch is abusing of that. --use-existing is specified in this case because the test itself is controlling Postgres servers to build and fetch the buffer captures. This allows more flexible machinery IMO. > - postgres processes are left running after > test_default(custom).sh has stopped halfway. This can be fixed > with the attached patch, but, to be honest, this seems too > much. I'll follow your decision whether or not to do this. > (bufcapt_test_sh_20140710.patch) I had considered that first, thinking that it was the user responsibility if things are screwed up with his custom scripts. I guess that the way you are doing it is a safeguard simple enough though, so included with some editing, particularly reporting to the user the error code returned by the test script. > - test_default.sh is not only an example script which will run > while utilize this facility, but the test script for this > facility itself. > So I think it would be better be added some queries so that all > possible page types available for the default testing. What do > you think about the attached patch? (hash index is unlogged > but I dared to put it for clarity.) Yeah, having a default set of queries run just to let the user get an idea of how it works improves things. Once again thanks for taking the time to look at that. Regards, -- Michael
Attachment
Hello, Let me apologize for continuing the discussion even though the deadline is approaching. At Fri, 11 Jul 2014 09:49:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTJEzOz-FotibSEjyG0eaBngx2PLqywoDChYFXzFqYQkg@mail.gmail.com> > Updated patches attached. > > On Thu, Jul 10, 2014 at 7:13 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > > The usage of this is a bit irregular as a (extension) module but > > it is the nature of this 'contrib'. The rearranged page type > > detection logic seems neater and keeps to work as intended. This > > logic will be changed after the new page type detection scheme > > becomes ready by the another patch. > > No disk format changes will be allowed just to make page detection > easier (Tom mentioned that earlier in this thread btw). We'll have to > live with what current code offers, > especially considering that adding > new bytes for page detection for gin pages would double the size of > its special area after applying MAXALIGN to it. That's awkward, but I agree with it. By the way, I found PageHeaderData.pd_flags to have 9 bits room. It seems to be usable if no other usage is in sight right now, if the formal method to identify page types is worth the 3-4 bits there. # This is a separate discussion from this patch itself. > > - "make check" runs "regress --use-existing" but IMHO the make > > target which is expected to do that is installcheck. I had > > fooled to convince that it should run the postgres which is > > built dedicatedly:( > > Yes, the patch is abusing of that. --use-existing is specified in this > case because the test itself is controlling Postgres servers to build > and fetch the buffer captures. This allows more flexible machinery > IMO. Although I doubt necessity of the flexibility seeing the current testing framework, I don't have so strong objection about that. Nevertheless, perhaps you are appreciated to put a notice on.. README or somewhere. > > - postgres processes are left running after > > test_default(custom).sh has stopped halfway. This can be fixed > > with the attached patch, but, to be honest, this seems too > > much. I'll follow your decision whether or not to do this. > > (bufcapt_test_sh_20140710.patch) > > I had considered that first, thinking that it was the user > responsibility if things are screwed up with his custom scripts. I > guess that the way you are doing it is a safeguard simple enough > though, so included with some editing, particularly reporting to the > user the error code returned by the test script. Agreed. > > - test_default.sh is not only an example script which will run > > while utilize this facility, but the test script for this > > facility itself. > > So I think it would be better be added some queries so that all > > possible page types available for the default testing. What do > > you think about the attached patch? (hash index is unlogged > > but I dared to put it for clarity.) > > Yeah, having a default set of queries run just to let the user get an > idea of how it works improves things. > Once again thanks for taking the time to look at that. Thank you. regardes, -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Although I doubt necessity of the flexibility seeing the current > testing framework, I don't have so strong objection about > that. Nevertheless, perhaps you are appreciated to put a notice > on.. README or somewhere. Hm, well... Fine, I added it in this updated series. Regards, -- Michael
Attachment
Hello, > > Although I doubt necessity of the flexibility seeing the current > > testing framework, I don't have so strong objection about > > that. Nevertheless, perhaps you are appreciated to put a notice > > on.. README or somewhere. > Hm, well... Fine, I added it in this updated series. Thank you for your patience:) After all, I have no more comment about this patch. I will mark this as 'Ready for committer' unless no comment comes from others for a few days. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, I'm not so confident whether it's the time to do this... I mark this as 'Ready for Committer' since no additional comment or objection was put by others on this patch. > After all, I have no more comment about this patch. I will mark > this as 'Ready for committer' unless no comment comes from others > for a few days. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Michael Paquier wrote: > On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Although I doubt necessity of the flexibility seeing the current > > testing framework, I don't have so strong objection about > > that. Nevertheless, perhaps you are appreciated to put a notice > > on.. README or somewhere. > Hm, well... Fine, I added it in this updated series. Did this go anywhere? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Michael Paquier wrote: > On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Although I doubt necessity of the flexibility seeing the current > > testing framework, I don't have so strong objection about > > that. Nevertheless, perhaps you are appreciated to put a notice > > on.. README or somewhere. > Hm, well... Fine, I added it in this updated series. FWIW I gave this a trial run and found I needed some tweaks to test.sh and the Makefile in order to make it work on VPATH; mainly replace ./ with `dirname $0` in a couple test.sh in a couple of places, and something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere which doesn't work. Also you have the Makefile checking for -DBUFFER_CAPTURE exactly but for some reason I used -DBUFFER_CAPTURE=1 which wasn't well received by your $(filter) stuff. Instead of checking CFLAGS it might make more sense to expose it as a read-only GUC and grep `postmaster -C buffer_capture` or similar. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 11/4/14 3:21 PM, Alvaro Herrera wrote: > FWIW I gave this a trial run and found I needed some tweaks to test.sh > and the Makefile in order to make it work on VPATH; mainly replace ./ > with `dirname $0` in a couple test.sh in a couple of places, and > something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere > which doesn't work. I also saw some bashisms in the script. Maybe the time for shell-based test scripts has passed?
Thanks for the tests.
On Wed, Nov 5, 2014 at 5:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
FWIW I gave this a trial run and found I needed some tweaks to test.shMichael Paquier wrote:
> On Mon, Jul 14, 2014 at 6:14 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Although I doubt necessity of the flexibility seeing the current
> > testing framework, I don't have so strong objection about
> > that. Nevertheless, perhaps you are appreciated to put a notice
> > on.. README or somewhere.
> Hm, well... Fine, I added it in this updated series.
and the Makefile in order to make it work on VPATH; mainly replace ./
with `dirname $0` in a couple test.sh in a couple of places, and
something similar in the Makefile. Also you have $PG_ROOT_DIR somewhere
which doesn't work.
Ah thanks, forgot that.
Also you have the Makefile checking for -DBUFFER_CAPTURE exactly but for
some reason I used -DBUFFER_CAPTURE=1 which wasn't well received by your
$(filter) stuff. Instead of checking CFLAGS it might make more sense to
expose it as a read-only GUC and grep `postmaster -C buffer_capture` or
similar.
Yes that's a good idea.
Now, do we really want this feature in-core? That's somewhat a duplicate of what is mentioned here:
http://www.postgresql.org/message-id/CAB7nPqQMq=4eJAK317mxZ4Has0i+1rSLBQU29zx18JwLB2j1OA@mail.gmail.com
http://www.postgresql.org/message-id/CAB7nPqQMq=4eJAK317mxZ4Has0i+1rSLBQU29zx18JwLB2j1OA@mail.gmail.com
Of course both things do not have the same coverage as the former is for buildfarm and dev, while the latter is dedidated to production systems, but could be used for development as well.
The patch sent there is a bit outdated, but a potential implementation gets simpler with XLogReadBufferForRedo able to return flags about each block state during redo. I am still planning to come back to it for this cycle, though I stopped for now waiting for the WAL format patches finish to shape the APIs this feature would rely on.
Regards,
--
Michael
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Wed, Nov 5, 2014 at 6:29 AM, Peter Eisentraut<span dir="ltr"><<a href="mailto:peter_e@gmx.net" target="_blank">peter_e@gmx.net</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On11/4/14 3:21 PM, Alvaro Herrera wrote:<br /> > FWIW I gave this a trial run and found I needed some tweaksto test.sh<br /> > and the Makefile in order to make it work on VPATH; mainly replace ./<br /> > with `dirname$0` in a couple test.sh in a couple of places, and<br /> > something similar in the Makefile. Also you have $PG_ROOT_DIRsomewhere<br /> > which doesn't work.<br /><br /></span>I also saw some bashisms in the script.<br /><br />Maybe the time for shell-based test scripts has passed?<br /></blockquote></div>Except pg_upgrade, are there other testsusing bash?<br />-- <br /><div class="gmail_signature">Michael<br /></div></div></div>
Michael Paquier wrote: > Now, do we really want this feature in-core? That's somewhat a duplicate of > what is mentioned here: > http://www.postgresql.org/message-id/CAB7nPqQMq=4eJAK317mxZ4Has0i+1rSLBQU29zx18JwLB2j1OA@mail.gmail.com > Of course both things do not have the same coverage as the former is for > buildfarm and dev, while the latter is dedidated to production systems, but > could be used for development as well. Oh, I had forgotten that other patch. > The patch sent there is a bit outdated, but a potential implementation gets > simpler with XLogReadBufferForRedo able to return flags about each block > state during redo. I am still planning to come back to it for this cycle, > though I stopped for now waiting for the WAL format patches finish to shape > the APIs this feature would rely on. I agree it makes sense to wait until the WAL reworks are done -- glad to hear you're putting some time in this area. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 11/4/14 10:50 PM, Michael Paquier wrote: > Except pg_upgrade, are there other tests using bash? There are a few obscure things under src/test/.
On Thu, Nov 6, 2014 at 5:41 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
$ git grep "/sh" src/test/On 11/4/14 10:50 PM, Michael Paquier wrote:
> Except pg_upgrade, are there other tests using bash?
There are a few obscure things under src/test/.
Oh, I see. There is quite a number here, and each script is doing quite different things..
src/test/locale/de_DE.ISO8859-1/runall:#! /bin/sh
src/test/locale/gr_GR.ISO8859-7/runall:#! /bin/sh
src/test/locale/koi8-r/runall:#! /bin/sh
src/test/locale/koi8-to-win1251/runall:#! /bin/sh
src/test/mb/mbregress.sh:#! /bin/sh
src/test/performance/start-pgsql.sh:#!/bin/sh
src/test/regress/regressplans.sh:#! /bin/sh
--
Michael
Michael Paquier wrote: > From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001 > From: Michael Paquier <michael@otacoo.com> > Date: Sat, 19 Jul 2014 10:40:20 +0900 > Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency Is there a newer version of this tech? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 18, 2015 at 3:39 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Michael Paquier wrote: > >> From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001 >> From: Michael Paquier <michael@otacoo.com> >> Date: Sat, 19 Jul 2014 10:40:20 +0900 >> Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency > > Is there a newer version of this tech? Not yet. -- Michael