Thread: [GENERAL] checkpoint clarifications needed
Hi, I need some clarifications about checkpoints.
Below is a log from my standby server at startup, followed by some excerpts of the relevant WAL from the master's cluster, obtained with pg_xlogdump.
Just to have an example to discuss.
1) I see: "LOG: redo starts at 1/F00A7448" . I was expecting a checkpoint record around 1/F00A7448 but the related checkpoint record is at lsn: 1/FCBD7510 instead.
I expected that because I read in the doc :"Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with ALL information written BEFORE that checkpoint".
And I interpreted that as "All information written before that checkpoint RECORD" but now I guess that one thing is a checkpoint point and one thing is a checkpoint record. Right?
2) I see that a checkpoint position can be right in the middle of a group of records related to a transaction (in the example, transaction id 10684). So a checkpoint position is NOT a consistency state point, right?
3) According to the doc (section 29.5), what is saved in pg_control is the position of the last checkpoint record (not the checkpoint position itself), right?
4) If I'm right at 2), then between the checkpoint position (1/F00A7448) and the checkpoint record position (1/FCBD7510) there must be a point where the DB is in a consistent state. If not, in case of a crash just after writing the checkpoint record to the WAL and its position to pg_control, the system would replay from the checkpoint position (known from the last checkpoint record) without ever finding a consistent state. Right?
5) How can we define, in terms of log records, a consistency state position?
Best regards
Pupillo
STANDBY SERVER LOG
LOG: redo starts at 1/F00A7448
....
LOG: consistent recovery state reached at 2/426DF28
LOG: invalid record length at 2/426DF28: wanted 24, got 0
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 2/4000000 on timeline 1
FROM PG_XLOGDUMP OF MASTER
........
rmgr: Heap len (rec/tot): 14/ 1186, tx: 10684, lsn: 1/F009EEE0, prev 1/F009EEA8, desc: UPDATE off 1 xmax 10684 ; new off 3 xmax 0, blkref #0: rel 1663/16384/16422 blk 20054, blkref #1: rel 1663/16384/16422 blk 19774
rmgr: Btree len (rec/tot): 2/ 64, tx: 10684, lsn: 1/F009F388, prev 1/F009EEE0, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/16428 blk 711
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F009F3C8, prev 1/F009F388, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9663 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00A13E8, prev 1/F009F3C8, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9664 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00A3408, prev 1/F00A13E8, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9665 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00A5428, prev 1/F00A3408, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9666 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00A7448, prev 1/F00A5428, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9667 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00A9468, prev 1/F00A7448, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9668 FPW
rmgr: XLOG len (rec/tot): 0/ 8193, tx: 0, lsn: 1/F00AB488, prev 1/F00A9468, desc: FPI_FOR_HINT , blkref #0: rel 1663/16384/16484 blk 9669 FPW
rmgr: Heap len (rec/tot): 8/ 8063, tx: 10682, lsn: 1/F00AD4A8, prev 1/F00AB488, desc: LOCK off 5: xid 10682: flags 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16384/16422 blk 19787 FPW
rmgr: Heap len (rec/tot): 8/ 8063, tx: 10684, lsn: 1/F00AF440, prev 1/F00AD4A8, desc: LOCK off 2: xid 10684: flags 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16384/16422 blk 19774 FPW
rmgr: Heap len (rec/tot): 14/ 4657, tx: 10682, lsn: 1/F00B13D8, prev 1/F00AF440, desc: UPDATE off 5 xmax 10682 ; new off 4 xmax 0, blkref #0: rel 1663/16384/16422 blk 20075 FPW, blkref #1: rel 1663/16384/16422 blk 19787
rmgr: Heap len (rec/tot): 14/ 4657, tx: 10684, lsn: 1/F00B2628, prev 1/F00B13D8, desc: UPDATE off 2 xmax 10684 ; new off 4 xmax 0, blkref #0: rel 1663/16384/16422 blk 20054 FPW, blkref #1: rel 1663/16384/16422 blk 19774
rmgr: Btr
.......
rmgr: Heap2 len (rec/tot): 8/ 68, tx: 0, lsn: 1/FCBD7448, prev 1/FCBD7400, desc: CLEAN remxid 10903, blkref #0: rel 1663/16384/16422 blk 2001
rmgr: Heap2 len (rec/tot): 8/ 66, tx: 0, lsn: 1/FCBD7490, prev 1/FCBD7448, desc: CLEAN remxid 10903, blkref #0: rel 1663/16384/16422 blk 2003
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 1/FCBD74D8, prev 1/FCBD7490, desc: RUNNING_XACTS nextXid 10907 latestCompletedXid 10906 oldestRunningXid 10907
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 1/FCBD7510, prev 1/FCBD74D8, desc: CHECKPOINT_ONLINE redo 1/F00A7448; tli 1; prev tli 1; fpw true; xid 0:10685; oid 24665; multi 1; offset 0; oldest xid 584 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 10682; online
rmgr: Heap len (rec/tot): 3/ 164, tx: 10907, lsn: 1/FCBD7580, prev 1/FCBD7510, desc: INSERT off 25, blkref #0: rel 1663/16384/16484 blk 16398
rmgr: Btree len (rec/tot): 2/ 64, tx: 10907, lsn: 1/FCBD7628, prev 1/FCBD7580, desc: INSERT_LEAF off 60, blkref #0: rel 1663/16384/16490 blk 2722
........
On 01/09/2017 06:14 AM, Tom DalPozzo wrote:
> Hi, I need some clarifications about checkpoints.
> [...]

https://www.postgresql.org/docs/9.5/static/wal-internals.html

"After a checkpoint has been made and the log flushed, the checkpoint's position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the log position indicated in the checkpoint record. Because the entire content of data pages is saved in the log on the first page modification after a checkpoint (assuming full_page_writes is not disabled), all pages changed since the checkpoint will be restored to a consistent state."

--
Adrian Klaver
adrian.klaver@aklaver.com
Hi, yes I know that, it's what I meant in my point 3). Since it first says "the checkpoint's position (NOT the record) is saved in the file pg_control", I had some doubt.
Regards
Pupillo
On 01/09/2017 06:47 AM, Tom DalPozzo wrote:
> [...]

Yes it is just one piece of information stored in the file. To see what else is stored there do:

pg_controldata -D your_cluster_data_directory

--
Adrian Klaver
adrian.klaver@aklaver.com
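For what it's worth, a trimmed excerpt of that output looks roughly like the lines below (field names as printed by 9.5/9.6 pg_controldata; the LSNs are simply the ones from the example in this thread, shown for illustration):

    Latest checkpoint location:           1/FCBD7510
    Latest checkpoint's REDO location:    1/F00A7448

The first is the LSN of the checkpoint record itself; the second is the redo point, i.e. where "redo starts" in the standby log above.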
On 01/09/2017 06:47 AM, Tom DalPozzo wrote:
> [...]

Reread your original post and realized you were also asking about transaction consistency and WALs. The thumbnail version is that Postgres writes transactions to the WALs before they are written to the data files on disk. A checkpoint represents a point in the sequence when it is known that the changes recorded in the WAL have also been recorded in the disk data files. So Postgres then knows that in a recovery scenario it only needs to redo/replay the WAL changes that are past the last checkpoint. So the transactions are there; it is just a matter of whether they need to be replayed or not. This is subject to caveats:

https://www.postgresql.org/docs/9.5/static/wal-reliability.html

--
Adrian Klaver
adrian.klaver@aklaver.com
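As a side note, the stretch of WAL between the redo point and the checkpoint record in the example above can be listed with something along these lines (illustrative only; the LSNs are the ones from this thread, $PGDATA/pg_xlog is the usual 9.5/9.6 WAL directory, and -p/-s/-e are the path, start and end options of pg_xlogdump):

    pg_xlogdump -p $PGDATA/pg_xlog -s 1/F00A7448 -e 1/FCBD7510

Every record this prints falls between the redo point and the checkpoint record; on a crash right after the checkpoint, recovery would start replaying from 1/F00A7448 and work forward through all of these records.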
Hi, I had already read that doc, but it doesn't clearly answer my questions 2), 4) and 5).
Regards
Pupillo
On 01/09/2017 01:10 PM, Tom DalPozzo wrote:
> Hi, I had already read that doc, but it doesn't clearly answer my questions 2), 4) and 5).

The answer would seem to depend on what you consider 'a consistency state position'. Is it possible to be more explicit about what you mean?

--
Adrian Klaver
adrian.klaver@aklaver.com
Tom DalPozzo wrote:
> 2) I see that a checkpoint position can be right in the middle of a group of records related to a transaction (in the example, transaction id 10684). So a checkpoint position is NOT a consistency state point, right?
> [...]
> 5) How can we define, in terms of log records, a consistency state position?
Whether any individual tuple in the data files is visible or not depends
not only on the data itself, but also on the commit status of the
transactions that created it (and deleted it, if any). Replaying WAL
also updates the commit status of transactions, so if you're in the
middle of replaying WAL, you may be adding tuples to the data files, but
those tuples will not become visible until their commit records are also
updated.
You can stop replaying WAL at any point, and data will always be in a
consistent state. Some data tuples might be "from the future" and those
will not be visible, which is what makes it all consistent.
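A concrete way to see this with the tools already used in this thread: whether a given transaction's commit record actually made it into the WAL can be checked with something like the following (illustrative; the LSNs and xid are taken from the example above, and -r/-x are the rmgr and xid filters of pg_xlogdump):

    pg_xlogdump -p $PGDATA/pg_xlog -s 1/F00A7448 -e 2/426DF28 -r Transaction -x 10684

If transaction 10684 committed, a "rmgr: Transaction ... desc: COMMIT ..." record shows up in that range; if no such record exists, every tuple that transaction wrote stays invisible after replay, exactly as described above.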
Hi,
so let's suppose that the WAL is:
LSN 10: start transaction 123
LSN 11: update tuple 100
checkpoint position here (not a record but just for understanding)
LSN 12: update tuple 100
LSN 13: update tuple 100
LSN 14: checkpoint record (position=11)
LSN 15: update tuple 100
and that the system crashes now, before all of the transaction's records are written to the WAL (the remaining updates and the commit record are missing).
At replay, starting from LSN 12, the entire page as it was at LSN 11 is written to disk, even though it carries inconsistent data.
Then we can even replay up to the end of the WAL, but we always end up with inconsistent data.
BUT, you say, since the tuple is not committed in the WAL, only the old version of the tuple will be visible? Right?
Regards
Pupillo
Tom DalPozzo wrote:
> [...]
> BUT, you say, since the tuple is not committed in the WAL, only the old version of the tuple will be visible? Right?
Yes -- all the updated tuples are invisible because the commit record
for transaction 123 does not appear in the WAL. A future VACUUM will remove
all those tuples. Note that precisely for this reason, the original
version of the tuple had not been removed yet.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
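Put differently, after such a crash and recovery the observable behaviour would be something like the sketch below (the table "t", its columns and the shell invocation are made up purely for illustration; only xid 123 and "tuple 100" come from the example above):

    # Hypothetical check after recovery: only the pre-crash, committed row version is returned.
    psql -c "SELECT xmin, * FROM t WHERE id = 100;"
    # The row versions created by xid 123 are still physically present in the heap,
    # but they are skipped because xid 123 was never marked committed, and a later
    # VACUUM will remove them.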
Ok! Now many things are clear to me....
Thank you very much
Pupillo
>> Hi, I had already read that doc, but it doesn't clearly answer my questions 2), 4) and 5).
> The answer would seem to depend on what you consider 'a consistency state position'. Is it possible to be more explicit about what you mean?
Hi, I meant a position such that, if you replay up to it, then the DB is in a consistent state (transactions done entirely or not at all...).
But, as Álvaro wrote, any position is OK.
Thank you very much
Pupillo