Thread: testing hot standby
Hi, i'm startint to try Hot Standby & Streaming Replication, so i started a replication: 1) Install master server with regression database 2) Start WAL archive (archive_mode=on, archive_command='cp %p /usr/local/pgsql/wal_archive/%f') 3) select pg_start_backup('standby test'); 4) cp -R /usr/local/pgsql/9.0/data /usr/local/pgsql/9.0slave/data 5) select pg_stop_backup(); at this point i checked wal_archive directory: """ postgres@casanova14:/usr/local/pgsql/9.0$ ls ../wal_archive/ 000000010000000000000003 000000010000000000000004 000000010000000000000004.00000020.backup """ 6) started standby recovery (archive_mode=off, standy_mode=on, primary_conninfo = 'host=127.0.0.1 port=5432 user=postgres') wait a little and check logs: """ LOG: database system was interrupted; last known up at 2010-04-09 14:48:16 ECT LOG: entering standby mode LOG: restored log file "000000010000000000000004" from archive LOG: redo starts at 0/4000020 LOG: consistent recovery state reached at 0/5000000 LOG: database system is ready to accept read only connections LOG: restored log file "000000010000000000000005" from archive cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000000000006»: No existe el fichero ó directorio LOG: unexpected pageaddr 0/2000000 in log file 0, segment 6, offset 0 cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000000000006»: No existe el fichero ó directorio LOG: streaming replication successfully connected to primary """ mmm... are we waiting for a WAL file that doesn't exist? 7) i then, restart standby server """ LOG: received smart shutdown request FATAL: terminating walreceiver process due to administrator command LOG: shutting down LOG: database system is shut down LOG: database system was interrupted while in recovery at log time 2010-04-09 15:06:23 ECT HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. LOG: entering standby mode cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000000000006»: No existe el fichero ó directorio LOG: invalid record length at 0/6000080 cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000000000006»: No existe el fichero ó directorio """ 8) i initialize pgbench tables, which create missing WAL files (bin/pgbench -i) and then it could connect to the primary, and some minutes later it could accept connections """ LOG: streaming replication successfully connected to primary FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up LOG: redo starts at 0/6000080 LOG: consistent recovery state reached at 0/60000A0 FATAL: the database system is starting up LOG: database system is ready to accept read only connections """ but, my main concern is why it was asking for "000000010000000000000006"? is this normal? is this standby's way of saying i'm working but i have nothing to do? when that happens after a standby restart, is normal that i have to wait until the file is created before it can accept connections? sorry, if this questions sound very simple but i haven't following all the design details :) -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Fri, Apr 9, 2010 at 3:39 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > Hi, > > i'm startint to try Hot Standby & Streaming Replication, so i started > a replication: > i think "make standbycheck" needs a little more work, why it isn't accesible from top of source dir? For now, to excercise it i have to do (on the standby server): - cd src/test/regress - psql -h ip_primary -f sql/hs_primary_setup.sql - make standbycheck (and 2 tests fail for differences in the messages, patch to fix attached) just an idea, can't we use the info about primary_conninfo to know how to execute the script hs_primary_setup.sql on the primary? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Attachment
On Sat, Apr 10, 2010 at 5:39 AM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > i'm startint to try Hot Standby & Streaming Replication, so i started > a replication: Great! > but, my main concern is why it was asking for > "000000010000000000000006"? is this normal? The standby server tries to replay all of the available WAL files in the archive and pg_xlog directory at first. Then, when an invalid record is found or no more WAL file is available, it starts streaming replication and tries to read the missing WAL files from the primary. In your testing, an invalid record was found in 000000010000000000000006, then the standby was waiting for it to be shipped from the primary. > is this standby's way of > saying i'm working but i have nothing to do? Since SR is a record-based log-shipping, the standby can receive the WAL records from the primary before the WAL file has been filled up. So, in your case, I guess the standby was receiving the WAL records which belong to 000000010000000000000006 from the primary. > when that happens after a standby restart, is normal that i have to > wait until the file is created before it can accept connections? No. All the WAL records for the standby to accept connections should be shipped before any additional WAL records are written. Didn't the standby accept connections before executing pgbench? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Apr 9, 2010 at 3:39 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > > but, my main concern is why it was asking for > "000000010000000000000006"? is this normal? is this standby's way of > saying i'm working but i have nothing to do? > when that happens after a standby restart, is normal that i have to > wait until the file is created before it can accept connections? > ok, i see this again in a new env. seems like this happen when i shutdown standby and primary (in that order) after making some WAL-logged action on the primary an then start again primary and standby (in that order)... it doesn't occur always but it does occur too often, still i'm not sure what is the key factor that triggers this standby waits for a file that doesn't exist to reach a consistent state (last time i wait for an hour after i force a WAL-logged action), here is an extract of the message on standby's log: """ postgres@casanova1:/usr/local/pgsql/9.0slave$ cat data/pg_log/postgresql-2010-04-12_000947.log LOG: database system was interrupted while in recovery at log time 2010-04-11 20:44:09 GMT HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. LOG: entering standby mode LOG: restored log file "000000010000000E00000014" from archive LOG: redo starts at E/14000088 LOG: consistent recovery state reached at E/15000000 cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000E00000015»: No existe el fichero o el directorio LOG: unexpected pageaddr D/EE000000 in log file 14, segment 21, offset 0 cp: no se puede efectuar `stat' sobre «/usr/local/pgsql/wal_archive/000000010000000E00000015»: No existe el fichero o el directorio LOG: streaming replication successfully connected to primary """ another point, what happened with this: http://archives.postgresql.org/message-id/1229549172.4793.105.camel@ebony.2ndQuadrant? Obviously we still have the problem with hash indexes, and in that thread Tom advice was just to document the issue and while that could be fine at least we should be emitting better messages, consider this one that i got on the standby server (where 4658650 is the oid of a hash index): """ mic=# explain analyze select * from tt1 where col1 = 5000; ERROR: could not read block 0 in file "base/21958/4658650": read only 0 of 8192 bytes """ -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Mon, Apr 12, 2010 at 1:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Didn't the standby > accept connections before executing pgbench? > nop, and last time i try it was in that state for an hour (without accepting connections)... after that i execute on the primary: CREATE TABLE tt2 AS SELECT generate_series(1, 100) as i After that, the standby start accepting connections -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Mon, Apr 12, 2010 at 3:29 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Mon, Apr 12, 2010 at 1:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Didn't the standby >> accept connections before executing pgbench? >> > > nop, and last time i try it was in that state for an hour (without > accepting connections)... after that i execute on the primary: CREATE > TABLE tt2 AS SELECT generate_series(1, 100) as i > After that, the standby start accepting connections OK. Your reproduction scenario is the following? If not, could you show me the complete scenario? 1. start the primary 2. pg_start_backup() 3. copy $PGDATA from the primary to the standby 4. pg_stop_backup(); 5. create the recovery.conf and start the standby 6. shutdown (smart mode) the standby 7. start the standby again --> cannot accept connnections until new WAL has been created Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Jaime Casanova wrote: > On Fri, Apr 9, 2010 at 3:39 PM, Jaime Casanova > <jcasanov@systemguards.com.ec> wrote: >> but, my main concern is why it was asking for >> "000000010000000000000006"? is this normal? is this standby's way of >> saying i'm working but i have nothing to do? Yes. >> when that happens after a standby restart, is normal that i have to >> wait until the file is created before it can accept connections? > > ok, i see this again in a new env. seems like this happen when i > shutdown standby and primary (in that order) after making some > WAL-logged action on the primary an then start again primary and > standby (in that order)... it doesn't occur always but it does occur > too often, still i'm not sure what is the key factor that triggers > this Perhaps you're being bitten by the "can't start hot standby from a shutdown checkpoint" issue I've complained for a long time. There's a pending patch for that, see http://archives.postgresql.org/pgsql-hackers/2010-04/msg00222.php. If that's what's hitting you, the way to reproduce is: 1. shut down primary 2. shut down standby 3. start standby 4. start primary. The standby will get hung until it receives an online checkpoint record from the primary. Connecting to the primary and issuing a manual CHECKPOINT helps. That's not the order of shutdowns you described, though... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Apr 12, 2010 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Apr 12, 2010 at 3:29 PM, Jaime Casanova > <jcasanov@systemguards.com.ec> wrote: >> On Mon, Apr 12, 2010 at 1:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Didn't the standby >>> accept connections before executing pgbench? >>> >> >> nop, and last time i try it was in that state for an hour (without >> accepting connections)... after that i execute on the primary: CREATE >> TABLE tt2 AS SELECT generate_series(1, 100) as i >> After that, the standby start accepting connections > > OK. Your reproduction scenario is the following? > If not, could you show me the complete scenario? > > 1. start the primary > 2. pg_start_backup() > 3. copy $PGDATA from the primary to the standby > 4. pg_stop_backup(); > 5. create the recovery.conf and start the standby execute some WAL-logged action (i've seen this happen even with no WAL-logged action if i wait for a while before shutdown servers) > 6. shutdown (smart mode) the standby shutdown (smart) the primary start the primary again > 7. start the standby again > --> cannot accept connnections until new WAL has been created > a manual CHECKPOINT on the primary does help -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Mon, Apr 12, 2010 at 9:27 AM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: > On Mon, Apr 12, 2010 at 1:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Mon, Apr 12, 2010 at 3:29 PM, Jaime Casanova >> <jcasanov@systemguards.com.ec> wrote: >>> On Mon, Apr 12, 2010 at 1:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> Didn't the standby >>>> accept connections before executing pgbench? >>>> >>> >>> nop, and last time i try it was in that state for an hour (without >>> accepting connections)... after that i execute on the primary: CREATE >>> TABLE tt2 AS SELECT generate_series(1, 100) as i >>> After that, the standby start accepting connections >> >> OK. Your reproduction scenario is the following? >> If not, could you show me the complete scenario? >> >> 1. start the primary >> 2. pg_start_backup() >> 3. copy $PGDATA from the primary to the standby >> 4. pg_stop_backup(); >> 5. create the recovery.conf and start the standby > > execute some WAL-logged action (i've seen this happen even with no > WAL-logged action if i wait for a while before shutdown servers) > >> 6. shutdown (smart mode) the standby > > shutdown (smart) the primary > start the primary again > >> 7. start the standby again i guess, this is because the primary is in recovery when the standby tries to connect to it, and it should wait until the primary is ready but seems like the primary is failing to advertise itself and the standby doesn't recheck the condition... could be? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Mon, Apr 12, 2010 at 11:27 PM, Jaime Casanova <jcasanov@systemguards.com.ec> wrote: >> 1. start the primary >> 2. pg_start_backup() >> 3. copy $PGDATA from the primary to the standby >> 4. pg_stop_backup(); >> 5. create the recovery.conf and start the standby > > execute some WAL-logged action (i've seen this happen even with no > WAL-logged action if i wait for a while before shutdown servers) > >> 6. shutdown (smart mode) the standby > > shutdown (smart) the primary > start the primary again > >> 7. start the standby again >> --> cannot accept connnections until new WAL has been created I was not able to reproduce the problem using the above scenario. But when I did one more restart of the primary and standby, I was able to observe the problem. If this is the same as you encountered, it would be the "can't start hot standby from a shutdown checkpoint" issue that Heikki pointed out. So it's very helpful to check whether the posted patch fixes your problem or not. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00407.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Apr 12, 2010 at 8:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Apr 12, 2010 at 11:27 PM, Jaime Casanova > <jcasanov@systemguards.com.ec> wrote: >>> 1. start the primary >>> 2. pg_start_backup() >>> 3. copy $PGDATA from the primary to the standby >>> 4. pg_stop_backup(); >>> 5. create the recovery.conf and start the standby >> >> execute some WAL-logged action (i've seen this happen even with no >> WAL-logged action if i wait for a while before shutdown servers) >> >>> 6. shutdown (smart mode) the standby >> >> shutdown (smart) the primary >> start the primary again >> >>> 7. start the standby again >>> --> cannot accept connnections until new WAL has been created > > I was not able to reproduce the problem using the above scenario. > But when I did one more restart of the primary and standby, I was > able to observe the problem. If this is the same as you encountered, > it would be the "can't start hot standby from a shutdown checkpoint" > issue that Heikki pointed out. So it's very helpful to check whether > the posted patch fixes your problem or not. > http://archives.postgresql.org/pgsql-hackers/2010-04/msg00407.php > this patch seems to fix the problem for me... i will read it on the morning and the thread where it is, something that seems strange to me is that the patch touch twophase.c and twophase.h, why? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Jaime Casanova wrote: > On Mon, Apr 12, 2010 at 8:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> But when I did one more restart of the primary and standby, I was >> able to observe the problem. If this is the same as you encountered, >> it would be the "can't start hot standby from a shutdown checkpoint" >> issue that Heikki pointed out. So it's very helpful to check whether >> the posted patch fixes your problem or not. >> http://archives.postgresql.org/pgsql-hackers/2010-04/msg00407.php > > this patch seems to fix the problem for me... Committed, thanks for the testing. > i will read it on the morning and the thread where it is, something > that seems strange to me is that the patch touch twophase.c and > twophase.h, why? When you start hot standby from an online checkpoint, the XIDs of any two-phase transactions are included in the running-xacts record, just like any other in-progress transactions. At a shutdown checkpoint, we know that no regular transactions are in-progress, but there can be transactions in prepared state, which need to be considered as in-progress in the standby, but there's no information about them in the shutdown record. So we scan pg_twophase to discover them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Apr 13, 2010 at 11:06 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Jaime Casanova wrote: >> >> i will read it on the morning and the thread where it is, something >> that seems strange to me is that the patch touch twophase.c and >> twophase.h, why? > > When you start hot standby from an online checkpoint, the XIDs of any > two-phase transactions are included in the running-xacts record, just > like any other in-progress transactions. At a shutdown checkpoint, we > know that no regular transactions are in-progress, but there can be > transactions in prepared state, which need to be considered as > in-progress in the standby, but there's no information about them in the > shutdown record. So we scan pg_twophase to discover them. > > ah! that makes sense... thanks -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157