Thread: 9.3 to 9.5 upgrade problems
Hi all, I have a master (web1) and two slaves (web2, webserv), one slave is quite far from the master, the db is 112 Gig, so pg_basebackupis my last resort. I followed the page here: https://www.postgresql.org/docs/9.5/static/pgupgrade.html including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, and again a week ago, on two VM's I created locally. Both practice sessions worked perfect. I just ran it on the live databases. The master seems ok, its running PG 9.5 now, I can login to it, and no errors in thelog. Neither slave works. After I'd gotten done with the pgupgrade steps, both slaves gave me this error: FATAL: database system identifier differs between the primary and standby Sure enough pg_controldata show'd their database system id different (all three web1, web2, webserv were different. no matchesat all), so I'm assuming the rsync didnt rsync right, or I missed a step and ran it to early, or something ... I'mnot quite sure. I needed to get the live website back up and running again, so I let the master go, ran analyze, and when it was finished,used the steps here to try and resync: https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial on Master: select pg_start_backup('clone',true); rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* web2:/pub/pg95/ select pg_stop_backup(); rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/ That ran pretty quick, and pg_controldata shows matching numbers, but when I start the slave I get: ,,2016-07-03 06:06:57.173 CDT,: LOG: entering standby mode ,,2016-07-03 06:06:57.205 CDT,: LOG: redo starts at 369/D6002228 ,,2016-07-03 06:06:57.984 CDT,: LOG: consistent recovery state reached at 369/DCC5DB90 ,,2016-07-03 06:06:57.984 CDT,: LOG: database system is ready to accept read only connections ,,2016-07-03 06:06:57.984 CDT,: LOG: invalid record length at 369/DD038ED0 ,,2016-07-03 06:06:58.344 CDT,: LOG: started streaming WAL from primary at 369/DD000000 on timeline 1 web,[unknown],2016-07-03 06:07:11.176 CDT,[local]: FATAL: role "andy" does not exist I can login as myself on the master, but not on the slave. when I "psql -U postgres" on the slave I get: psql: FATAL: cache lookup failed for database 16401 This is only on web2, its close to web1, so I'm hoping I can get it fixed and then rsync it quickly to the far away slave. I'm at a loss here, any hints or suggestions would be appreciated. Thanks, -Andy
binary replication requires the versions be identical. Also, once you ran pg_upgrade you altered one of the copies so binary replication can no longer work on that either.
On Sun, Jul 3, 2016 at 11:06 AM, Andy Colson <andy@squeakycode.net> wrote:
Hi all,
I have a master (web1) and two slaves (web2, webserv), one slave is quite far from the master, the db is 112 Gig, so pg_basebackup is my last resort.
I followed the page here:
https://www.postgresql.org/docs/9.5/static/pgupgrade.html
including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, and again a week ago, on two VM's I created locally. Both practice sessions worked perfect.
I just ran it on the live databases. The master seems ok, its running PG 9.5 now, I can login to it, and no errors in the log.
Neither slave works. After I'd gotten done with the pgupgrade steps, both slaves gave me this error:
FATAL: database system identifier differs between the primary and standby
Sure enough pg_controldata show'd their database system id different (all three web1, web2, webserv were different. no matches at all), so I'm assuming the rsync didnt rsync right, or I missed a step and ran it to early, or something ... I'm not quite sure.
I needed to get the live website back up and running again, so I let the master go, ran analyze, and when it was finished, used the steps here to try and resync:
https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
on Master:
select pg_start_backup('clone',true);
rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* web2:/pub/pg95/
select pg_stop_backup();
rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/
That ran pretty quick, and pg_controldata shows matching numbers, but when I start the slave I get:
,,2016-07-03 06:06:57.173 CDT,: LOG: entering standby mode
,,2016-07-03 06:06:57.205 CDT,: LOG: redo starts at 369/D6002228
,,2016-07-03 06:06:57.984 CDT,: LOG: consistent recovery state reached at 369/DCC5DB90
,,2016-07-03 06:06:57.984 CDT,: LOG: database system is ready to accept read only connections
,,2016-07-03 06:06:57.984 CDT,: LOG: invalid record length at 369/DD038ED0
,,2016-07-03 06:06:58.344 CDT,: LOG: started streaming WAL from primary at 369/DD000000 on timeline 1
web,[unknown],2016-07-03 06:07:11.176 CDT,[local]: FATAL: role "andy" does not exist
I can login as myself on the master, but not on the slave. when I "psql -U postgres" on the slave I get:
psql: FATAL: cache lookup failed for database 16401
This is only on web2, its close to web1, so I'm hoping I can get it fixed and then rsync it quickly to the far away slave.
I'm at a loss here, any hints or suggestions would be appreciated.
Thanks,
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
> On Sun, Jul 3, 2016 at 11:06 AM, Andy Colson <andy@squeakycode.net <mailto:andy@squeakycode.net>> wrote: > > Hi all, > > I have a master (web1) and two slaves (web2, webserv), one slave is quite far from the master, the db is 112 Gig, sopg_basebackup is my last resort. > > I followed the page here: > https://www.postgresql.org/docs/9.5/static/pgupgrade.html > > including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, and again a week ago, on two VM's I createdlocally. Both practice sessions worked perfect. > On 07/03/2016 10:11 AM, Vick Khera wrote: > binary replication requires the versions be identical. Also, once you ran pg_upgrade you altered one of the copies so binaryreplication can no longer work on that either. > Yes, all three boxes are running Pg 9.5. I've uninstalled the PG 9.3 package, and delete the /pub/pg93 database directory. My rsync copied the pg95 folder from web1 to web2. -Andy
On 07/03/2016 08:06 AM, Andy Colson wrote: > Hi all, > > I have a master (web1) and two slaves (web2, webserv), one slave is > quite far from the master, the db is 112 Gig, so pg_basebackup is my > last resort. > > I followed the page here: > https://www.postgresql.org/docs/9.5/static/pgupgrade.html > > including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, > and again a week ago, on two VM's I created locally. Both practice > sessions worked perfect. > > I just ran it on the live databases. The master seems ok, its running > PG 9.5 now, I can login to it, and no errors in the log. > > Neither slave works. After I'd gotten done with the pgupgrade steps, > both slaves gave me this error: > > FATAL: database system identifier differs between the primary and standby > > Sure enough pg_controldata show'd their database system id different > (all three web1, web2, webserv were different. no matches at all), so > I'm assuming the rsync didnt rsync right, or I missed a step and ran it > to early, or something ... I'm not quite sure. > > I needed to get the live website back up and running again, so I let the > master go, ran analyze, and when it was finished, used the steps here to > try and resync: > > https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial > > on Master: > select pg_start_backup('clone',true); > rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* > web2:/pub/pg95/ > select pg_stop_backup(); > rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/ Not sure about above rsync, that seems to undo what you did previously. Also was the remote directory empty when you did this? > > > That ran pretty quick, and pg_controldata shows matching numbers, but > when I start the slave I get: > > ,,2016-07-03 06:06:57.173 CDT,: LOG: entering standby mode > ,,2016-07-03 06:06:57.205 CDT,: LOG: redo starts at 369/D6002228 > ,,2016-07-03 06:06:57.984 CDT,: LOG: consistent recovery state reached > at 369/DCC5DB90 > ,,2016-07-03 06:06:57.984 CDT,: LOG: database system is ready to accept > read only connections > ,,2016-07-03 06:06:57.984 CDT,: LOG: invalid record length at 369/DD038ED0 > ,,2016-07-03 06:06:58.344 CDT,: LOG: started streaming WAL from primary > at 369/DD000000 on timeline 1 > web,[unknown],2016-07-03 06:07:11.176 CDT,[local]: FATAL: role "andy" > does not exist > > I can login as myself on the master, but not on the slave. when I "psql > -U postgres" on the slave I get: > > psql: FATAL: cache lookup failed for database 16401 > > This is only on web2, its close to web1, so I'm hoping I can get it > fixed and then rsync it quickly to the far away slave. > > I'm at a loss here, any hints or suggestions would be appreciated. > > Thanks, > > -Andy > > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/03/2016 10:35 AM, Adrian Klaver wrote: > On 07/03/2016 08:06 AM, Andy Colson wrote: >> Hi all, >> >> I have a master (web1) and two slaves (web2, webserv), one slave is >> quite far from the master, the db is 112 Gig, so pg_basebackup is my >> last resort. >> >> I followed the page here: >> https://www.postgresql.org/docs/9.5/static/pgupgrade.html >> >> including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, >> and again a week ago, on two VM's I created locally. Both practice >> sessions worked perfect. >> >> I just ran it on the live databases. The master seems ok, its running >> PG 9.5 now, I can login to it, and no errors in the log. >> >> Neither slave works. After I'd gotten done with the pgupgrade steps, >> both slaves gave me this error: >> >> FATAL: database system identifier differs between the primary and standby >> >> Sure enough pg_controldata show'd their database system id different >> (all three web1, web2, webserv were different. no matches at all), so >> I'm assuming the rsync didnt rsync right, or I missed a step and ran it >> to early, or something ... I'm not quite sure. >> >> I needed to get the live website back up and running again, so I let the >> master go, ran analyze, and when it was finished, used the steps here to >> try and resync: >> >> https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial >> >> on Master: >> select pg_start_backup('clone',true); >> rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* >> web2:/pub/pg95/ >> select pg_stop_backup(); >> rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/ > > Not sure about above rsync, that seems to undo what you did previously. > > Also was the remote directory empty when you did this? > Not sure what you mean by undo. pgupgrade.html page, step 10, has you rsync the master to the slave, so the pg95 directoryis hard linked to the pg93, which save's a ton to bandwidth when your servers are cross county. My second rsync did the same thing, but only on the pg95 directory (my db lives in /pub/pg95). No, the directory was not empty, and I'm really trying to avoid a fresh copy of 112 Gig. -Andy
On 07/03/2016 08:49 AM, Andy Colson wrote: > On 07/03/2016 10:35 AM, Adrian Klaver wrote: >> On 07/03/2016 08:06 AM, Andy Colson wrote: >>> Hi all, >>> >>> I have a master (web1) and two slaves (web2, webserv), one slave is >>> quite far from the master, the db is 112 Gig, so pg_basebackup is my >>> last resort. >>> >>> I followed the page here: >>> https://www.postgresql.org/docs/9.5/static/pgupgrade.html >>> >>> including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, >>> and again a week ago, on two VM's I created locally. Both practice >>> sessions worked perfect. >>> >>> I just ran it on the live databases. The master seems ok, its running >>> PG 9.5 now, I can login to it, and no errors in the log. >>> >>> Neither slave works. After I'd gotten done with the pgupgrade steps, >>> both slaves gave me this error: >>> >>> FATAL: database system identifier differs between the primary and >>> standby >>> >>> Sure enough pg_controldata show'd their database system id different >>> (all three web1, web2, webserv were different. no matches at all), so >>> I'm assuming the rsync didnt rsync right, or I missed a step and ran it >>> to early, or something ... I'm not quite sure. >>> >>> I needed to get the live website back up and running again, so I let the >>> master go, ran analyze, and when it was finished, used the steps here to >>> try and resync: >>> >>> https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial >>> >>> on Master: >>> select pg_start_backup('clone',true); >>> rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* >>> web2:/pub/pg95/ >>> select pg_stop_backup(); >>> rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/ >> >> Not sure about above rsync, that seems to undo what you did previously. >> >> Also was the remote directory empty when you did this? >> > > Not sure what you mean by undo. pgupgrade.html page, step 10, has you > rsync the master to the slave, so the pg95 directory is hard linked to > the pg93, which save's a ton to bandwidth when your servers are cross > county. I understand I am just trying to figure out what mixing methods (pg-upgrade, pg_start_backup) is doing? In particular the section on pg_start_backup: https://www.postgresql.org/docs/9.5/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP starts with: "Ensure that WAL archiving is enabled and working." and from I gather that is not the case. > > My second rsync did the same thing, but only on the pg95 directory (my > db lives in /pub/pg95). > > No, the directory was not empty, and I'm really trying to avoid a fresh > copy of 112 Gig. > > -Andy > > > -- Adrian Klaver adrian.klaver@aklaver.com
On 07/03/2016 11:04 AM, Adrian Klaver wrote: > On 07/03/2016 08:49 AM, Andy Colson wrote: >> On 07/03/2016 10:35 AM, Adrian Klaver wrote: >>> On 07/03/2016 08:06 AM, Andy Colson wrote: >>>> Hi all, >>>> >>>> I have a master (web1) and two slaves (web2, webserv), one slave is >>>> quite far from the master, the db is 112 Gig, so pg_basebackup is my >>>> last resort. >>>> >>>> I followed the page here: >>>> https://www.postgresql.org/docs/9.5/static/pgupgrade.html >>>> >>>> including the rsync stuff. I practiced it _twice_, once in PG 9.5 beta, >>>> and again a week ago, on two VM's I created locally. Both practice >>>> sessions worked perfect. >>>> >>>> I just ran it on the live databases. The master seems ok, its running >>>> PG 9.5 now, I can login to it, and no errors in the log. >>>> >>>> Neither slave works. After I'd gotten done with the pgupgrade steps, >>>> both slaves gave me this error: >>>> >>>> FATAL: database system identifier differs between the primary and >>>> standby >>>> >>>> Sure enough pg_controldata show'd their database system id different >>>> (all three web1, web2, webserv were different. no matches at all), so >>>> I'm assuming the rsync didnt rsync right, or I missed a step and ran it >>>> to early, or something ... I'm not quite sure. >>>> >>>> I needed to get the live website back up and running again, so I let the >>>> master go, ran analyze, and when it was finished, used the steps here to >>>> try and resync: >>>> >>>> https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial >>>> >>>> on Master: >>>> select pg_start_backup('clone',true); >>>> rsync -av --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* >>>> web2:/pub/pg95/ >>>> select pg_stop_backup(); >>>> rsync -av /pub/pg95/pg_xlog web2:/pub/pg95/ >>> >>> Not sure about above rsync, that seems to undo what you did previously. >>> >>> Also was the remote directory empty when you did this? >>> >> >> Not sure what you mean by undo. pgupgrade.html page, step 10, has you >> rsync the master to the slave, so the pg95 directory is hard linked to >> the pg93, which save's a ton to bandwidth when your servers are cross >> county. > > I understand I am just trying to figure out what mixing methods (pg-upgrade, pg_start_backup) is doing? > > In particular the section on pg_start_backup: > > https://www.postgresql.org/docs/9.5/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP > > starts with: > > "Ensure that WAL archiving is enabled and working." > > and from I gather that is not the case. > The slave log says it reached a consistent state, and is accepting connections, which tells me it should run ok, maybe notwith the newest data, but at least not get: FATAL: cache lookup failed for database 16401 > > Also was the remote directory empty when you did this? Now that I think about this more, I think you're on to something. I'm trying to get an _exact_ copy of the master db ontothe slave. Checking rsync man, it matches only on size and modified time, and I didn't include deletes. I'm going to re-try with this: select pg_start_backup('clone',true); rsync -av --delete --checksum --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* web2:/pub/pg95/ select pg_stop_backup(); rsync -av --delete --checksum /pub/pg95/pg_xlog web2:/pub/pg95/ That should make sure the copies are exact. -Andy
On 07/03/2016 11:42 AM, Andy Colson wrote: > > Now that I think about this more, I think you're on to something. > I'm trying to get an _exact_ copy of the master db onto the slave. > Checking rsync man, it matches only on size and modified time, and I > didn't include deletes. > > I'm going to re-try with this: > > select pg_start_backup('clone',true); rsync -av --delete --checksum > --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* > web2:/pub/pg95/ select pg_stop_backup(); rsync -av --delete > --checksum /pub/pg95/pg_xlog web2:/pub/pg95/ > > That should make sure the copies are exact. > > > -Andy > > > That seemed to have worked. Took forever, but once I started the slave on web2 I was able to login and run queries withoutany errors or problems. Thanks Adrian, -Andy
On 07/03/2016 12:41 PM, Andy Colson wrote: > On 07/03/2016 11:42 AM, Andy Colson wrote: >> >> Now that I think about this more, I think you're on to something. >> I'm trying to get an _exact_ copy of the master db onto the slave. >> Checking rsync man, it matches only on size and modified time, and I >> didn't include deletes. >> >> I'm going to re-try with this: >> >> select pg_start_backup('clone',true); rsync -av --delete --checksum >> --exclude pg_xlog --exclude postgresql.conf /pub/pg95/* >> web2:/pub/pg95/ select pg_stop_backup(); rsync -av --delete >> --checksum /pub/pg95/pg_xlog web2:/pub/pg95/ >> >> That should make sure the copies are exact. >> >> >> -Andy >> >> >> > > That seemed to have worked. Took forever, but once I started the slave > on web2 I was able to login and run queries without any errors or problems. Glad it worked out. > > Thanks Adrian, > > -Andy > > -- Adrian Klaver adrian.klaver@aklaver.com