Thread: Streaming replication on 9.1-beta2 after pg_restore is very slow
Hi all,
I am experimenting with (synchronous) streaming replication on postgresql 9.1 beta 2 and am having performance problems. I have initially set up an (asynchronous) streaming replication master cluster with postgresql 9.0, which streamed to a single slave cluster. This seemed to work quite well. Then I’ve mostly copied the configuration to a 9.1 beta 2 cluster (master and slave) to see how synchronous replication would behave.
The master cluster, when empty after an initdb (pg_createcluster on debian) seems to properly stream changes to one or more slave clusters when correctly set up. I watch the master and slave with pg_current_xlog_location() on master and pg_last_xlog_receive_location() and pg_last_xlog_replay_location(). It seems that slaves pick up changes, such as a simple database creation, or updating a role password, or adding a role. But when I then do a restoredb on the master, the slaves quickly fall behind and only very slowly catch up (after maybe an hour or something...)
The log on the slave is filled with statements similar to the following:
LOG: streaming replication successfully connected to primary
LOG: record with zero length at 0/9B7A010
FATAL: terminating walreceiver process due to administrator command
The slave log file also contains the following line a number of times (with the numbers of course a bit different every time):
LOG: invalid magic number 0000 in log file 0, segment 9, offset 10878976
The log on the master contains several lines with:
LOG: could not send data to client: Connection reset by peer
Is there possibly a known issue with the beta, or do I have to configure my cluster differently for 9.1? I’m a bit lost, and would appreciate any comments. Below, I’ve added server configurations. I’m running postgresql from debian packages from the experimental suite.
Thanks,
David Hartveld
--
== Master configuration ==
"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"
"archive_command";"cp %p /walshipping/9.1/sr-master/%f"
"archive_mode";"on"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"external_pid_file";"/var/run/postgresql/9.1-sr-master.pid"
"lc_collate";"en_US.UTF-8"
"lc_ctype";"en_US.UTF-8"
"listen_addresses";"*"
"log_line_prefix";"%t "
"max_connections";"100"
"max_stack_depth";"2MB"
"max_wal_senders";"3"
"port";"5434"
"server_encoding";"UTF8"
"shared_buffers";"96MB"
"ssl";"on"
"synchronous_standby_names";"*"
"TimeZone";"localtime"
"unix_socket_directory";"/var/run/postgresql"
"wal_buffers";"3MB"
"wal_keep_segments";"32"
"wal_level";"hot_standby"
== Slave configuration ==
"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"external_pid_file";"/var/run/postgresql/9.1-sr-slave0.pid"
"hot_standby";"on"
"lc_collate";"en_US.UTF-8"
"lc_ctype";"en_US.UTF-8"
"listen_addresses";"*"
"log_line_prefix";"%t "
"max_connections";"100"
"max_stack_depth";"2MB"
"port";"5434"
"server_encoding";"UTF8"
"shared_buffers";"96MB"
"ssl";"on"
"TimeZone";"localtime"
"unix_socket_directory";"/var/run/postgresql"
"wal_buffers";"3MB"
On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David.Hartveld@mendix.com> wrote: > Is there possibly a known issue with the beta, or do I have to configure my > cluster differently for 9.1? Thanks for trying 9.1beta No known bugs, no differences in configuration. You haven't enabled any of the new 9.1 features either so they aren't likely to be at issue. So there's something pretty badly screwed up somewhere, though that looks like pilot error at the moment, sorry. I'd suggest starting again and see if you can get a reproduceable bug. I'd be very grateful if you can narrow things down to produce a tight bug report. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> -----Oorspronkelijk bericht----- > Van: Simon Riggs [mailto:simon@2ndQuadrant.com] > Verzonden: donderdag 7 juli 2011 01:07 > Aan: David Hartveld > CC: pgsql-general@postgresql.org > Onderwerp: Re: [GENERAL] Streaming replication on 9.1-beta2 after pg_restore > is very slow > > On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David.Hartveld@mendix.com> > wrote: > > > Is there possibly a known issue with the beta, or do I have to > > configure my cluster differently for 9.1? > > Thanks for trying 9.1beta > > No known bugs, no differences in configuration. > > You haven't enabled any of the new 9.1 features either so they aren't likely to be > at issue. > > So there's something pretty badly screwed up somewhere, though that looks like > pilot error at the moment, sorry. > > I'd suggest starting again and see if you can get a reproduceable bug. > I'd be very grateful if you can narrow things down to produce a tight bug report. I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is reproduciblyvery slow. If you need more information, let me know.
On Thu, Jul 7, 2011 at 1:12 PM, David Hartveld <David.Hartveld@mendix.com> wrote: >> >> On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David.Hartveld@mendix.com> >> wrote: >> >> > Is there possibly a known issue with the beta, or do I have to >> > configure my cluster differently for 9.1? >> >> Thanks for trying 9.1beta >> >> No known bugs, no differences in configuration. >> >> You haven't enabled any of the new 9.1 features either so they aren't likely to be >> at issue. >> >> So there's something pretty badly screwed up somewhere, though that looks like >> pilot error at the moment, sorry. >> >> I'd suggest starting again and see if you can get a reproduceable bug. >> I'd be very grateful if you can narrow things down to produce a tight bug report. > > I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is reproduciblyvery slow. If you need more information, let me know. Bug 6094 contains no additional information and there is not yet a confirmed bug. Your output indicates that there is a problem in your replication setup and this is why the slave does not catch up. This is not a performance issue. It is either a bug in replication, or a user configuration issue. Since few things have changed in 9.1 in this area, at the moment the balance of probability is towards user error. If you can provide a more isolated bug report we may be able to investigate. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> > Your output indicates that there is a problem in your replication > > setup and this is why the slave does not catch up. > > > > This is not a performance issue. It is either a bug in replication, or > > a user configuration issue. Since few things have changed in 9.1 in > > this area, at the moment the balance of probability is towards user > > error. If you can provide a more isolated bug report we may be able to > investigate. > > Apologies for the double post, I thought to have understood that in your > previous message. > > I've read the online 9.1 manual and configured the clusters based on that > information (and on the defaults provided by debian). I've attached the > postgresql.conf files I'm using for master and slave. Do you need other > information from my final setup? Log files, configuration files, the SQL script fed > to psql, which shows the slow replication...? I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can seethat somehow the master and slave don't properly work together: the slave attempts to send some data ('sending write/flush/apply')(I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to administratorcommand', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then 'couldnot send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this providesyou with more information on what is going on. Do point me in the right direction if you need me to investigate further.I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction, whereyou can see the above behavior. Hoping that this will bring me a bit closer to a solution or a proper bug report, David Hartveld
On Thu, Jul 7, 2011 at 2:59 PM, David Hartveld <David.Hartveld@mendix.com> wrote: > I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can seethat somehow the master and slave don't properly work together: the slave attempts to send some data ('sending write/flush/apply')(I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to administratorcommand', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then 'couldnot send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this providesyou with more information on what is going on. Do point me in the right direction if you need me to investigate further.I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction, whereyou can see the above behavior. Ah, so synchronous_standby_names is set on the standby. Please reset that so we are operating asynchronously, then rerun tests to see if that avoids the error. You'll probably need to fully re-generate the standby server before doing this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 7, 2011 at 3:37 PM, David Hartveld <david.hartveld@mendix.com> wrote: > Other suggestions? If speed is your concern, a little performance tuning might help, judging from your configs. http://www.2ndQuadrant.com/books/ or other sources will help. I'm interested in seeing some ERROR messages from either system, showing their sequence between master/standby. At the moment its not clear what the first error is. Subsequent messages are less interesting. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Looking back, I notice that you built with gcc 4.6.0. At least on Red Hat machines, that gcc has a rather nasty optimization bug that breaks WAL replay, with symptoms that seem to match what you have here --- namely, the replay process quits and has to be restarted every few pages. I'm betting Debian hasn't fixed that bug yet either and so you need this post-beta2 patch: http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a regards, tom lane
On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote: > Looking back, I notice that you built with gcc 4.6.0. At least on Red > Hat machines, that gcc has a rather nasty optimization bug that breaks > WAL replay, with symptoms that seem to match what you have here --- > namely, the replay process quits and has to be restarted every few pages. > I'm betting Debian hasn't fixed that bug yet either and so you need this > post-beta2 patch: > > http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a Do they know about this gcc bug ? Karsten -- GPG key ID E4071346 @ gpg-keyserver.de E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote: >> I'm betting Debian hasn't fixed that bug yet either and so you need this >> post-beta2 patch: >> >> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a > Do they know about this gcc bug ? Can't say about Debian in particular, but upstream gcc certainly knows about it. https://bugzilla.redhat.com/show_bug.cgi?id=712480 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390 regards, tom lane
Op 07-07-11 18:23, Tom Lane schreef: > Karsten Hilbert<Karsten.Hilbert@gmx.net> writes: >> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote: >>> I'm betting Debian hasn't fixed that bug yet either and so you need this >>> post-beta2 patch: >>> >>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a > >> Do they know about this gcc bug ? > > Can't say about Debian in particular, but upstream gcc certainly knows > about it. > https://bugzilla.redhat.com/show_bug.cgi?id=712480 > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390 I'm guessing this is probably the cause of my problem, then. I'll try to rebuild with an earlier version of GCC, and let you know the results. Thanks to all of you! David
Op 08-07-11 10:19, David Hartveld schreef: > Op 07-07-11 18:23, Tom Lane schreef: >> Karsten Hilbert<Karsten.Hilbert@gmx.net> writes: >>> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote: >>>> I'm betting Debian hasn't fixed that bug yet either and so you need >>>> this >>>> post-beta2 patch: >>>> >>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a >>>> >> >>> Do they know about this gcc bug ? >> >> Can't say about Debian in particular, but upstream gcc certainly knows >> about it. >> https://bugzilla.redhat.com/show_bug.cgi?id=712480 >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390 > > I'm guessing this is probably the cause of my problem, then. I'll try to > rebuild with an earlier version of GCC, and let you know the results. I've rebuilt with debian gcc 4.4.5-8 and am running 9.1 now without any of the above problems. I've reported a bug for debian postgresql-9.1. I've also seen that the current gcc-4.6 version for debian is 4.6.1, so that shouldn't create any problems anymore (or at least the gcc bug is marked as applied in 4.6.1 in the gcc bugtracker). Thanks for all your time, guys! Greetings, David