Thread: Streaming replication on 9.1-beta2 after pg_restore is very slow

Streaming replication on 9.1-beta2 after pg_restore is very slow

From
David Hartveld
Date:

Hi all,

 

I am experimenting with (synchronous) streaming replication on postgresql 9.1 beta 2 and am having performance problems. I have initially set up an (asynchronous) streaming replication master cluster with postgresql 9.0, which streamed to a single slave cluster. This seemed to work quite well. Then I’ve mostly copied the configuration to a 9.1 beta 2 cluster (master and slave) to see how synchronous replication would behave.

 

The master cluster, when empty after an initdb (pg_createcluster on debian) seems to properly stream changes to one or more slave clusters when correctly set up. I watch the master and slave with pg_current_xlog_location() on master and pg_last_xlog_receive_location() and pg_last_xlog_replay_location(). It seems that slaves pick up changes, such as a simple database creation, or updating a role password, or adding a role. But when I then do a restoredb on the master,  the slaves quickly fall behind and only very slowly catch up (after maybe an hour or something...)

 

The log on the slave is filled with statements similar to the following:

LOG: streaming replication successfully connected to primary

LOG:  record with zero length at 0/9B7A010

FATAL:  terminating walreceiver process due to administrator command

 

The slave log file also contains the following line a number of times (with the numbers of course a bit different every time):

               LOG: invalid magic number 0000 in log file 0, segment 9, offset 10878976

 

The log on the master contains several lines with:

               LOG:  could not send data to client: Connection reset by peer

 

Is there possibly a known issue with the beta, or do I have to configure my cluster differently for 9.1? I’m a bit lost, and would appreciate any comments. Below, I’ve added server configurations. I’m running postgresql from debian packages from the experimental suite.

 

Thanks,

David Hartveld

--

== Master configuration ==

"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"

"archive_command";"cp %p /walshipping/9.1/sr-master/%f"

"archive_mode";"on"

"bytea_output";"escape"

"client_encoding";"UNICODE"

"external_pid_file";"/var/run/postgresql/9.1-sr-master.pid"

"lc_collate";"en_US.UTF-8"

"lc_ctype";"en_US.UTF-8"

"listen_addresses";"*"

"log_line_prefix";"%t "

"max_connections";"100"

"max_stack_depth";"2MB"

"max_wal_senders";"3"

"port";"5434"

"server_encoding";"UTF8"

"shared_buffers";"96MB"

"ssl";"on"

"synchronous_standby_names";"*"

"TimeZone";"localtime"

"unix_socket_directory";"/var/run/postgresql"

"wal_buffers";"3MB"

"wal_keep_segments";"32"

"wal_level";"hot_standby"

 

== Slave configuration ==

"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"

"bytea_output";"escape"

"client_encoding";"UNICODE"

"external_pid_file";"/var/run/postgresql/9.1-sr-slave0.pid"

"hot_standby";"on"

"lc_collate";"en_US.UTF-8"

"lc_ctype";"en_US.UTF-8"

"listen_addresses";"*"

"log_line_prefix";"%t "

"max_connections";"100"

"max_stack_depth";"2MB"

"port";"5434"

"server_encoding";"UTF8"

"shared_buffers";"96MB"

"ssl";"on"

"TimeZone";"localtime"

"unix_socket_directory";"/var/run/postgresql"

"wal_buffers";"3MB"

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Simon Riggs
Date:
On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld
<David.Hartveld@mendix.com> wrote:

> Is there possibly a known issue with the beta, or do I have to configure my
> cluster differently for 9.1?

Thanks for trying 9.1beta

No known bugs, no differences in configuration.

You haven't enabled any of the new 9.1 features either so they aren't
likely to be at issue.

So there's something pretty badly screwed up somewhere, though that
looks like pilot error at the moment, sorry.

I'd suggest starting again and see if you can get a reproduceable bug.
I'd be very grateful if you can narrow things down to produce a tight
bug report.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
David Hartveld
Date:
> -----Oorspronkelijk bericht-----
> Van: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Verzonden: donderdag 7 juli 2011 01:07
> Aan: David Hartveld
> CC: pgsql-general@postgresql.org
> Onderwerp: Re: [GENERAL] Streaming replication on 9.1-beta2 after pg_restore
> is very slow
>
> On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David.Hartveld@mendix.com>
> wrote:
>
> > Is there possibly a known issue with the beta, or do I have to
> > configure my cluster differently for 9.1?
>
> Thanks for trying 9.1beta
>
> No known bugs, no differences in configuration.
>
> You haven't enabled any of the new 9.1 features either so they aren't likely to be
> at issue.
>
> So there's something pretty badly screwed up somewhere, though that looks like
> pilot error at the moment, sorry.
>
> I'd suggest starting again and see if you can get a reproduceable bug.
> I'd be very grateful if you can narrow things down to produce a tight bug report.

I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is
reproduciblyvery slow. If you need more information, let me know. 

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Simon Riggs
Date:
On Thu, Jul 7, 2011 at 1:12 PM, David Hartveld
<David.Hartveld@mendix.com> wrote:

>>
>> On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David.Hartveld@mendix.com>
>> wrote:
>>
>> > Is there possibly a known issue with the beta, or do I have to
>> > configure my cluster differently for 9.1?
>>
>> Thanks for trying 9.1beta
>>
>> No known bugs, no differences in configuration.
>>
>> You haven't enabled any of the new 9.1 features either so they aren't likely to be
>> at issue.
>>
>> So there's something pretty badly screwed up somewhere, though that looks like
>> pilot error at the moment, sorry.
>>
>> I'd suggest starting again and see if you can get a reproduceable bug.
>> I'd be very grateful if you can narrow things down to produce a tight bug report.
>
> I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is
reproduciblyvery slow. If you need more information, let me know. 

Bug 6094 contains no additional information and there is not yet a
confirmed bug.

Your output indicates that there is a problem in your replication
setup and this is why the slave does not catch up.

This is not a performance issue. It is either a bug in replication, or
a user configuration issue. Since few things have changed in 9.1 in
this area, at the moment the balance of probability is towards user
error. If you can provide a more isolated bug report we may be able to
investigate.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
David Hartveld
Date:
> > Your output indicates that there is a problem in your replication
> > setup and this is why the slave does not catch up.
> >
> > This is not a performance issue. It is either a bug in replication, or
> > a user configuration issue. Since few things have changed in 9.1 in
> > this area, at the moment the balance of probability is towards user
> > error. If you can provide a more isolated bug report we may be able to
> investigate.
>
> Apologies for the double post, I thought to have understood that in your
> previous message.
>
> I've read the online 9.1 manual and configured the clusters based on that
> information (and on the defaults provided by debian). I've attached the
> postgresql.conf files I'm using for master and slave. Do you need other
> information from my final setup? Log files, configuration files, the SQL script fed
> to psql, which shows the slow replication...?

I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can
seethat somehow the master and slave don't properly work together: the slave attempts to send some data ('sending
write/flush/apply')(I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to
administratorcommand', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then
'couldnot send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this
providesyou with more information on what is going on. Do point me in the right direction if you need me to investigate
further.I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction,
whereyou can see the above behavior. 

Hoping that this will bring me a bit closer to a solution or a proper bug report,
David Hartveld


Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Simon Riggs
Date:
On Thu, Jul 7, 2011 at 2:59 PM, David Hartveld
<David.Hartveld@mendix.com> wrote:

> I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can
seethat somehow the master and slave don't properly work together: the slave attempts to send some data ('sending
write/flush/apply')(I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to
administratorcommand', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then
'couldnot send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this
providesyou with more information on what is going on. Do point me in the right direction if you need me to investigate
further.I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction,
whereyou can see the above behavior. 

Ah, so synchronous_standby_names is set on the standby.

Please reset that so we are operating asynchronously, then rerun tests
to see if that avoids the error. You'll probably need to fully
re-generate the standby server before doing this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Simon Riggs
Date:
On Thu, Jul 7, 2011 at 3:37 PM, David Hartveld
<david.hartveld@mendix.com> wrote:

> Other suggestions?

If speed is your concern, a little performance tuning might help,
judging from your configs.
http://www.2ndQuadrant.com/books/ or other sources will help.

I'm interested in seeing some ERROR messages from either system,
showing their sequence between master/standby. At the moment its not
clear what the first error is. Subsequent messages are less
interesting.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Tom Lane
Date:
Looking back, I notice that you built with gcc 4.6.0.  At least on Red
Hat machines, that gcc has a rather nasty optimization bug that breaks
WAL replay, with symptoms that seem to match what you have here ---
namely, the replay process quits and has to be restarted every few pages.
I'm betting Debian hasn't fixed that bug yet either and so you need this
post-beta2 patch:

http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

            regards, tom lane

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Karsten Hilbert
Date:
On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:

> Looking back, I notice that you built with gcc 4.6.0.  At least on Red
> Hat machines, that gcc has a rather nasty optimization bug that breaks
> WAL replay, with symptoms that seem to match what you have here ---
> namely, the replay process quits and has to be restarted every few pages.
> I'm betting Debian hasn't fixed that bug yet either and so you need this
> post-beta2 patch:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

Do they know about this gcc bug ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
Tom Lane
Date:
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>> I'm betting Debian hasn't fixed that bug yet either and so you need this
>> post-beta2 patch:
>>
>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

> Do they know about this gcc bug ?

Can't say about Debian in particular, but upstream gcc certainly knows
about it.
https://bugzilla.redhat.com/show_bug.cgi?id=712480
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390

            regards, tom lane

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
David Hartveld
Date:
Op 07-07-11 18:23, Tom Lane schreef:
> Karsten Hilbert<Karsten.Hilbert@gmx.net>  writes:
>> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>>> I'm betting Debian hasn't fixed that bug yet either and so you need this
>>> post-beta2 patch:
>>>
>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a
>
>> Do they know about this gcc bug ?
>
> Can't say about Debian in particular, but upstream gcc certainly knows
> about it.
> https://bugzilla.redhat.com/show_bug.cgi?id=712480
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390

I'm guessing this is probably the cause of my problem, then. I'll try to
rebuild with an earlier version of GCC, and let you know the results.

Thanks to all of you!
David

Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

From
David Hartveld
Date:
Op 08-07-11 10:19, David Hartveld schreef:
> Op 07-07-11 18:23, Tom Lane schreef:
>> Karsten Hilbert<Karsten.Hilbert@gmx.net> writes:
>>> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>>>> I'm betting Debian hasn't fixed that bug yet either and so you need
>>>> this
>>>> post-beta2 patch:
>>>>
>>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a
>>>>
>>
>>> Do they know about this gcc bug ?
>>
>> Can't say about Debian in particular, but upstream gcc certainly knows
>> about it.
>> https://bugzilla.redhat.com/show_bug.cgi?id=712480
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390
>
> I'm guessing this is probably the cause of my problem, then. I'll try to
> rebuild with an earlier version of GCC, and let you know the results.

I've rebuilt with debian gcc 4.4.5-8 and am running 9.1 now without any
of the above problems. I've reported a bug for debian postgresql-9.1.
I've also seen that the current gcc-4.6 version for debian is 4.6.1, so
that shouldn't create any problems anymore (or at least the gcc bug is
marked as applied in 4.6.1 in the gcc bugtracker).

Thanks for all your time, guys!
Greetings, David