Thread: [GENERAL] Too long startup time after each crash.

[GENERAL] Too long startup time after each crash.

From
neos@olansoft.com
Date:
Hi.

Perhaps i've sent it to wrong mail-list previously.

After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40
minutessometimes. When connecting to the server (as well as in the log), the message - "the database system is starting
up".Startup process consumes 0.4 - 1% CPU. 

Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the
problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? 

Technical details:

Gentoo Linux 4.4.2-hardened
current pg version 9.6.1

## pg config:

listen_addresses = '*'
port = 5432
max_connections = 500
shared_buffers = 2048MB
temp_buffers = 16MB
work_mem = 64MB
maintenance_work_mem = 512MB
max_stack_depth = 4MB
dynamic_shared_memory_type = posix
wal_level = replica
wal_buffers = 8MB
checkpoint_completion_target = 0.85
archive_mode = on
archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz'
max_wal_senders = 3
effective_cache_size = 9GB
log_min_duration_statement = 1500
log_timezone = 'Asia/Yekaterinburg'
datestyle = 'iso, dmy'
timezone = 'Asia/Yekaterinburg'
lc_messages = 'ru_RU.UTF-8'
lc_monetary = 'ru_RU.UTF-8'
lc_numeric = 'ru_RU.UTF-8'
lc_time = 'ru_RU.UTF-8'
default_text_search_config = 'pg_catalog.russian'
plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";'

## end of config

Sorry for my eng :)

--
Thanks,

Vitaliy


Re: [GENERAL] Too long startup time after each crash.

From
Adrian Klaver
Date:
On 12/21/2016 02:17 PM, neos@olansoft.com wrote:
> Hi.
>
> Perhaps i've sent it to wrong mail-list previously.
>
> After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40
minutessometimes. When connecting to the server (as well as in the log), the message - "the database system is starting
up".Startup process consumes 0.4 - 1% CPU. 

What upgrade and how was it done?

What crash and what caused it?

In particular why are you seeing multiple crashes?

What is the log just before the crash?

What is the log after the the "database system is starting up" part?


>
> Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the
problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? 

Do you mean you searched for this problem description and could not find it?

>
> Technical details:
>
> Gentoo Linux 4.4.2-hardened
> current pg version 9.6.1
>
> ## pg config:
>
> listen_addresses = '*'
> port = 5432
> max_connections = 500
> shared_buffers = 2048MB
> temp_buffers = 16MB
> work_mem = 64MB
> maintenance_work_mem = 512MB
> max_stack_depth = 4MB
> dynamic_shared_memory_type = posix
> wal_level = replica
> wal_buffers = 8MB
> checkpoint_completion_target = 0.85
> archive_mode = on
> archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz'
> max_wal_senders = 3
> effective_cache_size = 9GB
> log_min_duration_statement = 1500
> log_timezone = 'Asia/Yekaterinburg'
> datestyle = 'iso, dmy'
> timezone = 'Asia/Yekaterinburg'
> lc_messages = 'ru_RU.UTF-8'
> lc_monetary = 'ru_RU.UTF-8'
> lc_numeric = 'ru_RU.UTF-8'
> lc_time = 'ru_RU.UTF-8'
> default_text_search_config = 'pg_catalog.russian'
> plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";'
>
> ## end of config
>
> Sorry for my eng :)
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Too long startup time after each crash.

From
neos@olansoft.com
Date:

22.12.2016, 04:21, "Adrian Klaver" <adrian.klaver@aklaver.com>:
> On 12/21/2016 02:17 PM, neos@olansoft.com wrote:
>>  Hi.
>>
>>  Perhaps i've sent it to wrong mail-list previously.
>>
>>  After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to
30-40minutes sometimes. When connecting to the server (as well as in the log), the message - "the database system is
startingup". Startup process consumes 0.4 - 1% CPU. 
>
> What upgrade and how was it done?

By standard distro method (emerge in my case), and then pg_upgrade utility (according to pg docs). Upgrade was from 9.4
to9.5, and now 9.5.3 to 9.6.1. 

>
> What crash and what caused it?
>
> In particular why are you seeing multiple crashes?

Sorry, I did not specify a crash, i mean technical failure on the server - power failure or sometimes CPU overheating
("home"server problem). 

>
> What is the log just before the crash?

Nothing, as expected, since it's not postgresql crash.

>
> What is the log after the the "database system is starting up" part?

First - "the database system was interrupted; last known up at..."

then many times "database system is starting up" (one per connection attempt)

and after 5-40 min i get - database system was not properly shut down; automatic recovery in progress...redo starts
from...readyto accept connections, and, finally, everything is going well. 

(my logs in ru_RU locale, so do not place them here)

>
>>  Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the
problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? 
>
> Do you mean you searched for this problem description and could not find it?

Yes, it is, my fault, Google translator is not always saving me from mistakes.

>
>>  Technical details:
>>
>>  Gentoo Linux 4.4.2-hardened
>>  current pg version 9.6.1
>>
>>  ## pg config:
>>
>>  listen_addresses = '*'
>>  port = 5432
>>  max_connections = 500
>>  shared_buffers = 2048MB
>>  temp_buffers = 16MB
>>  work_mem = 64MB
>>  maintenance_work_mem = 512MB
>>  max_stack_depth = 4MB
>>  dynamic_shared_memory_type = posix
>>  wal_level = replica
>>  wal_buffers = 8MB
>>  checkpoint_completion_target = 0.85
>>  archive_mode = on
>>  archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz'
>>  max_wal_senders = 3
>>  effective_cache_size = 9GB
>>  log_min_duration_statement = 1500
>>  log_timezone = 'Asia/Yekaterinburg'
>>  datestyle = 'iso, dmy'
>>  timezone = 'Asia/Yekaterinburg'
>>  lc_messages = 'ru_RU.UTF-8'
>>  lc_monetary = 'ru_RU.UTF-8'
>>  lc_numeric = 'ru_RU.UTF-8'
>>  lc_time = 'ru_RU.UTF-8'
>>  default_text_search_config = 'pg_catalog.russian'
>>  plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";'
>>
>>  ## end of config
>>
>>  Sorry for my eng :)
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com


Re: [GENERAL] Too long startup time after each crash.

From
Adrian Klaver
Date:
On 12/21/2016 04:16 PM, neos@olansoft.com wrote:
>
>
> 22.12.2016, 04:21, "Adrian Klaver" <adrian.klaver@aklaver.com>:
>> On 12/21/2016 02:17 PM, neos@olansoft.com wrote:
>>>  Hi.
>>>
>>>  Perhaps i've sent it to wrong mail-list previously.
>>>
>>>  After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to
30-40minutes sometimes. When connecting to the server (as well as in the log), the message - "the database system is
startingup". Startup process consumes 0.4 - 1% CPU. 
>>
>> What upgrade and how was it done?
>
> By standard distro method (emerge in my case), and then pg_upgrade utility (according to pg docs). Upgrade was from
9.4to 9.5, and now 9.5.3 to 9.6.1. 
>
>>
>> What crash and what caused it?
>>
>> In particular why are you seeing multiple crashes?
>
> Sorry, I did not specify a crash, i mean technical failure on the server - power failure or sometimes CPU overheating
("home"server problem). 
>
>>
>> What is the log just before the crash?
>
> Nothing, as expected, since it's not postgresql crash.
>
>>
>> What is the log after the the "database system is starting up" part?
>
> First - "the database system was interrupted; last known up at..."
>
> then many times "database system is starting up" (one per connection attempt)
>
> and after 5-40 min i get - database system was not properly shut down; automatic recovery in progress...redo starts
from...readyto accept connections, and, finally, everything is going well. 

Alright looks like it doing the correct thing.

Now if I am following you say you see the issue starting with 9.5+. As
it so happens that is when checkpoint_segments was replaced with
max_wal_size:

https://www.postgresql.org/docs/9.5/static/release-9-5.html

"

Replace configuration parameter checkpoint_segments with min_wal_size
and max_wal_size (Heikki Linnakangas)

If you previously adjusted checkpoint_segments, the following formula
will give you an approximately equivalent setting:

max_wal_size = (3 * checkpoint_segments) * 16MB

Note that the default setting for max_wal_size is much higher than the
default checkpoint_segments used to be, so adjusting it might no longer
be necessary.
"

If you had not changed the checkpoint_settings in your 9.4-  servers it
would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size
is set at 1GB. In either case the default  checkpoint_timeout would be 5
minutes.

Did you change the checkpoint_timeout setting or is you machine
generating something close to 1GB of WAL files before the 5 minutes?

Either case could lead to longer start up times as Postgres would have
to process more WAL files.

There is also your checkpoint_completion_target = 0.85. The default is
0.5. Per the docs below:
"The disadvantage of this is that prolonging checkpoints affects
recovery time, because more WAL segments will need to be kept around for
possible use in recovery."

For more information see 9.4-:

https://www.postgresql.org/docs/9.4/static/wal-configuration.html

9.5+:

https://www.postgresql.org/docs/9.5/static/wal-configuration.html

>
> (my logs in ru_RU locale, so do not place them here)
>
>>
>>>  Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the
problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? 
>>
>> Do you mean you searched for this problem description and could not find it?
>
> Yes, it is, my fault, Google translator is not always saving me from mistakes.
>
>>
>>>  Technical details:
>>>
>>>  Gentoo Linux 4.4.2-hardened
>>>  current pg version 9.6.1
>>>
>>>  ## pg config:
>>>
>>>  listen_addresses = '*'
>>>  port = 5432
>>>  max_connections = 500
>>>  shared_buffers = 2048MB
>>>  temp_buffers = 16MB
>>>  work_mem = 64MB
>>>  maintenance_work_mem = 512MB
>>>  max_stack_depth = 4MB
>>>  dynamic_shared_memory_type = posix
>>>  wal_level = replica
>>>  wal_buffers = 8MB
>>>  checkpoint_completion_target = 0.85
>>>  archive_mode = on
>>>  archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz'
>>>  max_wal_senders = 3
>>>  effective_cache_size = 9GB
>>>  log_min_duration_statement = 1500
>>>  log_timezone = 'Asia/Yekaterinburg'
>>>  datestyle = 'iso, dmy'
>>>  timezone = 'Asia/Yekaterinburg'
>>>  lc_messages = 'ru_RU.UTF-8'
>>>  lc_monetary = 'ru_RU.UTF-8'
>>>  lc_numeric = 'ru_RU.UTF-8'
>>>  lc_time = 'ru_RU.UTF-8'
>>>  default_text_search_config = 'pg_catalog.russian'
>>>  plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";'
>>>
>>>  ## end of config
>>>
>>>  Sorry for my eng :)
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Too long startup time after each crash.

From
neos@olansoft.com
Date:
22.12.2016, 06:31, "Adrian Klaver" <adrian.klaver@aklaver.com>:
>
> Alright looks like it doing the correct thing.
>
> Now if I am following you say you see the issue starting with 9.5+. As
> it so happens that is when checkpoint_segments was replaced with
> max_wal_size:

Yes, in 9.4 recovery process began instantly(or near that). I thought that was a bug in 9.5.1 and waited for fix.

>
> https://www.postgresql.org/docs/9.5/static/release-9-5.html
>
> "
>
> Replace configuration parameter checkpoint_segments with min_wal_size
> and max_wal_size (Heikki Linnakangas)
>
> If you previously adjusted checkpoint_segments, the following formula
> will give you an approximately equivalent setting:
>
> max_wal_size = (3 * checkpoint_segments) * 16MB
>
> Note that the default setting for max_wal_size is much higher than the
> default checkpoint_segments used to be, so adjusting it might no longer
> be necessary.
> "
>
> If you had not changed the checkpoint_settings in your 9.4- servers it
> would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size
> is set at 1GB. In either case the default checkpoint_timeout would be 5
> minutes.
>
> Did you change the checkpoint_timeout setting or is you machine
> generating something close to 1GB of WAL files before the 5 minutes?

Hm... No, i have never changed checkpoint_timeout setting (and wal_size in 9.5\9.6 too).

In 9.4 config i have had checkpoint_segments = 32 and checkpoint_completion_target = 0.9
In 9.3 checkpoint_segments = 8 and checkpoint_completion_target = 0.7

Generally i have about 3 to 5 WAL files per 24h (copied to destination by archive_command)

>
> Either case could lead to longer start up times as Postgres would have
> to process more WAL files.
>
> There is also your checkpoint_completion_target = 0.85. The default is
> 0.5. Per the docs below:
> "The disadvantage of this is that prolonging checkpoints affects
> recovery time, because more WAL segments will need to be kept around for
> possible use in recovery."

Hmm, i don't think about that. I set it to "0.5". Now i'm waiting for the next failure, or do it manually nearest night
lol.

>
> For more information see 9.4-:
>
> https://www.postgresql.org/docs/9.4/static/wal-configuration.html
>
> 9.5+:
>
> https://www.postgresql.org/docs/9.5/static/wal-configuration.html
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com


Re: [GENERAL] Too long startup time after each crash.

From
Adrian Klaver
Date:
On 12/21/2016 06:53 PM, neos@olansoft.com wrote:
> 22.12.2016, 06:31, "Adrian Klaver" <adrian.klaver@aklaver.com>:
>>
>> Alright looks like it doing the correct thing.
>>
>> Now if I am following you say you see the issue starting with 9.5+. As
>> it so happens that is when checkpoint_segments was replaced with
>> max_wal_size:
>


>> If you had not changed the checkpoint_settings in your 9.4- servers it
>> would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size
>> is set at 1GB. In either case the default checkpoint_timeout would be 5
>> minutes.
>>
>> Did you change the checkpoint_timeout setting or is you machine
>> generating something close to 1GB of WAL files before the 5 minutes?
>
> Hm... No, i have never changed checkpoint_timeout setting (and wal_size in 9.5\9.6 too).
>
> In 9.4 config i have had checkpoint_segments = 32 and checkpoint_completion_target = 0.9
> In 9.3 checkpoint_segments = 8 and checkpoint_completion_target = 0.7
>
> Generally i have about 3 to 5 WAL files per 24h (copied to destination by archive_command)

So a max of 80MB over 24hrs, to me that is not enough to make a
difference in recovery time.

>
>>
>> Either case could lead to longer start up times as Postgres would have
>> to process more WAL files.
>>
>> There is also your checkpoint_completion_target = 0.85. The default is
>> 0.5. Per the docs below:
>> "The disadvantage of this is that prolonging checkpoints affects
>> recovery time, because more WAL segments will need to be kept around for
>> possible use in recovery."
>
> Hmm, i don't think about that. I set it to "0.5". Now i'm waiting for the next failure, or do it manually nearest
nightlol. 

Given your slow rate of WAL production I would not expect that to make a
difference.

Next time Postgres is in recovery and assuming you can catch it you
might want to:

ps ax to see what is running.

Run iostat

Check the OS system logs for any clues.

>
>>
>> For more information see 9.4-:
>>
>> https://www.postgresql.org/docs/9.4/static/wal-configuration.html
>>
>> 9.5+:
>>
>> https://www.postgresql.org/docs/9.5/static/wal-configuration.html
>>
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com