Thread: [GENERAL] Too long startup time after each crash.
Hi. Perhaps i've sent it to wrong mail-list previously. After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40 minutessometimes. When connecting to the server (as well as in the log), the message - "the database system is starting up".Startup process consumes 0.4 - 1% CPU. Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? Technical details: Gentoo Linux 4.4.2-hardened current pg version 9.6.1 ## pg config: listen_addresses = '*' port = 5432 max_connections = 500 shared_buffers = 2048MB temp_buffers = 16MB work_mem = 64MB maintenance_work_mem = 512MB max_stack_depth = 4MB dynamic_shared_memory_type = posix wal_level = replica wal_buffers = 8MB checkpoint_completion_target = 0.85 archive_mode = on archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz' max_wal_senders = 3 effective_cache_size = 9GB log_min_duration_statement = 1500 log_timezone = 'Asia/Yekaterinburg' datestyle = 'iso, dmy' timezone = 'Asia/Yekaterinburg' lc_messages = 'ru_RU.UTF-8' lc_monetary = 'ru_RU.UTF-8' lc_numeric = 'ru_RU.UTF-8' lc_time = 'ru_RU.UTF-8' default_text_search_config = 'pg_catalog.russian' plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";' ## end of config Sorry for my eng :) -- Thanks, Vitaliy
On 12/21/2016 02:17 PM, neos@olansoft.com wrote: > Hi. > > Perhaps i've sent it to wrong mail-list previously. > > After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40 minutessometimes. When connecting to the server (as well as in the log), the message - "the database system is starting up".Startup process consumes 0.4 - 1% CPU. What upgrade and how was it done? What crash and what caused it? In particular why are you seeing multiple crashes? What is the log just before the crash? What is the log after the the "database system is starting up" part? > > Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? Do you mean you searched for this problem description and could not find it? > > Technical details: > > Gentoo Linux 4.4.2-hardened > current pg version 9.6.1 > > ## pg config: > > listen_addresses = '*' > port = 5432 > max_connections = 500 > shared_buffers = 2048MB > temp_buffers = 16MB > work_mem = 64MB > maintenance_work_mem = 512MB > max_stack_depth = 4MB > dynamic_shared_memory_type = posix > wal_level = replica > wal_buffers = 8MB > checkpoint_completion_target = 0.85 > archive_mode = on > archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz' > max_wal_senders = 3 > effective_cache_size = 9GB > log_min_duration_statement = 1500 > log_timezone = 'Asia/Yekaterinburg' > datestyle = 'iso, dmy' > timezone = 'Asia/Yekaterinburg' > lc_messages = 'ru_RU.UTF-8' > lc_monetary = 'ru_RU.UTF-8' > lc_numeric = 'ru_RU.UTF-8' > lc_time = 'ru_RU.UTF-8' > default_text_search_config = 'pg_catalog.russian' > plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";' > > ## end of config > > Sorry for my eng :) > -- Adrian Klaver adrian.klaver@aklaver.com
22.12.2016, 04:21, "Adrian Klaver" <adrian.klaver@aklaver.com>: > On 12/21/2016 02:17 PM, neos@olansoft.com wrote: >> Hi. >> >> Perhaps i've sent it to wrong mail-list previously. >> >> After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40minutes sometimes. When connecting to the server (as well as in the log), the message - "the database system is startingup". Startup process consumes 0.4 - 1% CPU. > > What upgrade and how was it done? By standard distro method (emerge in my case), and then pg_upgrade utility (according to pg docs). Upgrade was from 9.4 to9.5, and now 9.5.3 to 9.6.1. > > What crash and what caused it? > > In particular why are you seeing multiple crashes? Sorry, I did not specify a crash, i mean technical failure on the server - power failure or sometimes CPU overheating ("home"server problem). > > What is the log just before the crash? Nothing, as expected, since it's not postgresql crash. > > What is the log after the the "database system is starting up" part? First - "the database system was interrupted; last known up at..." then many times "database system is starting up" (one per connection attempt) and after 5-40 min i get - database system was not properly shut down; automatic recovery in progress...redo starts from...readyto accept connections, and, finally, everything is going well. (my logs in ru_RU locale, so do not place them here) > >> Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? > > Do you mean you searched for this problem description and could not find it? Yes, it is, my fault, Google translator is not always saving me from mistakes. > >> Technical details: >> >> Gentoo Linux 4.4.2-hardened >> current pg version 9.6.1 >> >> ## pg config: >> >> listen_addresses = '*' >> port = 5432 >> max_connections = 500 >> shared_buffers = 2048MB >> temp_buffers = 16MB >> work_mem = 64MB >> maintenance_work_mem = 512MB >> max_stack_depth = 4MB >> dynamic_shared_memory_type = posix >> wal_level = replica >> wal_buffers = 8MB >> checkpoint_completion_target = 0.85 >> archive_mode = on >> archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz' >> max_wal_senders = 3 >> effective_cache_size = 9GB >> log_min_duration_statement = 1500 >> log_timezone = 'Asia/Yekaterinburg' >> datestyle = 'iso, dmy' >> timezone = 'Asia/Yekaterinburg' >> lc_messages = 'ru_RU.UTF-8' >> lc_monetary = 'ru_RU.UTF-8' >> lc_numeric = 'ru_RU.UTF-8' >> lc_time = 'ru_RU.UTF-8' >> default_text_search_config = 'pg_catalog.russian' >> plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";' >> >> ## end of config >> >> Sorry for my eng :) > > -- > Adrian Klaver > adrian.klaver@aklaver.com
On 12/21/2016 04:16 PM, neos@olansoft.com wrote: > > > 22.12.2016, 04:21, "Adrian Klaver" <adrian.klaver@aklaver.com>: >> On 12/21/2016 02:17 PM, neos@olansoft.com wrote: >>> Hi. >>> >>> Perhaps i've sent it to wrong mail-list previously. >>> >>> After the upgrade postgresql had a strange behavior. After the crash, server launch now take a long time, up to 30-40minutes sometimes. When connecting to the server (as well as in the log), the message - "the database system is startingup". Startup process consumes 0.4 - 1% CPU. >> >> What upgrade and how was it done? > > By standard distro method (emerge in my case), and then pg_upgrade utility (according to pg docs). Upgrade was from 9.4to 9.5, and now 9.5.3 to 9.6.1. > >> >> What crash and what caused it? >> >> In particular why are you seeing multiple crashes? > > Sorry, I did not specify a crash, i mean technical failure on the server - power failure or sometimes CPU overheating ("home"server problem). > >> >> What is the log just before the crash? > > Nothing, as expected, since it's not postgresql crash. > >> >> What is the log after the the "database system is starting up" part? > > First - "the database system was interrupted; last known up at..." > > then many times "database system is starting up" (one per connection attempt) > > and after 5-40 min i get - database system was not properly shut down; automatic recovery in progress...redo starts from...readyto accept connections, and, finally, everything is going well. Alright looks like it doing the correct thing. Now if I am following you say you see the issue starting with 9.5+. As it so happens that is when checkpoint_segments was replaced with max_wal_size: https://www.postgresql.org/docs/9.5/static/release-9-5.html " Replace configuration parameter checkpoint_segments with min_wal_size and max_wal_size (Heikki Linnakangas) If you previously adjusted checkpoint_segments, the following formula will give you an approximately equivalent setting: max_wal_size = (3 * checkpoint_segments) * 16MB Note that the default setting for max_wal_size is much higher than the default checkpoint_segments used to be, so adjusting it might no longer be necessary. " If you had not changed the checkpoint_settings in your 9.4- servers it would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size is set at 1GB. In either case the default checkpoint_timeout would be 5 minutes. Did you change the checkpoint_timeout setting or is you machine generating something close to 1GB of WAL files before the 5 minutes? Either case could lead to longer start up times as Postgres would have to process more WAL files. There is also your checkpoint_completion_target = 0.85. The default is 0.5. Per the docs below: "The disadvantage of this is that prolonging checkpoints affects recovery time, because more WAL segments will need to be kept around for possible use in recovery." For more information see 9.4-: https://www.postgresql.org/docs/9.4/static/wal-configuration.html 9.5+: https://www.postgresql.org/docs/9.5/static/wal-configuration.html > > (my logs in ru_RU locale, so do not place them here) > >> >>> Tried different version 9.5.[2,3,4], 9.6.1 (9.4 works fine). On the Internet this problem is not met, perhaps the problemlies in a combination of different factors (configuration database, Linux kernel), may be my config is wrong? >> >> Do you mean you searched for this problem description and could not find it? > > Yes, it is, my fault, Google translator is not always saving me from mistakes. > >> >>> Technical details: >>> >>> Gentoo Linux 4.4.2-hardened >>> current pg version 9.6.1 >>> >>> ## pg config: >>> >>> listen_addresses = '*' >>> port = 5432 >>> max_connections = 500 >>> shared_buffers = 2048MB >>> temp_buffers = 16MB >>> work_mem = 64MB >>> maintenance_work_mem = 512MB >>> max_stack_depth = 4MB >>> dynamic_shared_memory_type = posix >>> wal_level = replica >>> wal_buffers = 8MB >>> checkpoint_completion_target = 0.85 >>> archive_mode = on >>> archive_command = 'gzip < %p > /home/pgbasebackup/xlogs/%f.gz' >>> max_wal_senders = 3 >>> effective_cache_size = 9GB >>> log_min_duration_statement = 1500 >>> log_timezone = 'Asia/Yekaterinburg' >>> datestyle = 'iso, dmy' >>> timezone = 'Asia/Yekaterinburg' >>> lc_messages = 'ru_RU.UTF-8' >>> lc_monetary = 'ru_RU.UTF-8' >>> lc_numeric = 'ru_RU.UTF-8' >>> lc_time = 'ru_RU.UTF-8' >>> default_text_search_config = 'pg_catalog.russian' >>> plperl.on_init = 'use utf8; use re; package utf8; require "utf8_heavy.pl";' >>> >>> ## end of config >>> >>> Sorry for my eng :) >> >> -- >> Adrian Klaver >> adrian.klaver@aklaver.com > -- Adrian Klaver adrian.klaver@aklaver.com
22.12.2016, 06:31, "Adrian Klaver" <adrian.klaver@aklaver.com>: > > Alright looks like it doing the correct thing. > > Now if I am following you say you see the issue starting with 9.5+. As > it so happens that is when checkpoint_segments was replaced with > max_wal_size: Yes, in 9.4 recovery process began instantly(or near that). I thought that was a bug in 9.5.1 and waited for fix. > > https://www.postgresql.org/docs/9.5/static/release-9-5.html > > " > > Replace configuration parameter checkpoint_segments with min_wal_size > and max_wal_size (Heikki Linnakangas) > > If you previously adjusted checkpoint_segments, the following formula > will give you an approximately equivalent setting: > > max_wal_size = (3 * checkpoint_segments) * 16MB > > Note that the default setting for max_wal_size is much higher than the > default checkpoint_segments used to be, so adjusting it might no longer > be necessary. > " > > If you had not changed the checkpoint_settings in your 9.4- servers it > would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size > is set at 1GB. In either case the default checkpoint_timeout would be 5 > minutes. > > Did you change the checkpoint_timeout setting or is you machine > generating something close to 1GB of WAL files before the 5 minutes? Hm... No, i have never changed checkpoint_timeout setting (and wal_size in 9.5\9.6 too). In 9.4 config i have had checkpoint_segments = 32 and checkpoint_completion_target = 0.9 In 9.3 checkpoint_segments = 8 and checkpoint_completion_target = 0.7 Generally i have about 3 to 5 WAL files per 24h (copied to destination by archive_command) > > Either case could lead to longer start up times as Postgres would have > to process more WAL files. > > There is also your checkpoint_completion_target = 0.85. The default is > 0.5. Per the docs below: > "The disadvantage of this is that prolonging checkpoints affects > recovery time, because more WAL segments will need to be kept around for > possible use in recovery." Hmm, i don't think about that. I set it to "0.5". Now i'm waiting for the next failure, or do it manually nearest night lol. > > For more information see 9.4-: > > https://www.postgresql.org/docs/9.4/static/wal-configuration.html > > 9.5+: > > https://www.postgresql.org/docs/9.5/static/wal-configuration.html > > > -- > Adrian Klaver > adrian.klaver@aklaver.com
On 12/21/2016 06:53 PM, neos@olansoft.com wrote: > 22.12.2016, 06:31, "Adrian Klaver" <adrian.klaver@aklaver.com>: >> >> Alright looks like it doing the correct thing. >> >> Now if I am following you say you see the issue starting with 9.5+. As >> it so happens that is when checkpoint_segments was replaced with >> max_wal_size: > >> If you had not changed the checkpoint_settings in your 9.4- servers it >> would have been set at 3 or roughly 48 MB of WALs. In 9.5 max_wal_size >> is set at 1GB. In either case the default checkpoint_timeout would be 5 >> minutes. >> >> Did you change the checkpoint_timeout setting or is you machine >> generating something close to 1GB of WAL files before the 5 minutes? > > Hm... No, i have never changed checkpoint_timeout setting (and wal_size in 9.5\9.6 too). > > In 9.4 config i have had checkpoint_segments = 32 and checkpoint_completion_target = 0.9 > In 9.3 checkpoint_segments = 8 and checkpoint_completion_target = 0.7 > > Generally i have about 3 to 5 WAL files per 24h (copied to destination by archive_command) So a max of 80MB over 24hrs, to me that is not enough to make a difference in recovery time. > >> >> Either case could lead to longer start up times as Postgres would have >> to process more WAL files. >> >> There is also your checkpoint_completion_target = 0.85. The default is >> 0.5. Per the docs below: >> "The disadvantage of this is that prolonging checkpoints affects >> recovery time, because more WAL segments will need to be kept around for >> possible use in recovery." > > Hmm, i don't think about that. I set it to "0.5". Now i'm waiting for the next failure, or do it manually nearest nightlol. Given your slow rate of WAL production I would not expect that to make a difference. Next time Postgres is in recovery and assuming you can catch it you might want to: ps ax to see what is running. Run iostat Check the OS system logs for any clues. > >> >> For more information see 9.4-: >> >> https://www.postgresql.org/docs/9.4/static/wal-configuration.html >> >> 9.5+: >> >> https://www.postgresql.org/docs/9.5/static/wal-configuration.html >> >> >> -- >> Adrian Klaver >> adrian.klaver@aklaver.com > -- Adrian Klaver adrian.klaver@aklaver.com