Thread: Should a DB vacuum use up a lot of space ?
Hi all,
We’ve got a 16TB database that’s run up against the wrap-around tx id issue (despite running auto-vacuum):
ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
HINT: Stop the postmaster and use a standalone backend to vacuum that database.
The message says to VACUUM the whole database, so we’ve dropped down to single-user mode and issued a “VACUUM VERBOSE” command on the backend.
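For reference, the invocation was along these lines (the data directory path here is a placeholder for our setup):

    $ postgres --single -D /path/to/data public
    backend> VACUUM VERBOSE;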
1) Unlike other times I’ve run vacuum verbose, I’m not getting any logging message at all (for the past 9 hours)
2) File system usage is going steadily up (3TB over the past 9 hours). Why is that ? I didn’t expect additional disk usage and we’re going to run out of disk space if this keeps up.
3) Was a database-wide vacuum required or could I have just figured out which tables needed to be vacuumed and vacuumed those (my understanding was no), for example using http://dba.stackexchange.com/a/35234 and a query such as:

    SELECT c.oid::regclass as table_name,
           greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
    FROM pg_class c
    LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
    WHERE c.relkind IN ('r', 'm')
    order by 2 desc limit 10;
4) Assuming I could have just identified individual tables that required vacuuming, can I CTRL-C the current vacuum process without running into issues ?
Thanks
Philippe Girolami
On 08/05/2016 11:57 PM, Philippe Girolami wrote:
> Hi all,
>
> We’ve got a 16TB database that’s run up against the wrap-around tx id issue (despite running auto-vacuum):

What Postgres version?

> ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> HINT: Stop the postmaster and use a standalone backend to vacuum that database.
>
> The message says to VACUUM the whole database, so we’ve dropped down to single-user mode and issued a “VACUUM VERBOSE” command on the backend.
>
> 1) Unlike other times I’ve run vacuum verbose, I’m not getting any logging message at all (for the past 9 hours)

In single user mode you need to specify the log file on the command line:

https://www.postgresql.org/docs/9.5/static/app-postgres.html

-r filename
    Send all server log output to filename. This option is only honored when supplied as a command-line option.

The file will show up in the $DATA directory.

> 2) File system usage is going steadily up (3TB over the past 9 hours). Why is that ? I didn’t expect additional disk usage and we’re going to run out of disk space if this keeps up.

What was the exact command you gave?

> 3) Was a database-wide vacuum required or could I have just figured out which tables needed to be vacuumed and vacuumed those (my understanding was no), for example using http://dba.stackexchange.com/a/35234 and a query such as
> SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by 2 desc limit 10;

See here:
https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

> 4) Assuming I could have just identified individual tables that required vacuuming, can I CTRL-C the current vacuum process without running into issues ?

Running a test on a small database here, a Ctrl-D in the single user session closed it down successfully while running a VACUUM. Seems it comes down to which is the bigger issue, running out of disk space or having to redo the VACUUM.

> Thanks
> Philippe Girolami

--
Adrian Klaver
adrian.klaver@aklaver.com
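Concretely, Adrian's -r suggestion amounts to adding the option to the standalone invocation, presumably something like this (data directory path is again a placeholder; the log file ends up under $DATA):

    $ postgres --single -D /path/to/data -r vacuum.log public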
Philippe Girolami <philippe.girolami@mosaik.com> writes:
> We’ve got a 16TB database that’s run up against the wrap-around tx id issue (despite running auto-vacuum):
> ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> The message says to VACUUM the whole database, so we’ve dropped down to single-user mode and issued a “VACUUM VERBOSE” command on the backend.
> 2) File system usage is going steadily up (3TB over the past 9 hours). Why is that ?

Probably because there's no background process to issue auto-checkpoints and thereby recover WAL space. I'd try doing a control-C on the VACUUM, then a CHECKPOINT, then start over. You might have to vacuum large tables individually and checkpoint after each one.

regards, tom lane
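In the standalone backend, Tom's per-table suggestion would look roughly like this (the table names are stand-ins, not from the thread):

    backend> VACUUM VERBOSE some_big_table;     -- hypothetical table name
    backend> CHECKPOINT;                        -- allows the WAL accumulated by the vacuum to be recycled
    backend> VACUUM VERBOSE another_big_table;  -- hypothetical table name
    backend> CHECKPOINT;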
Thanks to Tom & Adrian, here’s what happened (my version was 9.1, sorry I forgot to mention it):

1) 10 hours after my email, the VACUUM had used up about 3.5TB but had stopped using up more disk space; it was now “simply” reading data from the file system
2) I attempted to interrupt using CTRL-D to no avail, so I interrupted with CTRL-C. That stopped it with a clean message (but did not relinquish filesystem space)
3) I exited the backend successfully using CTRL-D and relaunched it with the additional “-r” command line argument
4) I ran the query to see which tables were the “oldest” and did not recognize the ones from before I started the vacuuming (encouraging!)
5) I ran CHECKPOINT on the backend and got all the disk space back
6) I realized that the message regarding wraparound was no longer an ERROR but a WARNING, so I was able to restart postgres “normally”
7) I ran a query based on my previous query to build VACUUM VERBOSE commands on the tables with the oldest transaction ids, wrote it to a text file and then executed that file. I now have tens of millions of transactions back and can restart my server. I’ll do the rest of the VACUUM maintenance during low-load periods.

Cheers
Philippe
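For step 7, the generating query presumably looked something like this (the LIMIT and the output destination are whatever suits; this mirrors the COPY statement that shows up in the logs later in the thread):

    COPY (
      SELECT 'VACUUM VERBOSE ' || c.oid::regclass || ';--' as _command,  -- ';--' so the age value that follows becomes a comment when the file is replayed
             greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
      FROM pg_class c
      LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
      WHERE c.relkind IN ('r', 'm')
      ORDER BY 2 DESC
      LIMIT 1000
    ) TO STDOUT (FORMAT text);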
On 08/06/2016 12:01 PM, Philippe Girolami wrote:
> Thanks to Tom & Adrian, here’s what happened (my version was 9.1, sorry I forgot to mention it):
>
> 1) 10 hours after my email, the VACUUM had used up about 3.5TB but had stopped using up more disk space; it was now “simply” reading data from the file system
> 2) I attempted to interrupt using CTRL-D to no avail, so I interrupted with CTRL-C. That stopped it with a clean message (but did not relinquish filesystem space)
> 3) I exited the backend successfully using CTRL-D and relaunched it with the additional “-r” command line argument
> 4) I ran the query to see which tables were the “oldest” and did not recognize the ones from before I started the vacuuming (encouraging!)
> 5) I ran CHECKPOINT on the backend and got all the disk space back
> 6) I realized that the message regarding wraparound was no longer an ERROR but a WARNING, so I was able to restart postgres “normally”
> 7) I ran a query based on my previous query to build VACUUM VERBOSE commands on the tables with the oldest transaction ids, wrote it to a text file and then executed that file. I now have tens of millions of transactions back and can restart my server. I’ll do the rest of the VACUUM maintenance during low-load periods.

Thanks for the feedback, it is nice to 'close the loop' on an issue.

> Cheers
> Philippe

--
Adrian Klaver
adrian.klaver@aklaver.com
@Adrian, no problem ! I’m sure someone else will run into this and have the same questions; hopefully they’ll find the answers.

I am seeing something weird though (again, this is v9.1): after my database became usable again, I started getting the 10M warning on template0. So I made it connectable and ran VACUUM FREEZE on it and made it unconnectable again. That resolved the warning.

However, I see the “age” keeps increasing on that database as I ran queries on my own db. Yesterday the age was 32 and now it’s already 77933902.

Is that to be expected ? I didn’t expect it.

Thanks
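Making template0 connectable and then locking it down again is normally done by toggling pg_database.datallowconn as a superuser, presumably something like:

    UPDATE pg_database SET datallowconn = true WHERE datname = 'template0';
    -- connect to template0 and run: VACUUM FREEZE;
    UPDATE pg_database SET datallowconn = false WHERE datname = 'template0';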
On 08/07/2016 12:32 AM, Philippe Girolami wrote:
> @Adrian, no problem ! I’m sure someone else will run into this and have the same questions; hopefully they’ll find the answers.
>
> I am seeing something weird though (again, this is v9.1): after my database became usable again, I started getting the 10M warning on template0. So I made it connectable and ran VACUUM FREEZE on it and made it unconnectable again. That resolved the warning.
>
> However, I see the “age” keeps increasing on that database as I ran queries on my own db. Yesterday the age was 32 and now it’s already 77933902.

Just to be sure, you are talking about template0?

> Is that to be expected ? I didn’t expect it.

As I understand it:

1) xid's are global to the cluster.
2) age(xid) measures the difference between the latest global xid and whatever xid you supply it.
3) age(datfrozenxid) measures the difference between the minimum value for the table frozen ids in a particular database and the latest global xid.
4) template0 has a datfrozenxid so there is something for age(xid) to compute; it just does not mean anything as long as template0 is really a read-only database. In other words template0 is not actually contributing any transactions to the consumption of the global store of xids.

> Thanks

--
Adrian Klaver
adrian.klaver@aklaver.com
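Point 3 is easy to watch directly; the per-database numbers come straight from pg_database (this is the same query that appears later in the thread):

    SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;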
>> I am seeing something weird though (again, this is v9.1): after my database became usable again, I started getting the 10M warning on template0. So I made it connectable and ran VACUUM FREEZE on it and made it unconnectable again. That resolved the warning.
>>
>> However, I see the “age” keeps increasing on that database as I ran queries on my own db. Yesterday the age was 32 and now it’s already 77933902.

> Just to be sure, you are talking about template0?

Yes, I am

>> Is that to be expected ? I didn’t expect it.

> As I understand it:
>
> 1) xid's are global to the cluster.
> 2) age(xid) measures the difference between the latest global xid and whatever xid you supply it.
> 3) age(datfrozenxid) measures the difference between the minimum value for the table frozen ids in a particular database and the latest global xid.
> 4) template0 has a datfrozenxid so there is something for age(xid) to compute; it just does not mean anything as long as template0 is really a read-only database. In other words template0 is not actually contributing any transactions to the consumption of the global store of xids.

Yes, I understand. I’m just worried that, since I see the WARNING at the 100M mark, when it gets to the 1M mark on that database it will shut down the cluster.

More weirdness this afternoon: the wraparound ERROR showed up again even though I have trouble believing I burned through so many transactions in under a day. But let’s assume I did; here is what I noticed:

1) I vacuumed all the other databases. For every one of those, the age went down to 50M instead of zero. Is that normal ?
2) The only database it didn’t work on was template0 (the age did not change). It did work on template1

Should I suspect something fishy going on ?

Thanks
> More weirdness this afternoon: the wraparound ERROR showed up again even though I have trouble believing I burned through so many transactions in under a day. But let’s assume I did; here is what I noticed:
>
> 1) I vacuumed all the other databases. For every one of those, the age went down to 50M instead of zero. Is that normal ?
> 2) The only database it didn’t work on was template0 (the age did not change). It did work on template1
>
> Should I suspect something fishy going on ?
do you have any long running 'idle in transaction' sessions? these would show up in pg_stat_activity; you want to look at now()-xact_start to see the age of the oldest of these. no tuples newer than the oldest xact_start can be vacuumed.
select * from pg_stat_activity where state='idle in transaction';
select count(*), max(now()-xact_start) from pg_stat_activity where state='idle in transaction';
-- john r pierce, recycling bits in santa cruz
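Note that pg_stat_activity.state only exists in 9.2 and later; on the 9.1 server in this thread the equivalent check would presumably filter on current_query instead, something like:

    select count(*), max(now()-xact_start) from pg_stat_activity where current_query = '<IDLE> in transaction';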
On 08/07/2016 10:26 AM, Philippe Girolami wrote:
>>> I am seeing something weird though (again, this is v9.1): after my database became usable again, I started getting the 10M warning on template0. So I made it connectable and ran VACUUM FREEZE on it and made it unconnectable again. That resolved the warning.
>>>
>>> However, I see the “age” keeps increasing on that database as I ran queries on my own db. Yesterday the age was 32 and now it’s already 77933902.
>> Just to be sure, you are talking about template0?
> Yes, I am
>
>>> Is that to be expected ? I didn’t expect it.
>> As I understand it:
>>
>> 1) xid's are global to the cluster.
>> 2) age(xid) measures the difference between the latest global xid and whatever xid you supply it.
>> 3) age(datfrozenxid) measures the difference between the minimum value for the table frozen ids in a particular database and the latest global xid.
>> 4) template0 has a datfrozenxid so there is something for age(xid) to compute; it just does not mean anything as long as template0 is really a read-only database. In other words template0 is not actually contributing any transactions to the consumption of the global store of xids.
> Yes, I understand. I’m just worried that, since I see the WARNING at the 100M mark, when it gets to the 1M mark on that database it will shut down the cluster.
>
> More weirdness this afternoon: the wraparound ERROR showed up again even though I have trouble believing I burned through so many transactions in under a day. But let’s assume I did; here is what I noticed:
>
> 1) I vacuumed all the other databases. For every one of those, the age went down to 50M instead of zero. Is that normal ?
> 2) The only database it didn’t work on was template0 (the age did not change). It did work on template1

That is expected, as template0 is read-only and so VACUUM will not work on it.

> Should I suspect something fishy going on ?

Not sure without more information.

1) Can you be specific about your database references? 'That database' is open-ended.

2) Show the actual numbers from your xid queries. Both the raw values and the age() transformed ones.

3) What are your configuration parameters for the variables mentioned in the section below?:
https://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

4) If you want to get an idea of how fast xid's are being created, a quick and dirty way is from here:
https://www.postgresql.org/docs/9.1/static/functions-info.html

txid_current() bigint get current transaction ID

Now if you do select txid_current() outside a transaction it will create an xid on its own; still, if you repeat it over some interval of time you will get an idea of how fast the server is going through xid's.

> Thanks

--
Adrian Klaver
adrian.klaver@aklaver.com
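A rough sketch of that measurement (the interval is arbitrary, and each call itself consumes one xid):

    SELECT txid_current();   -- note the value
    -- wait a fixed interval, e.g. 60 seconds, then:
    SELECT txid_current();   -- the difference divided by the interval approximates the xid consumption rate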
> That is expected, as template0 is read-only and so VACUUM will not work on it.

Isn’t template1 the same ? I’m not seeing that behavior on that one.

>> Should I suspect something fishy going on ?
> Not sure without more information.
> 1) Can you be specific about your database references? 'That database' is open-ended.

“That database” = the database that’s been causing wrap-around problems since yesterday. It’s called “public”

> 2) Show the actual numbers from your xid queries. Both the raw values and the age() transformed ones.

backend> SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;
 1: datname (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "template1" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003766" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000394" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "postgres" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003765" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000395" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "public" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "1717520404" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "2146483756" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "xxxx" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003760" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000400" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "osmtest" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003762" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000398" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "template0" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3732096533" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "131907627" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "drupal_prod" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003758" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000402" (typeid = 23, len = 4, typmod = -1, byval = t)
----
 1: datname = "anta" (typeid = 19, len = 64, typmod = -1, byval = f)
 2: datfrozenxid = "3814003756" (typeid = 28, len = 4, typmod = -1, byval = t)
 3: age = "50000404" (typeid = 23, len = 4, typmod = -1, byval = t)
----

> 3) What are your configuration parameters for the variables mentioned in the section below?:
> https://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

Should be the default values, I’ve never changed them.

backend> show vacuum_freeze_min_age
 1: vacuum_freeze_min_age (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: vacuum_freeze_min_age = "50000000" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show vacuum_freeze_table_age
 1: vacuum_freeze_table_age (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: vacuum_freeze_table_age = "150000000" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_freeze_max_age
 1: autovacuum_freeze_max_age (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_freeze_max_age = "200000000" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_vacuum_threshold
 1: autovacuum_vacuum_threshold (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_vacuum_threshold = "50" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_max_workers
 1: autovacuum_max_workers (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_max_workers = "3" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_vacuum_scale_factor
 1: autovacuum_vacuum_scale_factor (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_vacuum_scale_factor = "0.2" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_vacuum_cost_delay
 1: autovacuum_vacuum_cost_delay (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_vacuum_cost_delay = "20ms" (typeid = 25, len = -1, typmod = -1, byval = f)

backend> show autovacuum_vacuum_cost_limit
 1: autovacuum_vacuum_cost_limit (typeid = 25, len = -1, typmod = -1, byval = f)
----
 1: autovacuum_vacuum_cost_limit = "-1" (typeid = 25, len = -1, typmod = -1, byval = f)

> 4) If you want to get an idea of how fast xid's are being created, a quick and dirty way is from here:
> https://www.postgresql.org/docs/9.1/static/functions-info.html
> txid_current() bigint get current transaction ID
> Now if you do select txid_current() outside a transaction it will create an xid on its own; still, if you repeat it over some interval of time you will get an idea of how fast the server is going through xid's.

Well I’m now at a point where that’s not even possible; I have consistently run into the following:
• I vacuum enough tables to get back a couple dozen transactions below the 1M mark
• I exit single-user mode and relaunch the server so I can keep on vacuuming the oldest tables first by copying
• But I don’t even have time to launch my script: as soon as I launch the server, the auto-vacuum daemon kicks in and burns through the transactions. I guess it’s not smart enough to start with the oldest tables ?

So it looks like I’m going to have to bite the bullet and really vacuum the whole database instead of just the oldest tables first, which will impact our production pipelines.
On 08/07/2016 02:55 PM, Philippe Girolami wrote:
>> That is expected, as template0 is read-only and so VACUUM will not work on it.
> Isn’t template1 the same ? I’m not seeing that behavior on that one.
>
>>> Should I suspect something fishy going on ?
>> Not sure without more information.
>> 1) Can you be specific about your database references? 'That database' is open-ended.
> “That database” = the database that’s been causing wrap-around problems since yesterday. It’s called “public”
>
>> 2) Show the actual numbers from your xid queries. Both the raw values and the age() transformed ones.
> backend> SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;
>  1: datname (typeid = 19, len = 64, typmod = -1, byval = f)
>  2: datfrozenxid (typeid = 28, len = 4, typmod = -1, byval = t)
>  3: age (typeid = 23, len = 4, typmod = -1, byval = t)
> ----
>  1: datname = "template1"  2: datfrozenxid = "3814003766"  3: age = "50000394"
> ----
>  1: datname = "postgres"  2: datfrozenxid = "3814003765"  3: age = "50000395"
> ----
>  1: datname = "public"  2: datfrozenxid = "1717520404"  3: age = "2146483756"
> ----
>  1: datname = "xxxx"  2: datfrozenxid = "3814003760"  3: age = "50000400"
> ----
>  1: datname = "osmtest"  2: datfrozenxid = "3814003762"  3: age = "50000398"
> ----
>  1: datname = "template0"  2: datfrozenxid = "3732096533"  3: age = "131907627"
> ----
>  1: datname = "drupal_prod"  2: datfrozenxid = "3814003758"  3: age = "50000402"
> ----
>  1: datname = "anta"  2: datfrozenxid = "3814003756"  3: age = "50000404"
> ----
>
>> 3) What are your configuration parameters for the variables mentioned in the section below?:
>> https://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
> Should be the default values, I’ve never changed them.
>
> backend> show vacuum_freeze_min_age
>  1: vacuum_freeze_min_age = "50000000"
> backend> show vacuum_freeze_table_age
>  1: vacuum_freeze_table_age = "150000000"
> backend> show autovacuum_freeze_max_age
>  1: autovacuum_freeze_max_age = "200000000"
> backend> show autovacuum_vacuum_threshold
>  1: autovacuum_vacuum_threshold = "50"
> backend> show autovacuum_max_workers
>  1: autovacuum_max_workers = "3"
> backend> show autovacuum_vacuum_scale_factor
>  1: autovacuum_vacuum_scale_factor = "0.2"
> backend> show autovacuum_vacuum_cost_delay
>  1: autovacuum_vacuum_cost_delay = "20ms"
> backend> show autovacuum_vacuum_cost_limit
>  1: autovacuum_vacuum_cost_limit = "-1"
>
>> 4) If you want to get an idea of how fast xid's are being created, a quick and dirty way is from here:
>> https://www.postgresql.org/docs/9.1/static/functions-info.html
>> txid_current() bigint get current transaction ID
>> Now if you do select txid_current() outside a transaction it will create an xid on its own; still, if you repeat it over some interval of time you will get an idea of how fast the server is going through xid's.
> Well I’m now at a point where that’s not even possible; I have consistently run into the following:
> • I vacuum enough tables to get back a couple dozen transactions below the 1M mark

So you are VACUUMing the lesser 'younger' tables?

> • I exit single-user mode and relaunch the server so I can keep on vacuuming the oldest tables first by copying
> • But I don’t even have time to launch my script: as soon as I launch the server, the auto-vacuum daemon kicks in and burns through the transactions. I guess it’s not smart enough to start with the oldest tables ?

Not understanding; 'the auto-vacuum daemon kicks in and burns through the transactions'.
Are you saying it is reclaiming xids for you or using them?
If reclaiming, that is what it is supposed to do and is a good thing.
Or am I misunderstanding?

> So it looks like I’m going to have to bite the bullet and really vacuum the whole database instead of just the oldest tables first, which will impact our production pipelines.

Or VACUUM the most heavily used tables in the database.

The next question to be asked is; what is creating the transactions and is the transaction rate 'normal', or is there a possibility you have a rogue process or rogue processes in action?

--
Adrian Klaver
adrian.klaver@aklaver.com
> So you are VACUUMing the lesser 'younger' tables?

I VACUUM those with the highest age:

    SELECT age, array_agg(table_name)
    FROM (SELECT c.oid::regclass as table_name,
                 greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
          FROM pg_class c
          LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
          WHERE c.relkind IN ('r', 'm')
          order by 2 desc limit 1000) tt
    group by age order by age desc

>> • I exit single-user mode and relaunch the server so I can keep on vacuuming the oldest tables first by copying
>> • But I don’t even have time to launch my script: as soon as I launch the server, the auto-vacuum daemon kicks in and burns through the transactions. I guess it’s not smart enough to start with the oldest tables ?
> Not understanding; 'the auto-vacuum daemon kicks in and burns through the transactions'.
> Are you saying it is reclaiming xids for you or using them?
> If reclaiming, that is what it is supposed to do and is a good thing.
> Or am I misunderstanding?

Here is what the logs show when I do what I described above.

1) I got 7 transactions back in single user mode

Aug 7 23:40:57 p2 postgres[30376]: [5-1] 2016-08-07 23:40:57 CEST WARNING: database "public" must be vacuumed within 999893 transactions
Aug 7 23:40:57 p2 postgres[30376]: [5-2] 2016-08-07 23:40:57 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.

2) I exit single user mode and restart the database

Aug 7 23:41:40 p2 postgres[15457]: [1-1] 2016-08-07 23:41:40 CEST LOG: database system was shut down at 2016-08-07 23:41:32 CEST
Aug 7 23:41:40 p2 postgres[15458]: [1-1] 2016-08-07 23:41:40 CEST LOG: incomplete startup packet
Aug 7 23:41:40 p2 postgres[15459]: [1-1] 2016-08-07 23:41:40 CEST FATAL: the database system is starting up
Aug 7 23:41:40 p2 postgres[15457]: [2-1] 2016-08-07 23:41:40 CEST WARNING: database with OID 16385 must be vacuumed within 999892 transactions
Aug 7 23:41:40 p2 postgres[15457]: [2-2] 2016-08-07 23:41:40 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
Aug 7 23:41:40 p2 postgres[15457]: [2-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:40 p2 postgres[15462]: [1-1] 2016-08-07 23:41:40 CEST LOG: autovacuum launcher started
Aug 7 23:41:40 p2 postgres[15447]: [1-1] 2016-08-07 23:41:40 CEST LOG: database system is ready to accept connections

3) but I don’t even have time to run a query to see the state of the databases

Aug 7 23:41:52 p2 postgres[15487]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:52 p2 postgres[15487]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:52 p2 postgres[15487]: [2-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:52 p2 postgres[15487]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: SELECT datname, age(datfrozenxid) FROM pg_database
Aug 7 23:41:52 p2 postgres[15497]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:52 p2 postgres[15497]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:52 p2 postgres[15497]: [2-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:52 p2 postgres[15497]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: COPY ( SELECT 'VACUUM VERBOSE ' || c.oid::regclass || ';--' as _command, greatest ( age ( c.relfrozenxid ) , age ( t.relfrozenxid ) ) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ( 'r' , 'm' ) order by 2 desc limit 1000 ) TO STDOUT (FORMAT text)
Aug 7 23:41:52 p2 postgres[15518]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:52 p2 postgres[15518]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:52 p2 postgres[15518]: [2-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:52 p2 postgres[15518]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: SELECT datname, age(datfrozenxid) FROM pg_database
Aug 7 23:41:53 p2 rsyslogd-2177: imuxsock lost 3512 messages from pid 15476 due to rate-limiting
Aug 7 23:41:53 p2 postgres[15476]: [2-1] 2016-08-07 23:41:53 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:53 p2 postgres[15476]: [2-2] 2016-08-07 23:41:53 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:53 p2 postgres[15476]: [2-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:53 p2 postgres[15476]: [2-4] 2016-08-07 23:41:53 CEST CONTEXT: automatic analyze of table "public.public.aaaaaa"

4) lots of autovacuum failures

Aug 7 23:41:55 p2 postgres[15476]: [3-1] 2016-08-07 23:41:55 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:55 p2 postgres[15476]: [3-2] 2016-08-07 23:41:55 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:55 p2 postgres[15476]: [3-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:55 p2 postgres[15476]: [3-4] 2016-08-07 23:41:55 CEST CONTEXT: automatic analyze of table "public.public.aaaaab"
Aug 7 23:41:56 p2 postgres[15476]: [4-1] 2016-08-07 23:41:56 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:56 p2 postgres[15476]: [4-2] 2016-08-07 23:41:56 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:56 p2 postgres[15476]: [4-3] #011You might also need to commit or roll back old prepared transactions.
Aug 7 23:41:56 p2 postgres[15476]: [4-4] 2016-08-07 23:41:56 CEST CONTEXT: automatic analyze of table "public.public.aaaaac"
Aug 7 23:41:56 p2 postgres[15476]: [5-1] 2016-08-07 23:41:56 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
Aug 7 23:41:56 p2 postgres[15476]: [5-2] 2016-08-07 23:41:56 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
Aug 7 23:41:56 p2 postgres[15476]: [5-3] #011You might also need to commit or roll back old prepared transactions.
(…)

>> So it looks like I’m going to have to bite the bullet and really vacuum the whole database instead of just the oldest tables first, which will impact our production pipelines.
> Or VACUUM the most heavily used tables in the database.

Why would that help ? Aren’t the tables with the highest age the problem ?

> The next question to be asked is; what is creating the transactions and is the transaction rate 'normal', or is there a possibility you have a rogue process or rogue processes in action?

That’s always a possibility but I can’t think of what that would be.
I was guessing that the auto-vacuum wasn’t vacuuming the tables with the highest age first. Isn’t that a possibility ?

Thanks
On 08/08/2016 12:08 AM, Philippe Girolami wrote:
>> So you are VACUUMing the lesser 'younger' tables?
> I VACUUM those with the highest age:
> SELECT age, array_agg(table_name) FROM (SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by 2 desc limit 1000) tt group by age order by age desc
>
>>> • I exit single-user mode and relaunch the server so I can keep on vacuuming the oldest tables first by copying
>>> • But I don’t even have time to launch my script: as soon as I launch the server, the auto-vacuum daemon kicks in and burns through the transactions. I guess it’s not smart enough to start with the oldest tables ?
>> Not understanding; 'the auto-vacuum daemon kicks in and burns through the transactions'.
>> Are you saying it is reclaiming xids for you or using them?
>> If reclaiming, that is what it is supposed to do and is a good thing.
>> Or am I misunderstanding?
> Here is what the logs show when I do what I described above.
>
> 1) I got 7 transactions back in single user mode
> Aug 7 23:40:57 p2 postgres[30376]: [5-1] 2016-08-07 23:40:57 CEST WARNING: database "public" must be vacuumed within 999893 transactions

So the above is from when you enter single user mode?

> Aug 7 23:40:57 p2 postgres[30376]: [5-2] 2016-08-07 23:40:57 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.

I am not seeing what you do in single user mode?

> 2) I exit single user mode and restart the database
> Aug 7 23:41:40 p2 postgres[15457]: [1-1] 2016-08-07 23:41:40 CEST LOG: database system was shut down at 2016-08-07 23:41:32 CEST
> Aug 7 23:41:40 p2 postgres[15458]: [1-1] 2016-08-07 23:41:40 CEST LOG: incomplete startup packet
> Aug 7 23:41:40 p2 postgres[15459]: [1-1] 2016-08-07 23:41:40 CEST FATAL: the database system is starting up
> Aug 7 23:41:40 p2 postgres[15457]: [2-1] 2016-08-07 23:41:40 CEST WARNING: database with OID 16385 must be vacuumed within 999892 transactions

So you actually lost a transaction.

> Aug 7 23:41:40 p2 postgres[15457]: [2-2] 2016-08-07 23:41:40 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
> Aug 7 23:41:40 p2 postgres[15457]: [2-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:40 p2 postgres[15462]: [1-1] 2016-08-07 23:41:40 CEST LOG: autovacuum launcher started
> Aug 7 23:41:40 p2 postgres[15447]: [1-1] 2016-08-07 23:41:40 CEST LOG: database system is ready to accept connections
>
> 3) but I don’t even have time to run a query to see the state of the databases
> Aug 7 23:41:52 p2 postgres[15487]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:52 p2 postgres[15487]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:52 p2 postgres[15487]: [2-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:52 p2 postgres[15487]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: SELECT datname, age(datfrozenxid) FROM pg_database
> Aug 7 23:41:52 p2 postgres[15497]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:52 p2 postgres[15497]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:52 p2 postgres[15497]: [2-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:52 p2 postgres[15497]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: COPY ( SELECT 'VACUUM VERBOSE ' || c.oid::regclass || ';--' as _command, greatest ( age ( c.relfrozenxid ) , age ( t.relfrozenxid ) ) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ( 'r' , 'm' ) order by 2 desc limit 1000 ) TO STDOUT (FORMAT text)
> Aug 7 23:41:52 p2 postgres[15518]: [2-1] 2016-08-07 23:41:52 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:52 p2 postgres[15518]: [2-2] 2016-08-07 23:41:52 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:52 p2 postgres[15518]: [2-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:52 p2 postgres[15518]: [2-4] 2016-08-07 23:41:52 CEST STATEMENT: SELECT datname, age(datfrozenxid) FROM pg_database
> Aug 7 23:41:53 p2 rsyslogd-2177: imuxsock lost 3512 messages from pid 15476 due to rate-limiting

Hmm, I wonder what pid 15476 (see below) is trying to do that is blowing out the logging?

> Aug 7 23:41:53 p2 postgres[15476]: [2-1] 2016-08-07 23:41:53 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:53 p2 postgres[15476]: [2-2] 2016-08-07 23:41:53 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:53 p2 postgres[15476]: [2-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:53 p2 postgres[15476]: [2-4] 2016-08-07 23:41:53 CEST CONTEXT: automatic analyze of table "public.public.aaaaaa"
>
> 4) lots of autovacuum failures

It is trying to analyze tables and being refused, as are all commands.

> Aug 7 23:41:55 p2 postgres[15476]: [3-1] 2016-08-07 23:41:55 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:55 p2 postgres[15476]: [3-2] 2016-08-07 23:41:55 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:55 p2 postgres[15476]: [3-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:55 p2 postgres[15476]: [3-4] 2016-08-07 23:41:55 CEST CONTEXT: automatic analyze of table "public.public.aaaaab"
> Aug 7 23:41:56 p2 postgres[15476]: [4-1] 2016-08-07 23:41:56 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:56 p2 postgres[15476]: [4-2] 2016-08-07 23:41:56 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:56 p2 postgres[15476]: [4-3] #011You might also need to commit or roll back old prepared transactions.
> Aug 7 23:41:56 p2 postgres[15476]: [4-4] 2016-08-07 23:41:56 CEST CONTEXT: automatic analyze of table "public.public.aaaaac"
> Aug 7 23:41:56 p2 postgres[15476]: [5-1] 2016-08-07 23:41:56 CEST ERROR: database is not accepting commands to avoid wraparound data loss in database "public"
> Aug 7 23:41:56 p2 postgres[15476]: [5-2] 2016-08-07 23:41:56 CEST HINT: Stop the postmaster and use a standalone backend to vacuum that database.
> Aug 7 23:41:56 p2 postgres[15476]: [5-3] #011You might also need to commit or roll back old prepared transactions.
> (…)
>
>>> So it looks like I’m going to have to bite the bullet and really vacuum the whole database instead of just the oldest tables first, which will impact our production pipelines.
>> Or VACUUM the most heavily used tables in the database.
> Why would that help ? Aren’t the tables with the highest age the problem ?

I think we are saying the same thing: the tables that have accumulated the most xid debt.

>> The next question to be asked is; what is creating the transactions and is the transaction rate 'normal', or is there a possibility you have a rogue process or rogue processes in action?
> That’s always a possibility but I can’t think of what that would be.
> I was guessing that the auto-vacuum wasn’t vacuuming the tables with the highest age first. Isn’t that a possibility ?

Looks to me like it never gets a chance in normal mode to get its work done. As mentioned before, you need to create some headroom to work with. The logical place to do that would be in single user mode, where other sessions cannot interfere.

> Thanks

--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian,

All the logs posted were from syslog (that’s where postgres writes its log on our Ubuntu install).

>> 1) I got 7 transactions back in single user mode
>> Aug 7 23:40:57 p2 postgres[30376]: [5-1] 2016-08-07 23:40:57 CEST WARNING: database "public" must be vacuumed within 999893 transactions
> So the above is from when you enter single user mode?

Yes

>> Aug 7 23:40:57 p2 postgres[30376]: [5-2] 2016-08-07 23:40:57 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
> I am not seeing what you do in single user mode?

I ran

    SELECT age, array_agg(table_name)
    FROM (SELECT c.oid::regclass as table_name,
                 greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
          FROM pg_class c
          LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
          WHERE c.relkind IN ('r', 'm')
          order by 2 desc limit 1000) tt
    group by age order by age desc

and then one VACUUM per table listed, until I got enough transactions back.

I ended up letting VACUUM run as long as needed in single-user mode (that took another 24h, in addition to the first 30h) and got the max age down to 147M, which is great. In the end, I was probably trying to optimize restoring service too much.

Anyway, thanks for the help and suggestions. The responses were super important for dealing with the disk usage and making sure interrupting the vacuum wouldn’t cause any problems !
On Mon, Aug 8, 2016 at 12:08 AM, Philippe Girolami <philippe.girolami@mosaik.com> wrote:
>> Not understanding; 'the auto-vacuum daemon kicks in and burns through the transactions'.
>> Are you saying it is reclaiming xids for you or using them?
>> If reclaiming, that is what it is supposed to do and is a good thing.
>> Or am I misunderstanding?
> Here is what the logs show when I do what I described above.
>
> 1) I got 7 transactions back in single user mode
> Aug 7 23:40:57 p2 postgres[30376]: [5-1] 2016-08-07 23:40:57 CEST WARNING: database "public" must be vacuumed within 999893 transactions
> Aug 7 23:40:57 p2 postgres[30376]: [5-2] 2016-08-07 23:40:57 CEST HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.

What do you mean by getting 7 back? 'Back' compared to what? You had 999886 before vacuuming that table?

> 2) I exit single user mode and restart the database

999893 is not the number left before hitting the ordinary stop limit, it is the number left before your database becomes corrupted. 999893 is still past the ordinary stop limit (1,000,000), so there is no point taking it out of single user mode at this point, as it would already be in shutdown at the time it starts.

> 4) lots of autovacuum failures

9.1.23 (to be released tomorrow-ish) does include a change that might possibly have helped.

>> The next question to be asked is; what is creating the transactions and is the transaction rate 'normal', or is there a possibility you have a rogue process or rogue processes in action?
> That’s always a possibility but I can’t think of what that would be.

I think you are just misinterpreting what the reported number means, and there is no mysterious transaction consumption going on at all.

Cheers,
Jeff