Thread: auto truncate/vacuum full
Hi all,
I've got a weird thing on one of my databases this night:
- I've a monthly partition for storing activity logs defined as this:
- mother log table
- one child partition for each month
- Last friday I dumped the last month partition, and tried to truncate it, which locked lots of queries because the mother table was also locked. So I killed the TRUNCATE and performed a DELETE on my partition
- Last night the database locked. pg_log full of messages about insert into the mother table waiting for a lock.
- After about 40min, the waiting queries acquired their locks and ran.
- At the same time, my munin graphs show a decrease of my dbase size, of about the size of my partition, which pg_(total_)relation_size says has a zero size now.
- No slow query mentionning truncate nor vacuum full.
- According to pg_stat_user_tables an autovacuum was run on the table just after the unlock.
So my question are: can the autovacuum daemon perform vacuum full ? Or another internal postgres process ? Could it come from the TRUNCATE I run and canceled 4 days before ?
I'm running postgresql-8.3.7
Regards,
--
JC
Ph'nglui mglw'nafh Cthulhu n'gah Bill R'lyeh Wgah'nagl fhtagn!
I've got a weird thing on one of my databases this night:
- I've a monthly partition for storing activity logs defined as this:
- mother log table
- one child partition for each month
- Last friday I dumped the last month partition, and tried to truncate it, which locked lots of queries because the mother table was also locked. So I killed the TRUNCATE and performed a DELETE on my partition
- Last night the database locked. pg_log full of messages about insert into the mother table waiting for a lock.
- After about 40min, the waiting queries acquired their locks and ran.
- At the same time, my munin graphs show a decrease of my dbase size, of about the size of my partition, which pg_(total_)relation_size says has a zero size now.
- No slow query mentionning truncate nor vacuum full.
- According to pg_stat_user_tables an autovacuum was run on the table just after the unlock.
So my question are: can the autovacuum daemon perform vacuum full ? Or another internal postgres process ? Could it come from the TRUNCATE I run and canceled 4 days before ?
I'm running postgresql-8.3.7
Regards,
--
JC
Ph'nglui mglw'nafh Cthulhu n'gah Bill R'lyeh Wgah'nagl fhtagn!
JC Praud escribió: > So my question are: can the autovacuum daemon perform vacuum full ? Or > another internal postgres process ? Could it come from the TRUNCATE I run > and canceled 4 days before ? No. Autovacuum only issues commands that don't lock tables strongly. I doubt this has anything to do with your old TRUNCATE either. My guess is that somebody else ran TRUNCATE and forgot to tell you; or maybe an automatic external process (cron or some such). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
2009/10/27 Alvaro Herrera <alvherre@commandprompt.com>: > JC Praud escribió: > >> So my question are: can the autovacuum daemon perform vacuum full ? Or >> another internal postgres process ? Could it come from the TRUNCATE I run >> and canceled 4 days before ? > > No. Autovacuum only issues commands that don't lock tables strongly. I > doubt this has anything to do with your old TRUNCATE either. My guess > is that somebody else ran TRUNCATE and forgot to tell you; or maybe an > automatic external process (cron or some such). look again on code. I checked it today. When is possible to drop some pages from end, then vacuum do lock, and try to remove free pages. I checked it. This process is started from full vacuum, lazy vacuum and autovacuum too. We have similar issue like JC has. Regards Pavel > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
Pavel Stehule escribió: > 2009/10/27 Alvaro Herrera <alvherre@commandprompt.com>: > > JC Praud escribió: > > > >> So my question are: can the autovacuum daemon perform vacuum full ? Or > >> another internal postgres process ? Could it come from the TRUNCATE I run > >> and canceled 4 days before ? > > > > No. Autovacuum only issues commands that don't lock tables strongly. I > > doubt this has anything to do with your old TRUNCATE either. My guess > > is that somebody else ran TRUNCATE and forgot to tell you; or maybe an > > automatic external process (cron or some such). > > look again on code. I checked it today. When is possible to drop some > pages from end, then vacuum do lock, and try to remove free pages. I > checked it. This process is started from full vacuum, lazy vacuum and > autovacuum too. We have similar issue like JC has. Oh, the truncation bit. Yeah. It uses a conditional lock acquire though, so if the table is already locked it shouldn't try to do anything. Now 40 mins walking those pages to figure out that they need to be truncated, I concede that it's too much. Maybe we shouldn't be doing a backwards scan; perhaps this breaks the OS readahead and make it even slower. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, 27 Oct 2009, Alvaro Herrera wrote: > Now 40 mins walking those pages to figure out that they need to be > truncated, I concede that it's too much. Maybe we shouldn't be doing a > backwards scan; perhaps this breaks the OS readahead and make it even > slower. I've watched that take hours before on a large table after purging hundreds of gigabytes of old historical data. Improvements to speed that up like scanning more efficiently would be welcome. But given the potential for a really bad worst-case here, I have to wonder if this really needs to get broken up into bits with finer locking instead of micromanaging the details. (Yes, I should have been using date-range partitioning instead and just dropped the old partitions, but sometimes these things grow only after you've made design decisions the wrong way) -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > On Tue, 27 Oct 2009, Alvaro Herrera wrote: >> Now 40 mins walking those pages to figure out that they need to be >> truncated, I concede that it's too much. Maybe we shouldn't be doing a >> backwards scan; perhaps this breaks the OS readahead and make it even >> slower. > I've watched that take hours before on a large table after purging > hundreds of gigabytes of old historical data. Shouldn't autovacuum be getting kicked off the lock if anyone does something that conflicts? This seems like it should be a nonproblem if everything is operating as designed. The issue I can see is that we might never be able to complete any truncation if there's a lot of potentially removable pages and a pretty steady flow of conflicting lock attempts. But that would result in failure to remove bloat, not stoppage of conflicting queries. It might be worth allowing autovacuum to execute the truncation every few hundred pages, so as to ensure that progress can be made even if it never gets a very long uninterrupted lock on the table. regards, tom lane
On Tue, 27 Oct 2009, Tom Lane wrote: > The issue I can see is that we might never be able to complete any > truncation if there's a lot of potentially removable pages and a pretty > steady flow of conflicting lock attempts. But that would result in > failure to remove bloat, not stoppage of conflicting queries. That's exactly it. It's possible to get into a situation where it becomes increasingly difficult to even catch up bloat once you get behind by a certain amount. All it takes is one giant deletion and you can get stuck here until there's a window to take the database down altogether, and you could end up down for hours waiting for that to execute. > It might be worth allowing autovacuum to execute the truncation every > few hundred pages, so as to ensure that progress can be made even if it > never gets a very long uninterrupted lock on the table. That was the sort of thing I was alluding to, moving in that direction would seem much more valuable than trying to optimize the existing approach better. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Alvaro Herrera <alvherre@commandprompt.com> writes: > Now 40 mins walking those pages to figure out that they need to be > truncated, I concede that it's too much. Maybe we shouldn't be doing a > backwards scan; perhaps this breaks the OS readahead and make it even > slower. That's very possible, since a backwards scan is guaranteed to destroy any rotational positioning the OS has done --- you'll have to wait at least one disk rotation for each page, whereas a forward scan could hope to do better than that. Maybe we could do this along with my idea of incremental truncation. Scan the last K pages of the relation *forwards*, truncate as appropriate, repeat, until finding a nonempty page. Choosing the max value of K might be a bit tricky. regards, tom lane
Greg Smith <gsmith@gregsmith.com> writes: > On Tue, 27 Oct 2009, Tom Lane wrote: >> The issue I can see is that we might never be able to complete any >> truncation if there's a lot of potentially removable pages and a pretty >> steady flow of conflicting lock attempts. But that would result in >> failure to remove bloat, not stoppage of conflicting queries. > That's exactly it. That might be exactly what's worrying *you*, but my point was that it doesn't appear to explain the OP's complaint. Whatever issue he has got is something else. regards, tom lane
JC Praud escribió: > - Last night the database locked. pg_log full of messages about insert into > the mother table waiting for a lock. This bit does not make much sense to me. A transaction waiting will not show up in the log. Were they cancelled? Can you paste an extract from the log? > - After about 40min, the waiting queries acquired their locks and ran. Do you have a vacuum in cron or something like that? As Tom says, if it had been autovacuum, it should have been cancelled automatically (else we've got a bug); but something invoking vacuum externally wouldn't have, so what you describe is what we would expect. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Oct 27, 2009 at 6:31 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
I suspected that, too. And asked the ninjas we have here ;) I also checked the vacuum cronjobs we have.
None performing full vacuum on this table.
If it was the case, I should have seen the query in the pg_log as a slow query, I guess ?
Regards,
JC Praud escribió:No. Autovacuum only issues commands that don't lock tables strongly. I
> So my question are: can the autovacuum daemon perform vacuum full ? Or
> another internal postgres process ? Could it come from the TRUNCATE I run
> and canceled 4 days before ?
doubt this has anything to do with your old TRUNCATE either. My guess
is that somebody else ran TRUNCATE and forgot to tell you; or maybe an
automatic external process (cron or some such).
I suspected that, too. And asked the ninjas we have here ;) I also checked the vacuum cronjobs we have.
None performing full vacuum on this table.
If it was the case, I should have seen the query in the pg_log as a slow query, I guess ?
Regards,
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
--
JC
Ph'nglui mglw'nafh Cthulhu n'gah Bill R'lyeh Wgah'nagl fhtagn!
Sorry, I got a bit lost in the thread. BTW, Thanks for all the answers :)
--
JC
Ph'nglui mglw'nafh Cthulhu n'gah Bill R'lyeh Wgah'nagl fhtagn!
On Wed, Oct 28, 2009 at 12:29 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
No, the transactions were not cancelled. All I saw in he pg_log is this (taken at the time le lock was lifted):
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 LOG: process 14866 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.271 ms
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 LOG: process 14048 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.409 ms
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:39:50 CET dbuser dbase 10.33.10.41 LOG: duration: 62103.487 ms statement: SELECT put_root_files_into_ag_scheduler_delete_files( 8 , 50 );
2009-10-27 05:39:57 CET dbuser dbase 10.33.10.133 LOG: process 14797 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.362 ms
2009-10-27 05:39:57 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.33 LOG: process 18964 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572619.123 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.33 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 LOG: process 24284 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572193.509 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 LOG: process 19497 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572005.173 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
The log_user table is the mother table with rules redirecting the inserts to the monthly partitions.
I was not monitoring the database at the time, all I saw is that pg_stat_user_tables dates the last_autovacuum at 2009-10-27 05:40:09.611129+01 this day on the partition.
We have no cronjobs running vacuum on these monthly partition.
No sign of canceled vacuums neither.
Regards,
JC Praud escribió:This bit does not make much sense to me. A transaction waiting will not
> - Last night the database locked. pg_log full of messages about insert into
> the mother table waiting for a lock.
show up in the log. Were they cancelled? Can you paste an extract from
the log?
No, the transactions were not cancelled. All I saw in he pg_log is this (taken at the time le lock was lifted):
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 LOG: process 14866 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.271 ms
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 LOG: process 14048 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.409 ms
2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:39:50 CET dbuser dbase 10.33.10.41 LOG: duration: 62103.487 ms statement: SELECT put_root_files_into_ag_scheduler_delete_files( 8 , 50 );
2009-10-27 05:39:57 CET dbuser dbase 10.33.10.133 LOG: process 14797 still waiting for RowExclusiveLock on relation 26683 of database 23806 after 5000.362 ms
2009-10-27 05:39:57 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.33 LOG: process 18964 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572619.123 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.33 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 LOG: process 24284 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572193.509 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 LOG: process 19497 acquired RowExclusiveLock on relation 26683 of database 23806 after 3572005.173 ms
2009-10-27 05:40:09 CET dbuser dbase 10.33.10.133 STATEMENT: INSERT INTO log_user (id_user, action, comment, date, id_session, ip_addr) VALUES (...)
The log_user table is the mother table with rules redirecting the inserts to the monthly partitions.
Do you have a vacuum in cron or something like that? As Tom says, if it
> - After about 40min, the waiting queries acquired their locks and ran.
had been autovacuum, it should have been cancelled automatically (else
we've got a bug); but something invoking vacuum externally wouldn't
have, so what you describe is what we would expect.
I was not monitoring the database at the time, all I saw is that pg_stat_user_tables dates the last_autovacuum at 2009-10-27 05:40:09.611129+01 this day on the partition.
We have no cronjobs running vacuum on these monthly partition.
No sign of canceled vacuums neither.
Regards,
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
--
JC
Ph'nglui mglw'nafh Cthulhu n'gah Bill R'lyeh Wgah'nagl fhtagn!
JC Praud escribió: > On Wed, Oct 28, 2009 at 12:29 AM, Alvaro Herrera <alvherre@commandprompt.com > > This bit does not make much sense to me. A transaction waiting will not > > show up in the log. Were they cancelled? Can you paste an extract from > > the log? > > No, the transactions were not cancelled. All I saw in he pg_log is this > (taken at the time le lock was lifted): > > 2009-10-27 05:39:19 CET dbuser dbase 10.33.10.133 LOG: process 14866 still > waiting for RowExclusiveLock on relation 26683 of database 23806 after > 5000.271 ms Oh, you have log_lock_waits enabled. It makes plenty of sense now, thanks. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, Oct 27, 2009 at 6:29 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > Do you have a vacuum in cron or something like that? As Tom says, if it > had been autovacuum, it should have been cancelled automatically (else > we've got a bug); but something invoking vacuum externally wouldn't > have, so what you describe is what we would expect. > then we have a bug (at least in 8.3, haven't tried in 8.4)... i see this a month ago, an autovacuum blocking a lot of concurrent updates and selects... once i pg_cancel_backend() the autovacuum process the other ones starting to move -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Jaime Casanova <jcasanov@systemguards.com.ec> writes: > On Tue, Oct 27, 2009 at 6:29 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> Do you have a vacuum in cron or something like that? As Tom says, if it >> had been autovacuum, it should have been cancelled automatically (else >> we've got a bug); but something invoking vacuum externally wouldn't >> have, so what you describe is what we would expect. > then we have a bug (at least in 8.3, haven't tried in 8.4)... i see > this a month ago, an autovacuum blocking a lot of concurrent updates > and selects... once i pg_cancel_backend() the autovacuum process the > other ones starting to move Hmm ... actually there is one case where autovac won't allow itself to be kicked off locks, which is if it's performing an anti-wraparound vacuum. Perhaps anti-wraparound vacuums should skip trying to truncate relations? I'm not convinced that that explains Jaime's report though. You'd expect AW vacuums to only happen on mostly-unused tables, not ones that are sufficiently central to an application to result in blocking a lot of queries ... regards, tom lane