Thread: [GENERAL] Autovacuum stuck for hours, blocking queries
I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ?        1-14:20:52 postgres: autovacuum worker process   <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Notes:
- This database table is used for about 6 million row writes per day, all of which are then deleted at the end of the day.
- We have a reasonable number of systems running identical databases and near identical load profiles, and this problem has only been seen on two of those systems. It occurs intermittently but reliably (once every few days). It persists until postgres is restarted (usually accomplished by a reboot) or the autovacuum thread is forcibly killed, although it often happens again a short time after the autovacuum worker process starts.
- I don't have data on whether this vacuum ever finishes - it certainly takes longer than a few hours.
- Our application kicks off a manual vacuum against this table each night which doesn't hit this problem, as far as we're aware.
- I have sometimes seen this behaviour when there was database corruption. In that case I see logs like the following. But it also occurs when there are no logs indicating database corruption (which is the purpose of this email):
  LOG: no left sibling (concurrent deletion?) in "<index_name>"
  ERROR: right sibling's left-link doesn't match: block 41 links to 127 instead of expected 79 in index "<index_name>"

Things that I have tried:
- Printing out a backtrace a few times after connecting to that process with gdb (see below). This indicates that the vacuum process isn't stuck, but I don't know what else it says.
- Reading the FAQ and release notes of other 9.3.x versions to see whether this looks like a known issue. I couldn't see anything obvious apart from an issue about 0000 files (which didn't apply to my system - there was SQL in the release note to check).
- Forcibly killing the autovacuum thread (obviously not a good solution to the problem!). This causes the ALTER TABLE query to complete correctly.

Questions:
- Is this expected? I was under the impression that vacuum should never take any blocking locks for any significant period of time.
- Should I report this as a bug?
- Should I change some config options to avoid this? (I know turning off autovacuum would avoid it, but that's not good practice.) I've put the vacuum settings for my database below.
- Should I avoid certain queries while autovacuum is happening?
- Are there more diags that I should gather to help diagnose this issue?

Thanks!

Tim

--System details--
Postgres 9.3.4 running on a Linux x64 VMWare virtual machine. 24 CPUs of 2.7 GHz. 48GB RAM.
It is also running another application (so the database is not the only thing on the server).

---Backtrace (from gdb)---
#0  0x00000000007328c3 in hash_search_with_hash_value ()
#1  0x00000000006369ba in BufTableLookup ()
#2  0x0000000000639192 in ?? ()
#3  0x0000000000639ade in ReadBufferExtended ()
#4  0x00000000004906d9 in _bt_getbuf ()
#5  0x00000000004910e1 in _bt_pagedel ()
#6  0x00000000004924d1 in ?? ()
#7  0x00000000004926da in ?? ()
#8  0x00000000004928fa in btbulkdelete ()
#9  0x000000000072cdbd in FunctionCall4Coll ()
#10 0x000000000048cf58 in index_bulk_delete ()
#11 0x000000000057bb85 in ?? ()
#12 0x000000000057cfe9 in lazy_vacuum_rel ()
#13 0x000000000057af56 in ?? ()
#14 0x000000000057b28c in vacuum ()
#15 0x000000000060c8fa in ?? ()
#16 0x000000000060cd96 in ?? ()
#17 0x000000000060ce66 in StartAutoVacWorker ()
#18 0x0000000000617602 in ?? ()
#19 <signal handler called>
#20 0x00007f9210c9d393 in __select_nocancel () from /lib64/libc.so.6
#21 0x0000000000618add in PostmasterMain ()
#22 0x00000000005b58d0 in main ()

#0  0x0000000000491107 in _bt_pagedel ()
#1  0x00000000004924d1 in ?? ()
#2  0x00000000004926da in ?? ()
#3  0x00000000004928fa in btbulkdelete ()
...

#0  0x000000000047a1ef in hash_any ()
#1  0x00000000007336e9 in tag_hash ()
#2  0x000000000063916c in ?? ()
#3  0x0000000000639ade in ReadBufferExtended ()
#4  0x00000000004906d9 in _bt_getbuf ()
#5  0x00000000004910e1 in _bt_pagedel ()
#6  0x00000000004924d1 in ?? ()
#7  0x00000000004926da in ?? ()
#8  0x00000000004928fa in btbulkdelete ()
...

---Vacuum settings in postgresql.conf---
# - Cost-Based Vacuum Delay -
vacuum_cost_delay = 10                  # 0-100 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
vacuum_cost_limit = 200                 # 1-10000 credits
#vacuum_defer_cleanup_age = 0           # number of xacts by which cleanup is delayed

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------
autovacuum = on                         # Enable autovacuum subprocess?  'on'
                                        # requires track_counts to also be on.
#log_autovacuum_min_duration = -1       # -1 disables, 0 logs all actions and
                                        # their durations, > 0 logs only
                                        # actions running at least this number
                                        # of milliseconds.
#autovacuum_max_workers = 3             # max number of autovacuum subprocesses
                                        # (change requires restart)
#autovacuum_naptime = 1min              # time between autovacuum runs
#autovacuum_vacuum_threshold = 50       # min number of row updates before vacuum
#autovacuum_analyze_threshold = 50      # min number of row updates before analyze
#autovacuum_vacuum_scale_factor = 0.2   # fraction of table size before vacuum
#autovacuum_analyze_scale_factor = 0.1  # fraction of table size before analyze
#autovacuum_freeze_max_age = 200000000  # maximum XID age before forced vacuum
                                        # (change requires restart)
#autovacuum_multixact_freeze_max_age = 400000000  # maximum Multixact age
                                        # before forced vacuum
                                        # (change requires restart)
#autovacuum_vacuum_cost_delay = 20ms    # default vacuum cost delay for
                                        # autovacuum, in milliseconds;
                                        # -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for
                                        # autovacuum, -1 means use
                                        # vacuum_cost_limit
#vacuum_freeze_min_age = 50000000
#vacuum_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_freeze_table_age = 150000000
On 02/15/2017 09:30 AM, Tim Bellis wrote:
> I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.
>
> The process blocking the query is:
> postgres 21985 11304 98 Feb13 ?        1-14:20:52 postgres: autovacuum worker process   <db_name>
> which is running the query
> autovacuum: VACUUM public.<table_name>
>
> The query being blocked is:
> ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
> (But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Other ALTER TABLE queries? If so I believe this might apply:

https://www.postgresql.org/docs/9.5/static/explicit-locking.html

SHARE UPDATE EXCLUSIVE

Conflicts with the SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode protects a table against concurrent schema changes and VACUUM runs.

Acquired by VACUUM (without FULL), ANALYZE, CREATE INDEX CONCURRENTLY, and ALTER TABLE VALIDATE and other ALTER TABLE variants (for full details see ALTER TABLE).

-- 
Adrian Klaver
adrian.klaver@aklaver.com
On Wed, Feb 15, 2017 at 10:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com> wrote:
> I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.
>
> The process blocking the query is:
> postgres 21985 11304 98 Feb13 ?        1-14:20:52 postgres: autovacuum worker process   <db_name>
> which is running the query
> autovacuum: VACUUM public.<table_name>
>
> The query being blocked is:
> ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
> (But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Yup, there's a priority inversion between DDL, DML and maintenance (vacuum). Vacuum runs slowly in the background. Normal update/delete/insert work fine because of the type of lock vacuum holds. Then a DDL gets in line. It has to wait on the vacuum, and the vacuum is set to run super slow. And everybody waits. On vacuum.

Basically it's bad practice to alter tables that are big and being worked on, because one way or another you're going to pay a price. I've used partitions for logging and auditing that autocreate and drop and vacuum, but they never get DDL done on them while they're getting updated, and vice versa.

There are also ways of making the table less likely / not likely to get vacuumed automatically. If you're willing to schedule DDL and vacuum on your own you can then mix the two in relative safety.
On Wed, Feb 15, 2017 at 3:26 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 10:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com> wrote:
>> I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.
>>
>> The process blocking the query is:
>> postgres 21985 11304 98 Feb13 ?        1-14:20:52 postgres: autovacuum worker process   <db_name>
>> which is running the query
>> autovacuum: VACUUM public.<table_name>
>>
>> The query being blocked is:
>> ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
>> (But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)
>
> There are also ways of making the table less likely / not likely /
> will not get vacuum automatically. If you're willing to schedule ddl
> and vacuum on your own you can then mix the two in relative safety.

Followup: https://www.postgresql.org/docs/9.3/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS
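The per-table storage parameter being pointed at could look like this (a sketch; `your_table` is a placeholder, and note that disabling autovacuum for one table does not stop for-wraparound autovacuums):

```sql
-- Sketch: stop autovacuum from choosing this one table, so VACUUM can be
-- scheduled manually at a quiet time. Anti-wraparound autovacuum will
-- still run on this table regardless of this setting.
ALTER TABLE your_table SET (autovacuum_enabled = false);

-- ...then run maintenance yourself on your own schedule:
VACUUM ANALYZE your_table;
```

The same parameter can be set back with `ALTER TABLE your_table RESET (autovacuum_enabled);`.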
Scott Marlowe wrote:
> Then a ddl gets in line. It has to wait on the vacuum, and the vacuum,
> set to run super slow. And everybody waits. On vacuum.

Note that this is normally not seen, because autovacuum cancels itself when somebody is blocked behind it -- until the table reaches the freeze_max_age limit, and then the autovacuum is a for-wraparound one that is no longer terminated, and then everybody has to wait on it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
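How close a table is to that freeze_max_age limit can be checked from the system catalogs -- a sketch (the columns are standard catalog columns; the LIMIT is arbitrary):

```sql
-- Sketch: compare each table's transaction-ID age against the threshold
-- at which a for-wraparound autovacuum is forced (default 200 million).
SELECT relname,
       age(relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```

A table whose xid_age is at or above freeze_max_age is due a for-wraparound vacuum of the kind described above.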
Thank you all - that's really useful :-)

The other query that gets blocked behind the vacuum is the below (truncated).

This query is generated by jdbc in this method:
org.postgresql.jdbc2.AbstractJdbc2DatabaseMetaData.getIndexInfo(AbstractJdbc2DatabaseMetaData.java:4023)

Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

Thank you all again,

Tim

SELECT NULL AS TABLE_CAT, n.nspname AS TABLE_SCHEM, ct.relname AS TABLE_NAME,
  NOT i.indisunique AS NON_UNIQUE, NULL AS INDEX_QUALIFIER, ci.relname AS INDEX_NAME,
  CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE,
  (i.keys).n AS ORDINAL_POSITION,
  pg_catalog.pg_get_indexdef(ci.oid, (i.keys).n, false) AS COLUMN_NAME,
  CASE am.amcanorder WHEN true THEN CASE i.indoption[(i.keys).n - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC,
  ci.reltuples AS CARDINALITY, ci.relpages AS PAGES,
  pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION
FROM pg_catalog.pg_class ct
  JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid)
  JOIN (SELECT i.indexrelid, i.indrelid, i.indoption, i.indisunique, i.indisclustered, i.indpred, i.indexprs,
        information_schema._pg_expandarray(i.indkey) AS keys
        FROM pg_catalog.pg_index i) i ON (ct.oid = i.ind
On 02/16/2017 08:45 AM, Tim Bellis wrote:
> Thank you all - that's really useful :-)
>
> The other query that gets blocked behind the vacuum is the below (truncated).
>
> This query is generated by jdbc in this method:
> org.postgresql.jdbc2.AbstractJdbc2DatabaseMetaData.getIndexInfo(AbstractJdbc2DatabaseMetaData.java:4023)

Is JDBC doing anything else before issuing this?

> Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

Table 13.2 here:

https://www.postgresql.org/docs/9.5/static/explicit-locking.html

shows the conflicts with SHARE UPDATE EXCLUSIVE (vacuum).

pg_locks:

https://www.postgresql.org/docs/9.5/static/view-pg-locks.html

shows locks being held. So next time it happens I would take a look and see if you can work backwards from there.

You could directly access the index information using:

https://www.postgresql.org/docs/9.6/static/catalog-pg-index.html
https://www.postgresql.org/docs/9.6/static/catalog-pg-class.html

> Thank you all again,
>
> Tim
>
> SELECT NULL AS TABLE_CAT, n.nspname AS TABLE_SCHEM, ct.relname AS TABLE_NAME,
>   NOT i.indisunique AS NON_UNIQUE, NULL AS INDEX_QUALIFIER, ci.relname AS INDEX_NAME,
>   CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE,
>   (i.keys).n AS ORDINAL_POSITION,
>   pg_catalog.pg_get_indexdef(ci.oid, (i.keys).n, false) AS COLUMN_NAME,
>   CASE am.amcanorder WHEN true THEN CASE i.indoption[(i.keys).n - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC,
>   ci.reltuples AS CARDINALITY, ci.relpages AS PAGES,
>   pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION
> FROM pg_catalog.pg_class ct
>   JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid)
>   JOIN (SELECT i.indexrelid, i.indrelid, i.indoption, i.indisunique, i.indisclustered, i.indpred, i.indexprs,
>         information_schema._pg_expandarray(i.indkey) AS keys
>         FROM pg_catalog.pg_index i) i ON (ct.oid = i.ind

This query is cut off so cannot say whether it is the issue or not.

-- 
Adrian Klaver
adrian.klaver@aklaver.com
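The direct catalog access being suggested might look something like this (a sketch; `your_table` is a placeholder). Since it reads only pg_index and pg_class, it takes no lock on the user table itself and so should not queue behind an ALTER TABLE the way a query touching the table would:

```sql
-- Sketch: fetch index information straight from the system catalogs
-- instead of via JDBC's getIndexInfo() metadata query.
SELECT c.relname  AS table_name,
       i.relname  AS index_name,
       x.indisunique,
       pg_get_indexdef(x.indexrelid) AS index_definition
FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
WHERE c.relname = 'your_table';
```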
Tim Bellis <Tim.Bellis@metaswitch.com> writes:
> Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

It's not the vacuum that's blocking your read-only queries. It's the ALTER TABLE, which needs an exclusive lock in order to alter the table's schema. The ALTER is queued waiting for the vacuum to finish, and lesser lock requests queue up behind it. We could let the non-exclusive lock requests go ahead of the ALTER, but that would create a severe risk of the ALTER *never* getting to run.

I'd kill the ALTER and figure on trying again after the vacuum is done.

Also you might want to look into how you got into a situation where you have an anti-wraparound vacuum that's taking so long to run. You didn't do something silly like disable autovacuum did you?

			regards, tom lane
Tom Lane wrote:
> Also you might want to look into how you got into a situation where
> you have an anti-wraparound vacuum that's taking so long to run.

If there are ALTERs running all the time, regular (non-anti-wraparound) vacuums would be canceled and never get a chance to run. Eventually, autovacuum decides it's had enough and doesn't cancel anymore, so everyone else gets stuck behind.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tim,

On 2017-02-15 18:30, Tim Bellis wrote:
> I have a postgres 9.3.4 database table which (intermittently but reliably)
> gets into a state where queries get blocked indefinitely
> [..]
> Notes:
> - This database table is used for about 6 million row writes per day,
> all of which are then deleted at the end of the day.

If you are dumping the contents of the table anyways, why not use TRUNCATE instead of DELETE? It unlinks and recreates the table data files, requiring nearly zero IO and analyzing.
Or even drop, and recreate the table with the correct new structure so you do not even need to ALTER TABLE?

I'm a bit skeptical of these suggestions since very competent people have already answered your post and did not come up with this... ;-)

The only drawback I'm aware of is that TRUNCATE will immediately free disk space on the OS level, so the table's space will not be "blocked". But probably the VACUUM you are currently performing will also eventually release the unused disk space, so this may or may not match the current behaviour.

Best regards,

	-hannes
The DELETE operation only deletes rows from the previous day. It's possible that there have been rows added that day which ought not to be deleted, so TRUNCATE wouldn't work. But that was a helpful suggestion - thanks!

Tim

-----Original Message-----
From: Hannes Erven [mailto:hannes@erven.at]
Sent: 17 February 2017 11:47
To: pgsql-general@postgresql.org
Cc: Tim Bellis <Tim.Bellis@metaswitch.com>
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

Hi Tim,

On 2017-02-15 18:30, Tim Bellis wrote:
> I have a postgres 9.3.4 database table which (intermittently but
> reliably) gets into a state where queries get blocked indefinitely
> [..]
> Notes:
> - This database table is used for about 6 million row writes per
> day, all of which are then deleted at the end of the day.

If you are dumping the contents of the table anyways, why not use TRUNCATE instead of DELETE? It unlinks and recreates the table data files, requiring nearly zero IO and analyzing.
Or even drop, and recreate the table with the correct new structure so you do not even need to ALTER TABLE?

I'm a bit skeptical of these suggestions since very competent people have already answered your post and did not come up with this... ;-) The only drawback I'm aware of is that TRUNCATE will immediately free disk space on the OS level, so the table's space will not be "blocked". But probably the VACUUM you are currently performing will also eventually release the unused disk space, so this may or may not match the current behaviour.

Best regards,

	-hannes
From: Jeff Janes [mailto:jeff.janes@gmail.com]
Sent: 17 February 2017 02:59
To: Tim Bellis <Tim.Bellis@metaswitch.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries
On Wed, Feb 15, 2017 at 9:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com> wrote:
I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.
The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>
Are you sure it doesn't really say:
autovacuum: VACUUM public.<table_name> (to prevent wraparound)
[Tim Bellis] It doesn’t. I was using the query from https://wiki.postgresql.org/wiki/Lock_Monitoring and looking at the ‘current_statement_in_blocking_process’ column. Is there a different query I should be using?
If it doesn't include the "to prevent wraparound", then it should sacrifice itself as soon as it realizes it is blocking something else. If it is not doing that, something is wrong.
If it does say "(to prevent wraparound)", then see all the other comments on this thread.
Notes:
- This database table is used for about 6 million row writes per day, all of which are then deleted at the end of the day.
How many transactions do those 6 million writes comprise?
[Tim Bellis] I’m not sure - I think it’s 6 million transactions, but there might be some batching going on that I’m unaware of. What would this affect? (I can dig in if necessary)
(I might have been slightly wrong in characterising the exact behaviour; the table might be cleared every hour rather than every day, but there are still 6 million writes per day)
- Our application kicks off a manual vacuum against this table each night which doesn't hit this problem, as far as we're aware.
If it were having problems, would you be aware of it? Do you see in the log files the completion of the vacuum? Or look in pg_stat_user_tables to see when last_vacuum was. If it runs every night and succeeds, it is hard to see why wraparound would ever kick in. Unless you are hitting 150,000,000 transactions in a day.
[Tim Bellis] I shall investigate this.
Cheers,
Jeff
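The pg_stat_user_tables check Jeff suggests could look like this (a sketch; `<table_name>` is the thread's placeholder, and `last_vacuum`/`last_autovacuum` are the statistics collector's per-table vacuum timestamps):

```sql
-- Sketch: confirm the nightly manual vacuum is actually completing by
-- checking when the table was last vacuumed, manually and by autovacuum.
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = '<table_name>';
```

If last_vacuum never advances past the night the problem started, the nightly vacuum is not succeeding as assumed.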
Hi Tim,

On 2017-02-17 17:02, Tim Bellis wrote:
> The DELETE operation only deletes rows from the
> previous day. It's possible that there have been rows
> added that day which ought not to be deleted, so
> TRUNCATE wouldn't work.

OK, then I'll try two other suggestions:

- use table partitioning ( https://www.postgresql.org/docs/9.3/static/ddl-partitioning.html )

- if the number of rows you need to keep is small, you could try something like this:

LOCK TABLE yourtable <a suitable LOCKMODE>;
CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
TRUNCATE yourtable;
INSERT INTO yourtable SELECT * from keep;
COMMIT;

Best regards,

	-hannes

> -----Original Message-----
> From: Hannes Erven [mailto:hannes@erven.at]
> Sent: 17 February 2017 11:47
> To: pgsql-general@postgresql.org
> Cc: Tim Bellis <Tim.Bellis@metaswitch.com>
> Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries
>
> Hi Tim,
>
> On 2017-02-15 18:30, Tim Bellis wrote:
>> I have a postgres 9.3.4 database table which (intermittently but
>> reliably) gets into a state where queries get blocked indefinitely
>> [..]
>> Notes:
>> - This database table is used for about 6 million row writes per
>> day, all of which are then deleted at the end of the day.
>
> If you are dumping the contents of the table anyways, why not use TRUNCATE instead of DELETE? It unlinks and recreates the table data files, requiring nearly zero IO and analyzing.
> Or even drop, and recreate the table with the correct new structure so you do not even need to ALTER TABLE?
>
> I'm a bit skeptical of these suggestions since very competent people have already answered your post and did not come up with this... ;-) The only drawback I'm aware of is that TRUNCATE will immediately free disk space on the OS level, so the table's space will not be "blocked". But probably the VACUUM you are currently performing will also eventually release the unused disk space, so this may or may not match the current behaviour.
>
> Best regards,
>
> -hannes
> LOCK TABLE yourtable <a suitable LOCKMODE>;
> CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
> TRUNCATE yourtable;
> INSERT INTO yourtable SELECT * from keep;
> COMMIT;
===
The above snippet assumes TRUNCATE in PG can be in a transaction. In other words, while TRUNCATE by itself is atomic, it can't be rolled back. So in the above case, if "INSERT INTO yourtable SELECT * from keep;" fails and we rollback, will it roll back yourtable?
On Fri, Feb 17, 2017 at 1:38 PM, Rakesh Kumar <rakeshkumar464@outlook.com> wrote:
> LOCK TABLE yourtable <a suitable LOCKMODE>;
> CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
> TRUNCATE yourtable;
> INSERT INTO yourtable SELECT * from keep;
> COMMIT;
> ===
> the above snippet assumes truncate in PG can be in a transaction. In other words, while truncate by itself
> is atomic, it can't be rolled back. So in the above case, if "INSERT INTO yourtable SELECT * from keep;" and
> we rollback, will it rollback yourtable.

Yes it can. Truncate has been rollbackable for a while now.

begin;
create table ...
insert into table ...
truncate old table
... something goes wrong ...
rollback;

Unless I misunderstand your meaning.
On Sat, Feb 18, 2017 at 1:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Yes it can. Truncate has been rollbackable for a while now.

Per the docs:
https://www.postgresql.org/docs/9.6/static/sql-truncate.html
"TRUNCATE is transaction-safe with respect to the data in the tables: the truncation will be safely rolled back if the surrounding transaction does not commit."
In short, yes, a transaction doing a TRUNCATE can be rolled back.

-- 
Michael
On 02/17/2017 11:54 PM, Michael Paquier wrote:
> On Sat, Feb 18, 2017 at 1:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> Yes it can. Truncate has been rollbackable for a while now.
>
> Per the docs:
> https://www.postgresql.org/docs/9.6/static/sql-truncate.html
> "TRUNCATE is transaction-safe with respect to the data in the tables:
> the truncation will be safely rolled back if the surrounding
> transaction does not commit."
> In short yes a transaction doing a truncate can be rollbacked.

I think the part that confuses people into thinking it can not be rolled back is this:

"TRUNCATE is not MVCC-safe. After truncation, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the truncation occurred. See Section 13.5 for more details."

-- 
Adrian Klaver
adrian.klaver@aklaver.com
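Concretely, the rollback behaviour being discussed looks like this (a sketch, reusing the `yourtable` placeholder from the snippet above):

```sql
BEGIN;
TRUNCATE yourtable;   -- table appears empty within this transaction
-- ... a later statement in the transaction fails, or we decide to abort ...
ROLLBACK;             -- the truncation is undone along with everything else
-- yourtable now contains all of its original rows again
```

The MVCC caveat Adrian quotes still applies: a concurrent transaction reading with an older snapshot may see the table as empty while the truncating transaction is in progress.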
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 16 February 2017 22:40
To: Tim Bellis <Tim.Bellis@metaswitch.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>; pgsql-general@postgresql.org; Alvaro Herrera <alvherre@2ndquadrant.com>; Scott Marlowe <scott.marlowe@gmail.com>
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

> Tim Bellis <Tim.Bellis@metaswitch.com> writes:
>> Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?
>
> It's not the vacuum that's blocking your read-only queries. It's the ALTER TABLE, which needs an exclusive lock in order to alter the table's schema. The ALTER is queued waiting for the vacuum to finish, and lesser lock requests queue up behind it. We could let the non-exclusive lock requests go ahead of the ALTER, but that would create a severe risk of the ALTER *never* getting to run.

The lock monitoring query (https://wiki.postgresql.org/wiki/Lock_Monitoring) said that the blocking_pid and the current_statement_in_blocking_process for the queries reading the index data was the autovacuum, not the ALTER. Am I reading the output wrong? Does it not correctly represent the chain of locks?

> I'd kill the ALTER and figure on trying again after the vacuum is done.
>
> Also you might want to look into how you got into a situation where you have an anti-wraparound vacuum that's taking so long to run.
> You didn't do something silly like disable autovacuum did you?

No, autovacuum is on (and this is an autovacuum which is in progress). But I will look at why I'm getting a blocking autovacuum.

> regards, tom lane
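A raw look at the lock queue can disambiguate who is waiting on whom -- a sketch (`your_table` is a placeholder; rows with granted = false are queued, which would show the ALTER sitting between the vacuum's granted lock and the waiting reads):

```sql
-- Sketch: inspect every lock requested on the table; the ungranted rows
-- are the wait queue behind whoever currently holds a conflicting lock.
SELECT l.pid, l.mode, l.granted, a.state, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'your_table'::regclass
ORDER BY l.granted DESC;
```

The wiki query reports only a blocked/blocking pairing, so it can name the autovacuum (the lock holder) rather than the queued ALTER that the reads are actually stuck behind.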
On Thursday, February 16, 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tim Bellis <Tim.Bellis@metaswitch.com> writes:
>> Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?
>
> It's not the vacuum that's blocking your read-only queries. It's the
> ALTER TABLE, which needs an exclusive lock in order to alter the table's
> schema. The ALTER is queued waiting for the vacuum to finish, and lesser
> lock requests queue up behind it. We could let the non-exclusive lock
> requests go ahead of the ALTER, but that would create a severe risk of the
> ALTER *never* getting to run.
>
> I'd kill the ALTER and figure on trying again after the vacuum is done.

I've been drilled by this and similar lock stacking issues enough times to make me near 100% sure deferring the ALTER would be the better choice.
On Wed, 22 Feb 2017 13:19:11 -0800 Jeff Janes <jeff.janes@gmail.com> wrote: > On Mon, Feb 20, 2017 at 5:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > > > > On Thursday, February 16, 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >> Tim Bellis <Tim.Bellis@metaswitch.com> writes: > >> > Even though this is a read only query, is it also expected to be > >> blocked behind the vacuum? Is there a way of getting indexes for a table > >> which won't be blocked behind a vacuum? > >> > >> It's not the vacuum that's blocking your read-only queries. It's the > >> ALTER TABLE, which needs an exclusive lock in order to alter the table's > >> schema. The ALTER is queued waiting for the vacuum to finish, and lesser > >> lock requests queue up behind it. We could let the non-exclusive lock > >> requests go ahead of the ALTER, but that would create a severe risk of the > >> ALTER *never* getting to run. > >> > >> I'd kill the ALTER and figure on trying again after the vacuum is done. > >> > >> > > I've been drilled by this and similar lock stacking issues enough times to > > make me near 100% sure deferring the ALTER would be the better choice > > > > > This seems like a rather one-sided observation. How could you know how > often the unimplemented behavior also would have "drilled" you, since it is > unimplemented? > > There was a proposal for a "LOCK TABLE...DEFERABLE" which would allow other > requestors jump the queue if they were compatible with the held lock. If > that is implemented, then you would just manually lock the table deferably > before invoking the ALTER TABLE command, if that is the behavior you wanted > (but it wouldn't work for things that can't be run in transactions) This seems redundant to me. We already have LOCK ... NOWAIT. It's fairly trivial to write an upgrade script that uses LOCK to explicitly lock tables that it's going to ALTER, then busy-waits if the lock is not immediately grantable. 
The fact that so many ORMs and similar tools don't take advantage of that
functionality is rather depressing.

In my experience, I've also seen heavily loaded systems that this wouldn't
work on, essentially because there is _always_ _some_ lock on every table.
This is a case where experienced developers are required to take some extra
time to coordinate their upgrades to work around the high load. But the
proposed LOCK TABLE ... DEFERRABLE wouldn't help there either, because the
ALTER would be deferred indefinitely.

Personally, I feel the existing behavior is preferable. Software teams need
to take the time to understand the locking implications of their actions or
they'll have nothing but trouble anyway. As I've seen time and again:
writing an application that handles low load and low concurrency is fairly
trivial, but scaling that app up to high load and/or high concurrency
generally sorts out the truly brilliant developers from the merely average.

--
Bill Moran <wmoran@potentialtech.com>
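The busy-wait upgrade script described above could be sketched like this (table and column names are placeholders; the DO block keeps the LOCK and the ALTER in one transaction):

```sql
-- Retry LOCK ... NOWAIT until the lock is immediately grantable, so the
-- ALTER never sits in the queue blocking readers behind it.
DO $$
BEGIN
  LOOP
    BEGIN
      LOCK TABLE my_table IN ACCESS EXCLUSIVE MODE NOWAIT;
      EXIT;  -- lock acquired; held until the DO block's transaction ends
    EXCEPTION WHEN lock_not_available THEN
      PERFORM pg_sleep(0.5);  -- back off briefly, then retry
    END;
  END LOOP;

  ALTER TABLE my_table ALTER COLUMN my_column DROP DEFAULT;
END
$$;
```

The point of NOWAIT is that a failed attempt never enters the lock queue, so concurrent readers are never stuck behind a pending ACCESS EXCLUSIVE request while the script waits.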
On Wed, Feb 22, 2017 at 3:19 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 5:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thursday, February 16, 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Tim Bellis <Tim.Bellis@metaswitch.com> writes:
>>> > Even though this is a read only query, is it also expected to be
>>> > blocked behind the vacuum? Is there a way of getting indexes for a table
>>> > which won't be blocked behind a vacuum?
>>>
>>> It's not the vacuum that's blocking your read-only queries. It's the
>>> ALTER TABLE, which needs an exclusive lock in order to alter the table's
>>> schema. The ALTER is queued waiting for the vacuum to finish, and lesser
>>> lock requests queue up behind it. We could let the non-exclusive lock
>>> requests go ahead of the ALTER, but that would create a severe risk of
>>> the ALTER *never* getting to run.
>>>
>>> I'd kill the ALTER and figure on trying again after the vacuum is done.
>>
>> I've been drilled by this and similar lock stacking issues enough times to
>> make me near 100% sure deferring the ALTER would be the better choice
>
> This seems like a rather one-sided observation. How could you know how
> often the unimplemented behavior also would have "drilled" you, since it is
> unimplemented?

Well, that I can't really say, but at least in my case ALTER TABLE in the
face of concurrent application activity can wait, whereas locking tables
against reads for an indefinite period will rapidly destabilize the system.
An example of this usage is replacing partitions on a warehouse table.
About half of my P1s over the last 12 months relate to a locking problem of
some kind, so I end up with workarounds such as issuing "LOCK ... NOWAIT"
in a sleep loop :( or application restructuring, especially minimizing use
of TRUNCATE. I do think instrumentation around locking behaviors would be
helpful.
Allowing waiters to (optionally) leapfrog the queue when their lock request
is compatible with the held lock would be wonderful, as would being able to
specify maximum wait timeouts inside a transaction. FWIW, I'm not sure this
behavior makes sense attached to LOCK; I'd rather see it attached generally
to SET TRANSACTION -- my 0.02$ (talk is cheap, etc).

merlin
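For the second wish, 9.3 did add a per-statement cap: the lock_timeout setting, which aborts any statement that waits longer than the given interval for a lock. Applied to the original problem (names are placeholders), something like:

```sql
-- Bound how long the ALTER may sit in the lock queue. If the timeout
-- fires, the ALTER errors out and leaves the queue, so readers stacked
-- behind it are released instead of waiting for hours.
BEGIN;
SET LOCAL lock_timeout = '2s';
ALTER TABLE my_table ALTER COLUMN my_column DROP DEFAULT;
COMMIT;
```

SET LOCAL scopes the timeout to this transaction, so the rest of the session keeps its default behavior; the script can then retry the ALTER later rather than blocking everything behind a long autovacuum.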