Thread: Re: BUG #14938: ALTER TABLE hang/ poor performance

Re: BUG #14938: ALTER TABLE hang/ poor performance

From
Tomas Vondra
Date:
On 12/01/2017 12:30 AM, dipesh.kamdar@gmail.com wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      14938
> Logged by:          Dipesh Kamdar
> Email address:      dipesh.kamdar@gmail.com
> PostgreSQL version: 9.5.3
> Operating system:   Linux
> Description:        
> 
> Postgres : 9.5.3
> Kernel : 3.10.0-327.13.1.el7.x86_64
> Linux :  x86_64 x86_64 x86_64 GNU/Linux
> RAM 128GB
> DISK : 1.5TB
> 

9.5.3 was released more than a year ago; you might be hitting one of the
bugs fixed since then.

> 
> We have a nightly job that removes millions of records from multiple tables.
> 
> We had the following approach:
> 1.    Delete data from the table in batches of 50000
> 
> Problem with the above approach: many times autovacuum on the table and the
> DELETE statement on the table create a deadlock.
> In order to avoid the above problem, we have taken the following approach.
> 
> 

What do you mean by deadlock? Moreover, autovacuum certainly should not
deadlock with anything (certainly not with DELETE). In the worst case
autovacuum should cancel itself automatically.

I also don't quite understand why you do the delete in batches of 50k
rows, to be honest.

> 1.    Turn off autovacuum on the table using ALTER TABLE <tablename> SET (
> autovacuum_enabled=false);
> 2.    Delete data from the table in batches of 50000
> 3.    Turn autovacuum back on using ALTER TABLE <tablename> SET (
> autovacuum_enabled=true);
> 
> 
> Problem with the second approach: ALTER TABLE tablename SET (
> autovacuum_enabled=FALSE) gets hung very often. I am not seeing
> anything in pg_locks that is waiting for a resource, or any other
> process blocking. The process many times takes 12 hours, 13 hours, etc.

Are you sure there's nothing in pg_locks? What does

  SELECT pg_backend_pid();
  ALTER TABLE tablename SET (autovacuum_enabled=FALSE);

and then in another session (when the ALTER TABLE gets stuck)

  SELECT * FROM pg_locks WHERE pid = $PID; <- pg_backend_pid
  SELECT * FROM pg_locks WHERE NOT granted;

show?

If it really does not show any waiting locks, then you'll need to
inspect it using gdb. Install debuginfo packages, and then do

  gdb -p $PID
  (gdb) bt

FWIW I really doubt you need this process of disabling/enabling
autovacuum. This should work just fine with autovacuum running.

>
> Found article on net regarding vacuum stuck 
> 
> http://rhaas.blogspot.com/2011/03/troubleshooting-stuck-vacuums.html
>

That is entirely unrelated. The article is about the autovacuum process
getting stuck, while you're observing ALTER TABLE getting stuck.

> The database monitoring script is not reporting any waiting; pg_log is not
> reporting any error or deadlock.
> Is there any way we can figure out whether any process is blocking or
> waiting for a resource, etc.?
> My basic understanding is that this ALTER TABLE SET command updates the
> record in the pg_class.reloptions column.
> Does the ALTER TABLE SET option block the complete table?
> 

It does need a lock on the table, yes. So if there are any long-running
queries accessing that table, it may need to wait for them to complete.
But that should be visible in pg_locks, and you claim there's nothing
(no locks waiting).
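
If the ALTER TABLE is queueing behind long-running queries, one way to keep
it from waiting indefinitely is to bound the lock wait and retry later. A
sketch (the table name is just a placeholder):

```sql
-- Fail fast instead of queueing behind long-running queries;
-- retry the ALTER TABLE later if the lock is not acquired in 5 seconds.
BEGIN;
SET LOCAL lock_timeout = '5s';
ALTER TABLE tablename SET (autovacuum_enabled = false);
COMMIT;
```

On timeout the statement fails with "canceling statement due to lock
timeout" instead of blocking everything queued behind it.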

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14938: ALTER TABLE hang/ poor performance

From
Dipesh Kamdar
Date:
#1 "9.5.3 was released more than a year ago, you might be hitting one of the
bugs fixed since then."

>>> Do you know of any related bug?

#2 "What do you mean by deadlock? Moreover, autovacuum certainly should not
deadlock with anything (certainly not with DELETE)."

>>> Please take a look at the sequence:

    T1: deleting a batch of 50K rows from TAB1
    T2: autovacuum checks TAB1 (since the delete is running, it skips the table)
    T3: delete completed
    T4: before the next delete on the same table starts, autovacuum starts
        on TAB1; the DELETE statement then waits for autovacuum to release
        its lock. Some time later a deadlock is reported in the log and the
        DELETE process is killed.

#3 "I also don't quite understand why you do the delete in batches of 50k
rows, to be honest."

>>> It is a client-facing application with close to 1.5 billion records,
with multiple FKs and indexes. Removing millions of records in a single
statement means putting the entire online application on fire.

#4 I am running the following SQL as a monitoring point every 10 minutes.
Let me know if anything is missing from the monitoring script.

  SELECT COALESCE(blockingl.relation::regclass::text, blockingl.locktype) AS locked_item,
         now() - blockeda.query_start AS waiting_duration,
         blockeda.pid AS blocked_pid,
         blockeda.query AS blocked_query,
         blockedl.mode AS blocked_mode,
         blockinga.pid AS blocking_pid,
         blockinga.query AS blocking_query,
         blockingl.mode AS blocking_mode
    FROM pg_catalog.pg_locks blockedl
    INNER JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
    INNER JOIN pg_catalog.pg_locks blockingl ON (
             ( (blockingl.transactionid = blockedl.transactionid) OR
               (blockingl.relation = blockedl.relation AND
                blockingl.locktype = blockedl.locktype) )
             AND blockedl.pid != blockingl.pid)
    INNER JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
             AND blockinga.datid = blockeda.datid
   WHERE NOT blockedl.granted
     AND blockinga.datname = current_database();

#5 It is production; we will not be able to attach the process to gdb.

#6 My main concern is that nothing shows up in the pg_locks table. Is it
advisable to use an UPDATE statement on the pg_class.reloptions column
rather than using the ALTER TABLE SET option?

-Dipesh

Re: BUG #14938: ALTER TABLE hang/ poor performance

From
Tomas Vondra
Date:
On 12/01/2017 06:37 PM, Dipesh Kamdar wrote:
> #1    9.5.3 was released more than a year ago, you might be hitting one
> of the
> bugs fixed since then.
> 
>>>> Do you know any bug related ?
> 

No. Considering how little we know about the issue you're observing,
it's quite impossible to point to any particular bugfix.

But it's a good practice to run current version, and it's rather
annoying to help people only to discover they've been hitting an issue
fixed ages ago. To the extent that it's likely discouraging people from
investigating the bug report.

> #2 What do you mean by deadlock? Moreover, autovacuum certainly should not
> deadlock with anything (certainly not with DELETE)
> 
>>>> Please take a look at the sequence:
>     T1: deleting a batch of 50K rows from TAB1
>     T2: autovacuum checks TAB1 (since the delete is running, it skips
>         the table)
>     T3: delete completed
>     T4: before the next delete on the same table starts, autovacuum
>         starts on TAB1; the DELETE statement then waits for autovacuum
>         to release its lock. Some time later a deadlock is reported in
>         the log and the DELETE process is killed.
> 

DELETE acquires ROW EXCLUSIVE lock
VACUUM acquires SHARE UPDATE EXCLUSIVE lock

These two lock modes do not conflict with each other, i.e. VACUUM (or
autovacuum) will not wait for DELETE (and vice versa).
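
This is easy to verify on a scratch table (a sketch; the table and column
names are made up):

```sql
-- Session 1: DELETE holds ROW EXCLUSIVE on the table
BEGIN;
DELETE FROM scratch_table WHERE id < 1000;
-- keep the transaction open

-- Session 2: VACUUM takes SHARE UPDATE EXCLUSIVE and completes
-- without waiting for session 1
VACUUM scratch_table;

-- Either session: both relation locks show granted = true
SELECT pid, mode, granted
  FROM pg_locks
 WHERE relation = 'scratch_table'::regclass;
```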

Perhaps you're running the DELETE in a transaction, and that transaction
has acquired other (stronger) locks? Those locks will be held until the
end of the transaction, but that has nothing to do with DELETE.

Show us what locks the two sessions hold.

> #3  I also don't quite understand why you do the delete in batches of 50k
> rows, to be honest
> 
>>> It is a client-facing application with close to 1.5 billion records,
> with multiple FKs and indexes. Removing millions of records in a single
> statement means putting the entire online application on fire.
> 
> 

OK, so you're using this to throttle the DELETE process.
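
For the record, such a throttled delete is typically shaped like this (a
sketch with made-up table/column names; the predicate and batch size are
assumptions):

```sql
-- Delete in bounded batches so each transaction stays short and other
-- sessions (including autovacuum) get a chance between batches.
-- Re-run until it reports DELETE 0.
DELETE FROM big_table
 WHERE ctid IN (SELECT ctid
                  FROM big_table
                 WHERE created_at < now() - interval '90 days'
                 LIMIT 50000);
```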

> #4 I am running the following SQL as a monitoring point every 10 minutes.
> Let me know if anything is missing from the monitoring script.
> 
>   SELECT COALESCE(blockingl.relation::regclass::text, blockingl.locktype) AS locked_item,
>          now() - blockeda.query_start AS waiting_duration,
>          blockeda.pid AS blocked_pid,
>          blockeda.query AS blocked_query,
>          blockedl.mode AS blocked_mode,
>          blockinga.pid AS blocking_pid,
>          blockinga.query AS blocking_query,
>          blockingl.mode AS blocking_mode
>     FROM pg_catalog.pg_locks blockedl
>     INNER JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
>     INNER JOIN pg_catalog.pg_locks blockingl ON (
>              ( (blockingl.transactionid = blockedl.transactionid) OR
>                (blockingl.relation = blockedl.relation AND
>                 blockingl.locktype = blockedl.locktype) )
>              AND blockedl.pid != blockingl.pid)
>     INNER JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
>              AND blockinga.datid = blockeda.datid
>    WHERE NOT blockedl.granted
>      AND blockinga.datname = current_database();
> 

I don't care about a monitoring query that runs every 10 minutes. The
deadlock detection runs after a 1-second lock wait by default, which means
the monitoring query has pretty much no chance of observing that.

You need to simulate the issue and collect data from pg_locks while it's
happening. And collect everything from pg_locks - the lock dependencies
may easily be more complex.

It's also a good idea to share the deadlock log message, and info what
the processes were doing.
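
A complete snapshot can be as simple as (a sketch; run it while the ALTER
TABLE is stuck and keep the raw output):

```sql
-- Dump every lock together with what the holding/waiting backend is doing.
SELECT l.pid, l.locktype, l.relation::regclass AS relation,
       l.mode, l.granted,
       now() - a.query_start AS query_age, a.query
  FROM pg_locks l
  LEFT JOIN pg_stat_activity a ON a.pid = l.pid
 ORDER BY l.granted, l.pid;
```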

> 
> #5 It is production , We will not able to attach process to gdb.
> 

OK. Assuming it's a locking issue, we should be able to debug it using
just the pg_locks data.

>
> #6 My main concern is that nothing shows up in the pg_locks table.
> 
> Is it advisable to use an UPDATE statement on the pg_class.reloptions
> column rather than using the ALTER TABLE SET option?
> 

No. Touching the catalogs directly is a pretty bad idea. The locking is
there for a reason.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14938: ALTER TABLE hang/ poor performance

From
Dipesh Kamdar
Date:
Here is a log message from a few days back. I removed the actual table names and SQL but kept the actual messages.

log file msg:

[2017-11-30 13:24:34.513 GMT]: LOG:  duration: 49082304.526 ms  execute <unnamed>: ALTER TABLE  tablename1  SET (autovacuum_enabled= FALSE , toast.autovacuum_enabled=FALSE )
[2017-11-30 13:24:35.517 GMT]: LOG:  duration: 30496621.974 ms  parse <unnamed>: SELECT  column list FROM tablename2 INNER JOIN tablename1 ON condition1.
[2017-11-30 13:37:10.106 GMT]: LOG:  duration: 31239025.847 ms  execute <unnamed>: ANALYZE




We have following setting enable in postgres config file.
log_min_duration_statement = 300000    # 5min 
deadlock_timeout = 1s   # default

The database monitoring script is not reporting any process waiting for a lock. In this case no deadlock occurs, but the statement hangs for almost 12 hours. I am seeing this pattern once a day in the log file, with other tables, during the same batch job run. No other SQL reported in the log took more than 5 minutes.

Why don't we see a deadlock? Why don't we see any lock waiting in the pg_locks table?


Thanks!
Dipesh


Re: BUG #14938: ALTER TABLE hang/ poor performance

From
Tomas Vondra
Date:
On 12/06/2017 12:13 AM, Dipesh Kamdar wrote:
> Here is a log message from a few days back. I removed the actual table
> names and SQL but kept the actual messages.
> 
> log file msg:
> 
> [2017-11-30 13:24:34.513 GMT]: LOG:  duration: 49082304.526 ms  execute
> <unnamed>: ALTER TABLE  tablename1  SET (autovacuum_enabled= FALSE ,
> toast.autovacuum_enabled=FALSE )
> [2017-11-30 13:24:35.517 GMT]: LOG:  duration: 30496621.974 ms  parse
> <unnamed>: SELECT  column list FROM tablename2 INNER JOIN tablename1
> ON condition1.
> [2017-11-30 13:37:10.106 GMT]: LOG:  duration: 31239025.847 ms
> execute <unnamed>: ANALYZE
> 

Obviously, you have other long-running queries (e.g. the join query),
not just the ALTER TABLE.

> 
> We have the following settings enabled in the postgres config file:
> log_min_duration_statement = 300000    # 5min
> deadlock_timeout = 1s   # default

Perhaps try also "log_lock_waits = on".
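
That is, something like this in postgresql.conf (a sketch; both settings
take effect on a config reload, no restart needed):

```
# Log a message whenever a session waits longer than deadlock_timeout
# for a lock -- this catches lock waits that a 10-minute poll would miss.
log_lock_waits = on
deadlock_timeout = 1s
```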

> The database monitoring script is not reporting any process waiting for
> a lock. In this case no deadlock occurs, but the statement hangs for
> almost 12 hours. I am seeing this pattern once a day in the log file,
> with other tables, during the same batch job run. No other SQL reported
> in the log took more than 5 minutes.
> 
> Why don't we see a deadlock? Why don't we see any lock waiting in the
> pg_locks table?
> 

Well, that's really hard to say without you showing us the pg_locks
contents (instead of just telling us there's nothing suspicious in it).
Also, pg_stat_activity collected at the same time would be useful.

Other than that, you can attach gdb to the waiting process, and see
where exactly it's waiting (collect backtrace using "bt").


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services