Thread: Slow WAL recovery for DROP TABLE

Slow WAL recovery for DROP TABLE

From
Sherrylyn Branchaw
Date:
We are running Postgres 9.6.9 on RHEL 6.9. We're using built-in streaming replication, with a WAL archive for fallback purposes. No logical replication.

We recently had a bug in our code accidentally create several hundred thousand tables in a single database. A batch job started cleaning them up one at a time (separate transactions), over the course of a couple nights.

But what we noticed the next day was that the 2 standbys for this database were falling further and further behind. They were no longer streaming; instead, they were recovering from the archive, and doing so very slowly.

Only a small amount of WAL was generated by the DROP TABLEs, relative to normal traffic, but recovering a single WAL file sometimes took over a minute, compared to about 1 second normally. Meanwhile, new WALs were being generated by regular user traffic, and the standbys fell 24 hours behind, with the gap still widening, before we resynced them rather than replay the WALs. That did the trick, but the performance drop seemed drastic enough to be worth reporting.

I was able to reproduce the slow recovery time of WALs generated by DROP TABLE on a pair of test servers.

Normal writes (INSERT, UPDATE, DELETE, CREATE TABLE) generated WAL files that were recoverable on the order of 1 second each. The standby never fell behind, even when hundreds of WAL files were generated quickly. But when I generated WALs exclusively from DROP TABLE, it took 20-80 seconds to recover each one.

Some benchmarks:

For 10 minutes, I issued a bunch of rapid-fire INSERT statements to server1, the primary. These generated 144 WAL files. At the end of the 10 minutes, server2, the standby, had already recovered 143 of them and immediately completed the last one.

For 10 minutes, I issued a bunch of rapid-fire DROP TABLE statements to server1, the primary. These generated 28 WAL files. At the end of the 10 minutes, server2, the standby, had only recovered 16 of them.

My expectation is that replaying binary changes on a standby should be at least as fast and sometimes faster than generating them locally. This is because when a logical command like "DROP TABLE" or "INSERT" is issued on a primary, the database has to check for dependencies, satisfy constraints, etc. But once all those decisions are made on the primary, the binary changes should be replayable much faster on the standby.

For example, server1, the primary in my test pair, has significantly better hardware than server2, the standby. If server2 is keeping up with server1 in replication, I predict that server2 should be able to get through the same amount of writes faster as a replica than as a standalone.

This expectation was borne out in my testing for INSERTs but not DROP TABLEs, per the following benchmarks.

I issued 9,000,000 INSERTs to server1, the primary. It took 10 minutes to complete these. At the end of the 10 minutes, server2, the standby, was caught up. When I made server2 a standalone, it took 15 minutes to get through the same 9,000,000 INSERTs. In conclusion, being a standalone was 50% slower for server2 when it came to INSERT.

I issued 100,000 DROP TABLEs to server1, the primary. It took 10 minutes to complete these on server1, and 20 minutes for server2, the standby, to catch up. When I made server2 a standalone, it took 18 minutes to get through the same 100,000 DROP TABLEs. In conclusion, being a standalone was 10% faster for server2 when it came to DROP TABLE.

It seems that there is something especially inefficient about WAL files generated by DROP TABLE, or the process of recovering them.

I also want to mention that recovering a WAL file generated by DROP TABLE commands consistently takes approximately 100% CPU on the standby, both in testing and in prod. In prod, we also saw high memory use for that process.

The standby with 128 GB of RAM showed this:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                
23279 postgres  20   0 57.0g  50g 5.1g R 100.0 40.0   1723:15 postgres: startup process   recovering 00000002000003B10000008D

The standby with 32 GB of RAM crashed overnight with an out-of-memory error.

When I restarted the servers and let them resume recovering WALs, I watched %MEM and RES grow slowly and steadily before I gave up and resynced. No memory appeared to be freed up after each WAL file was replayed.

I was not able to reproduce memory usage getting quite this high in the time I was allotted for testing, but I did witness steady climbing up to almost 20%, especially when I mixed DROP TABLEs with regular writes. I suspect there is a sweet (or bitter?) spot of write activity we hit in our prod workload that I was unable to reproduce in testing.

I'm not sure if this is the right mailing list to report this behavior to, but I thought it was the best place to start. Let me know if there's any other information I should provide, or if I should send it to another list.

Thanks,
Sherrylyn

Re: Slow WAL recovery for DROP TABLE

From
Jeff Janes
Date:

There was a recent commit for a similar performance problem, which will appear in 9.6.10.  But that was specifically for cases where there were multiple dropped tables per transaction, and large shared_buffers.

I can't reproduce your single-drop-per-transaction problem.  The replica has no problem keeping up with the master.

Can you share the reproduction scripts, and any non-default config settings?  Especially the setting of shared_buffers (on both master and replica, if different)

Cheers,

Jeff

RE: Slow WAL recovery for DROP TABLE

From
"Jamison, Kirk"
Date:

Hi, I have also reported a similar problem in the hackers mailing list, but particularly on TRUNCATE TABLE.
https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C62FD6E6%40g01jpexmbkw24


The problem lies with the standby server’s replay: it performs a separate scan of the entire shared buffer pool for each DROP/TRUNCATE TABLE, in order to check whether any pages of the table being dropped are cached. Recovery therefore takes a long time, and can sometimes fail for large tables, depending on the shared_buffers size.
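As a toy illustration of why this is slow (a cost model only, not PostgreSQL source code): the replay cost scales with the number of drops times the number of buffer headers, independent of how much data the dropped tables actually held.

```python
# Toy cost model of DROP TABLE replay on a standby -- NOT PostgreSQL code.
# Replaying each drop scans every shared-buffer header to invalidate any
# cached pages of the dropped relation, so even empty tables pay full price.

PAGE_SIZE = 8192                            # default PostgreSQL block size
N_BUFFERS_8GB = 8 * 1024**3 // PAGE_SIZE    # shared_buffers = 8GB -> ~1M buffers

def replay_header_visits(n_drops: int, n_buffers: int) -> int:
    """Buffer headers visited while replaying n_drops DROP TABLE records."""
    return n_drops * n_buffers              # one full scan per dropped relation

# The 100,000-table cleanup from the original post, against 8GB shared_buffers:
print(replay_header_visits(100_000, N_BUFFERS_8GB))   # 104,857,600,000 visits
```

This is why the slowdown hits regardless of table size: the per-drop cost is set by shared_buffers, not by the table.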

The main problem here is the scanning of shared_buffers, which affects not only DROP/TRUNCATE TABLE but also DROP DATABASE and VACUUM. Other developers have contributed ideas and have been considering a buffer-mapping implementation for PG12 that would solve all of these, but it’s still a long way off.
But I think any working minor solutions/fixes from developers are also welcome, such as the recent committed patch for the multiple dropped tables per transaction with large shared_buffers.

Regards,
Kirk Jamison

 


Re: Slow WAL recovery for DROP TABLE

From
Sherrylyn Branchaw
Date:
> There was a recent commit for a similar performance problem, which will appear in 9.6.10.  But that was specifically for cases where there were multiple dropped tables per transaction, and large shared_buffers.

Interesting, and good to know, thanks! I'm not sure we fall under either (is 8 GB large? It's larger than the default, but I always thought large was defined as "more than 8GB" for this setting), but it sounds like this sort of problem is on the developers' radars. It's possible up to 4 tables were dropped per transaction in prod, but I don't know if that's enough to count as "multiple", and in testing, I reproduced the problem with 1 drop per transaction.

> I can't reproduce your single-drop-per-transaction problem.  The replica has no problem keeping up with the master.

It's possible that the problem only occurs when the replica is on inferior hardware; I was unable to test equal servers in the time I had. When the superior server was the standby, it was able to keep up with the inferior primary. Dropping tables was the only action the inferior server couldn't keep up with as a standby, and the only action for which the standalone outperformed the replica. I did not test truncates; it's possible I would have seen the same problem with them.

> Can you share the reproduction scripts

For the table drops, I prepped by running these:

CREATE TABLE IF NOT EXISTS test1 (id int);
CREATE TABLE IF NOT EXISTS test2 (id int);
CREATE TABLE IF NOT EXISTS test3 (id int);
...
CREATE TABLE IF NOT EXISTS test100000 (id int);

Then I dropped with these:

DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
DROP TABLE IF EXISTS test3;
...
DROP TABLE IF EXISTS test100000;
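The two script files above can be generated mechanically; a short helper along these lines (hypothetical, filenames assumed) produces them:

```python
# Generate the CREATE and DROP scripts used in the benchmark.
# Filenames and N_TABLES are illustrative; adjust as needed.
N_TABLES = 100_000

create_sql = "\n".join(
    f"CREATE TABLE IF NOT EXISTS test{i} (id int);" for i in range(1, N_TABLES + 1)
)
drop_sql = "\n".join(
    f"DROP TABLE IF EXISTS test{i};" for i in range(1, N_TABLES + 1)
)

with open("create_tables.sql", "w") as f:
    f.write(create_sql + "\n")
with open("drop_tables.sql", "w") as f:
    f.write(drop_sql + "\n")
```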

For the inserts, I did a

CREATE TABLE test (content1 TEXT, content2 TEXT, content3 TEXT);

followed by 300 of these statements, which insert a random string into each of the three columns, in batches of 10000 rows:

INSERT INTO test
SELECT
    array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), ''),
    array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), ''),
    array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), '')
FROM (SELECT generate_series(1,10000)) foo;

I loaded and timed each script as a file using 

time psql -d mydb -f drop_tables.sql

> any non-default config settings?  Especially the setting of shared_buffers (on both master and replica, if different)

Our shared_buffers setting is 8 GB on all nodes.

In prod, we had a primary and two standbys. One standby was very similar to the primary in hardware, but older and with fewer CPUs. The other standby was far inferior in hardware. For testing, I used the two standbys, and was able to reproduce when I made the superior standby the primary.

These are the non-default settings on the primary and the standby of comparable hardware (other than changing file and directory paths, which shouldn't affect anything):

listen_addresses = '*'
log_destination = 'stderr, syslog'
log_filename = 'postgresql-%Y-%m-%d.log'
log_line_prefix = '%p [%m]: %u %d %q %h %a %v %x'
syslog_facility = 'local1'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_timezone = 'US/Central'
log_statement = 'ddl'
track_functions = pl
track_activity_query_size = 4096
timezone = US/Central
client_encoding = UTF8
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
max_connections = 1024
tcp_keepalives_idle = 7200
tcp_keepalives_interval = 75
tcp_keepalives_count = 9
work_mem                        = '48 MB'
maintenance_work_mem            = '1 GB'
max_locks_per_transaction = 8192
random_page_cost = 2.0
effective_cache_size = '94GB'
log_autovacuum_min_duration = 10s
autovacuum_naptime = 2min
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 250
autovacuum_freeze_max_age = 1000000000
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.01
wal_buffers = 16MB
bgwriter_lru_maxpages = 250
max_wal_senders = 5
wal_keep_segments = 256
hot_standby = on
log_min_duration_statement = 2s
wal_level = replica
wal_log_hints = on
archive_mode = on
archive_command = 'ssh postgres@backupserver "test ! -f /opt/backup/db1/%f" && rsync -q %p postgres@backupserver:/opt/backup/db1/%f'
max_parallel_workers_per_gather = 4
max_worker_processes = 16

The only different setting on the second standby is effective_cache_size, which is 24 GB instead of 94 GB.

To rule out fetching from the remote archive as the bottleneck, I tried scp-ing a bunch of WAL files to the standby and resetting the restore command to replay from the local archive. Same performance problem, and only when dropping tables.
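For anyone reproducing that last step: after copying the WAL files to a local directory on the standby, the restore command in recovery.conf just needs to point at that directory, along these lines (path illustrative):

```
restore_command = 'cp /path/to/local/archive/%f %p'
```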

Best,
Sherrylyn

Re: Slow WAL recovery for DROP TABLE

From
Sherrylyn Branchaw
Date:
> Hi, I have also reported a similar problem in the hackers mailing list, but particularly on TRUNCATE TABLE.
> https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C62FD6E6%40g01jpexmbkw24

Ooh, interesting. I admit I did not include TRUNCATE in my testing.

> The problem lies with the standby server’s replay as it does separate scanning of the whole shared buffer for each DROP/TRUNCATE TABLE in order to check if the table-to-delete is cached in shared buffer. Therefore, it will take a long recovery time and sometimes fail for large tables depending on shared_buffer size.

Also very interesting. We only (?) have 8 GB of shared buffers, and I see from your message that you had 300 GB. All of our tables, both in prod and in my reproduction, were empty, but there were hundreds of thousands of them.

> The main problem here is the scanning of shared_buffers, which not only affects drop/truncate table, but also drop database and vacuum as well.

I wondered about that. I didn't have any problem with a single drop database, but the database dropped was a small one (albeit one with a few hundred thousand empty tables), and I neither tested dropping a large database nor dropping 100,000 databases. I didn't test vacuuming, but we do heavy vacuuming on all our primaries frequently, and...hmm. Regular vacuuming doesn't cause any problems that have made it onto my radar, but VACUUM FULLs can cause WAL files to pile up in the primary's pg_xlog before being archived. I never investigated that, just throttled my VACUUM FULLs, because they're only ever run manually. I will keep an eye on the recovery time of individual files the next time I have to do this, which will probably be soon.

> But I think any working minor solutions/fixes from developers are also welcome, such as the recent committed patch for the multiple dropped tables per transaction with large shared_buffers.

Agreed. Should I have sent or should I still send this to pgsql-hackers? I wasn't sure, so I erred on the side of not bothering the developers until I'd gotten some feedback here.

Best,
Sherrylyn