BUG #15778: Replication slot peek query causes memory spike at the server when big transactions are running - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #15778: Replication slot peek query causes memory spike at the server when big transactions are running
Date
Msg-id 15778-0d88c55df57eb082@postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15778
Logged by:          Nitesh Yadav
Email address:      nitesh@datacoral.co
PostgreSQL version: 9.6.2
Operating system:   x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
Description:

Hi, 

Postgres Server setup: 
  The Postgres server is running as an AWS RDS instance. 
  The server version is PostgreSQL 9.6.2 on x86_64-pc-linux-gnu, compiled by
gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit.
  In the RDS parameter group, rds.logical_replication is set to 1, which
internally sets the following parameters: wal_level, max_wal_senders,
max_replication_slots, max_connections.
  We use the test_decoding module to read WAL data through the logical
decoding mechanism (a slot-creation sketch follows below).
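
For context, here is a minimal sketch of how such a slot is typically created
with the test_decoding plugin (the slot name matches the one used in our
application queries; the exact creation call is an assumption about our
setup, not taken from the server logs):

  -- Hypothetical slot creation; our actual slot was created earlier with
  -- this plugin.
  SELECT * FROM pg_create_logical_replication_slot('pgldpublic_cdc_slot',
                                                   'test_decoding');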

Application setup: 
  Periodically we run a peek query to retrieve data from the slot, e.g.:
SELECT * FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL,
NULL, 'include-timestamp', 'on') LIMIT 200000 OFFSET 0; 
  From the result of that query, we use the location of the last transaction
to remove the consumed data from the slot, e.g.:
SELECT location, xid FROM
pg_logical_slot_get_changes('pgldpublic_cdc_slot', 'B92/C7394678',
NULL, 'include-timestamp', 'on') LIMIT 1; 
  We run steps 1 and 2 in a loop, reading data in chunks of 200K records at a
time in a given process (see the sketch below). 
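
To make the loop explicit, here is a minimal sketch of one iteration of the
peek-then-consume pattern described above (the slot name and the upto_lsn
value 'B92/C7394678' are placeholders taken from the examples above):

  -- Step 1: peek at up to 200K changes without consuming them from the slot.
  SELECT location, xid, data
  FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL, NULL,
                                    'include-timestamp', 'on')
  LIMIT 200000 OFFSET 0;

  -- Step 2: after processing the peeked rows, advance the slot past the
  -- location of the last transaction returned in step 1, which removes
  -- that data from the slot.
  SELECT location, xid
  FROM pg_logical_slot_get_changes('pgldpublic_cdc_slot', 'B92/C7394678',
                                   NULL, 'include-timestamp', 'on')
  LIMIT 1;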

------------------------------------------------------------------------------
Behavior reported (Bug): 
------------------------------------------------------------------------------

Client Side timeline: 
------------------------------------------------------------------------------
We have a process with PID 10032, which started at 2019-04-24T06:47:01.877Z.
At 2019-04-24T06:47:01.893Z it invoked the following peek query:
SELECT * FROM pg_logical_slot_peek_changes('silo2_cdc_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
Until 2019-04-24T07:01:41.852Z there was no response from the server, at
which point we killed the process. 

Server side timeline: 
------------------------------------------------------------------------------
On the server, a large transaction had been running since 2019-04-24T06:00:00Z.

At 2019-04-24T06:45:38Z the slot size was around 2 GB; at that moment the
peek query was taking around 20 seconds to respond.
At 2019-04-24T06:47:00Z the memory usage of the PG server process with
PID 10032 started growing drastically.
At 2019-04-24T06:54:00Z the memory usage of that process had grown to about
20 GB.
At 2019-04-24T06:58:00Z the memory usage of that process had grown to about
41 GB.
At 2019-04-24T07:05:40Z the memory usage of that process was holding steady
at around 40.98 GB.

Following are the anonymized PG server logs:
------------------------------------------------------------------------------
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(22837):abc@xyz:[6989]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:DETAIL:
streaming transactions committing after 1003/499E5518, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:STATEMENT:
 SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:45:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:STATEMENT:
 SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:46:00
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
duration: 22294.217 ms  execute <unnamed>: SELECT * FROM
pg_logical_slot_peek_changes('pg_slot', NULL, NULL, 'include-timestamp',
'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:DETAIL:
streaming transactions committing after 1003/499E5518, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:STATEMENT:
 SELECT xid FROM pg_logical_slot_get_changes('pg_slot', '1003/9B28D200',
NULL, 'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:46:38
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:STATEMENT:
 SELECT xid FROM pg_logical_slot_get_changes('pg_slot', '1003/9B28D200',
NULL, 'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(50260):abc@xyz:[6988]:LOG:
duration: 23156.543 ms  statement: SELECT xid FROM
pg_logical_slot_get_changes('pg_slot', '1003/9B28D200', NULL,
'include-timestamp', 'on') LIMIT 1;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(46673):abc@xyz:[10034]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:LOG:
starting logical decoding for slot "pg_slot"
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:DETAIL:
streaming transactions committing after 1003/9B28D200, reading WAL from
FFB/F8F0F3E8
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:STATEMENT:
 SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:LOG:
logical decoding found consistent point at FFB/F8F0F3E8
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:DETAIL:
Logical decoding will begin using saved snapshot.
2019-04-24 06:47:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(10137):abc@xyz:[10032]:STATEMENT:
 SELECT * FROM pg_logical_slot_peek_changes('pg_slot', NULL, NULL,
'include-timestamp', 'on') LIMIT 500000 OFFSET 0;
2019-04-24 06:50:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(60004):abc@xyz:[18122]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 06:55:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(29409):abc@xyz:[31637]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:00:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(43301):abc@xyz:[5247]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:02:01
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(46673):abc@xyz:[10034]:LOG:
could not receive data from client: Connection reset by peer
2019-04-24 07:05:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(59862):abc@xyz:[10813]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)
2019-04-24 07:10:14
UTC:ec2-xx-xx-xx-xx.compute-1.amazonaws.com(15221):abc@xyz:[16183]:LOG:
connection authorized: user=abc database=xyz SSL enabled (protocol=TLSv1.2,
cipher=XXXXX-YYY-AES256-GCM-ABC123, compression=off)

We have the following questions: 
------------------------------------------------------------------------------
1) What causes the memory usage spike at the server? Is it the peek query,
or the writes of the big transaction running on the server?
2) Why did the peek query take so long at 2019-04-24T06:47:00Z, whereas it
finished within about 20 s when last run at 2019-04-24T06:45:38.516Z?
3) How can we restrict/limit the memory usage of peek queries from the
server side? Please suggest any configuration parameters we can use.


Regards,
Nitesh

