Thread: Rapid disk usage spikes when updating large tables with GIN indexes

Rapid disk usage spikes when updating large tables with GIN indexes

From: Jonathan Marks
Hello!

We have a mid-sized database on RDS running PostgreSQL 10.1 (32 cores, 240 GB RAM, 5 TB total disk space, 20k PIOPS) with several
large (100 GB+, tens of millions of rows) tables that use GIN indexes for full-text search. We at times need to index
very large (hundreds of pages) documents, and as a result our tables contain a mix of small (tens of tokens) and very large
(hundreds of thousands of tokens, near the 1 MB tsvector limit) documents. All our GIN indexes have fastupdate turned off — we
found that turning fastupdate on led to significant blocking and that we get better average performance with it turned
off. We’ve put a lot of effort into tuning our database over the last several years, to the point where we have
acceptable read and write performance for these tables.
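
For reference, we turn it off per index with the usual reloption — the index name here is just an illustrative stand-in for our real ones:

  ALTER INDEX documents_tsv_idx SET (fastupdate = off);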

One recurring, and predictable, issue that we have experienced regularly for multiple years is that inserting or
updating rows in any table with GIN indexes results in extremely large drops in free disk space — i.e. inserting 10k
rows with a total size of 10 GB can result in the temporary loss of several hundred gigabytes of free disk space over 2-3
hours (and it could be more — we try to keep a 10-15% buffer of free disk space, so that often represents almost all
available disk space). Once we stop the operation, free disk space rapidly recovers, which makes us believe that this
occurs due to logs, or some kind of temporary table. Our work_mem and maintenance_work_mem settings are pretty large
(12 GB and 62 GB, respectively). The database’s size on disk scarcely budges during this process.

Unfortunately, we’re on RDS, so we’re unable to ssh directly into the instance to see which files are so large, and none
of the logs we can see (nor the WAL logs) are large enough to explain this. Any suggestions about where to look
to find the cause of this problem (or about any settings we can tune or changes we could make to stop it) would be
greatly appreciated.

Thank you!

Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Tom Lane

Jonathan Marks <jonathanaverymarks@gmail.com> writes:
> One recurring, and predictable, issue that we have experienced regularly for multiple years is that inserting or
> updating rows in any table with GIN indexes results in extremely large drops in free disk space — i.e. inserting 10k
> rows with a total size of 10 GB can result in the temporary loss of several hundred gigabytes of free disk space over
> 2-3 hours (and it could be more — we try to keep a 10-15% buffer of free disk space, so that often represents almost
> all available disk space). Once we stop the operation, free disk space rapidly recovers, which makes us believe that
> this occurs due to logs, or some kind of temporary table. Our work_mem and maintenance_work_mem settings are pretty
> large (12 GB and 62 GB, respectively). The database’s size on disk scarcely budges during this process.

I'm not following exactly what you mean by "the database’s size on
disk scarcely budges" --- how does that square with the free disk space
dropping?  (IOW, what are you measuring?)

If you're not including WAL space in the "database size", then perhaps
a plausible theory is that the space consumption comes from a burst of
WAL output, and that the space is freed after the WAL has been dumped
off to secondary servers or archived or whatever you do with it.
If you do none of those things, it'd reduce to being an issue of how
long till the next checkpoint.

Assuming this theory is accurate, probably your use of fastupdate = off
is a contributing factor, as that causes a lot more "churn" in the
internals of the GIN indexes during updates, and correspondingly more
WAL output to log the changes.  But you probably don't want to revisit
that decision if you're happy with performance otherwise.
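
If you want to quantify how much WAL a given batch really emits, one crude way (assuming you can bracket one of your insert/update runs) is to note the WAL insert position before and after and diff the two:

  -- before the batch:
  SELECT pg_current_wal_lsn();
  -- after the batch, plugging in the LSN captured above (placeholder value shown):
  SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/16B3748'));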

If you are archiving or streaming WAL, then probably what you want to
look at is getting rid of bottlenecks in that, so that it can keep up
with these WAL-generation spikes better.
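
On a stock installation the archiver and walsender stats views are where I'd look for that; whether RDS lets you see anything useful in them, I don't know:

  SELECT archived_count, last_archived_time, failed_count, last_failed_time
  FROM pg_stat_archiver;

  SELECT application_name, state, sent_lsn, replay_lsn
  FROM pg_stat_replication;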

If you're not, the only suggestion I can think of is to try twiddling
your checkpoint parameters to alleviate the space spikes.  Reducing
the checkpoint interval would do that, but you have to be very wary
of going too far; a short checkpoint interval results in more full-page
images being emitted to WAL and thus can actually increase your WAL
space consumption.  Depending on what parameters you're using now,
maybe even an increase would be better.
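
Something along these lines would show what you're currently working with:

  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
                 'max_wal_size', 'min_wal_size');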

            regards, tom lane


Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Tom Lane
Date: May 14, 2018, 3:08 PM

[ please keep the list cc'd ]

Jonathan Marks <jonathanaverymarks@gmail.com> writes:
> Thanks for your quick reply. Here’s a bit more information:
> 1) to measure the “size of the database” we run something like `select datname,
> pg_size_pretty(pg_database_size(datname)) from pg_database;` — I’m not sure if this includes WAL size.
> 2) I’ve tried measuring WAL size with `select sum(size) from pg_ls_waldir();` — this also doesn’t budge.
> 3) Our current checkpoint_timeout is 600s with a checkpoint_completion_target of 0.9 — what does that suggest?

Hmph.  Your WAL-size query seems on point, and that pretty much destroys
my idea about a WAL emission spike.

pg_database_size() should include all regular and temporary tables/indexes
in the named DB.  It doesn't include WAL (but we've eliminated that), nor
cluster-wide tables such as pg_database (but those seem pretty unlikely
to be at issue), nor non-relation temporary files such as sort/hash temp
space.  At this point I think we have to focus our attention on what might
be creating large temp files.  I do not see anything in the GIN index code
that could do that, especially not if you have fastupdate off.  I wonder
whether there is something about the particular bulk-insertion queries
you're using that could result in large temp files --- which'd make the
apparent correlation with GIN index use a mirage, but we're running out
of other ideas.  You could try enabling log_temp_files to see if there's
anything to that.
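
On a self-managed install that would be something like the statements below; on RDS I believe you'd have to make the equivalent change through the parameter group:

  ALTER SYSTEM SET log_temp_files = 0;  -- 0 = log every temp file, regardless of size
  SELECT pg_reload_conf();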

In the grasping-at-straws department: are you quite sure that the extra
disk space consumption is PG's to begin with, rather than something
outside the database entirely?

            regards, tom lane


Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Jonathan Marks
Date: May 14, 2018, 3:31 PM
We’ll turn on log_temp_files and get back to you to see if that’s the cause. Re: the exact queries — these are just
normal INSERTs and UPDATEs. This occurs as part of normal database operations — i.e., we are processing 10% of a table
and marking changes to a particular row, or happen to be inserting 5-10% of the table volume with new rows. Whenever we
bulk load we have to drop the indexes because the disk space loss just isn’t tenable.
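
Concretely, our bulk-load path is roughly the following (table and index names here are invented for the example):

  DROP INDEX IF EXISTS documents_tsv_idx;
  -- ... bulk INSERT / COPY into documents ...
  CREATE INDEX CONCURRENTLY documents_tsv_idx
      ON documents USING gin (tsv) WITH (fastupdate = off);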

Re: extra disk space consumption not within PG — the AWS folks can’t tell me what the problem is because it’s all
internal to the PG part of the instance they can’t access. Doesn’t mean your last suggestion can’t be the case, but it
makes it slightly less likely.

Any chance that GIN indexes are double-logging? I.e. with fastupdate off they are still trying to keep track of the
changes in the pending list or something?

Our thought has been temp files for a while, but we’re not sure what we should do if that turns out to be the case.




Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Jonathan Marks
Hi Tom —

We turned on log_temp_files and since the last stats reset (about a week ago) we’re seeing 0 temp files altogether
(grabbing that info from pg_stat_database).
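
The check we’re running is just something like:

  SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes, stats_reset
  FROM pg_stat_database;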

So, as far as we know:

1) It’s not WAL
2) It’s not tempfiles
3) It’s not the size of the error logs
4) It’s not the size of the actual rows in the database or the indexes

Another thread we found suggested pg_subtrans — this seems less likely because we’ve been able to replicate this across
many different types of connections etc. but thought it might be a potential source.

Any other system-monitoring queries that we can run that might further illuminate the issue?

Thank you!



Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Tom Lane

Jonathan Marks <jonathanaverymarks@gmail.com> writes:
> We turned on log_temp_files and since the last stats reset (about a week ago) we’re seeing 0 temp files altogether
> (grabbing that info from pg_stat_database).

Hm.

> Another thread we found suggested pg_subtrans — this seems less likely because we’ve been able to replicate this
> across many different types of connections etc. but thought it might be a potential source.

We're running out of other ideas, so maybe, but it's kind of hard to
credit large numbers of gigabytes going into pg_subtrans et al.
Still, you could easily adapt your WAL-size query to track the sizes
of other DB subdirectories and see if anything springs out.  Since
I'm really feeling a bit baffled at this point, I'd suggest watching
all of them:

pg_commit_ts/
pg_dynshmem/
pg_logical/mappings/
pg_logical/snapshots/
pg_multixact/members/
pg_multixact/offsets/
pg_logical/
pg_multixact/
pg_notify/
pg_replslot/
pg_serial/
pg_snapshots/
pg_stat/
pg_stat_tmp/
pg_subtrans/
pg_tblspc/
pg_twophase/
pg_wal/
pg_xact/
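
I don't think v10 has a canned size function for arbitrary subdirectories, but pg_ls_dir() plus pg_stat_file() should get you there, assuming your RDS role is allowed to call them — for example, for pg_subtrans (adapt the path for the others):

  SELECT pg_size_pretty(sum((pg_stat_file('pg_subtrans/' || f)).size)) AS total
  FROM pg_ls_dir('pg_subtrans') AS t(f);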

            regards, tom lane


Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Nikolay Samokhvalov
Date: May 16, 2018, 4:31 PM
Why not set up a spot EC2 instance with Postgres 10.1, load the database from a dump (yes, you’ll need to create one from RDS because they don’t provide direct access to dumps/backups; you probably only need the specific tables involved), and repeat your actions while closely watching the filesystem?



Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Jonathan Marks
I believe that we have figured it out. It was indeed a WAL issue — the WAL wasn’t getting measured because it had been moved into an archive folder.

We resolved this in two main ways:

1. By dramatically increasing max_wal_size to decrease the frequency of checkpoints
2. By turning on wal_compression

We’re going to continue looking into whether we can reduce WAL generation / write volume further by turning fastupdate back on for our GIN indexes.
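
For anyone who hits this later, the plain-SQL equivalents of what we changed are roughly the following — on RDS we actually applied them through the parameter group, and the values shown are illustrative rather than the exact ones we settled on:

  ALTER SYSTEM SET max_wal_size = '32GB';   -- much larger than before, so checkpoints are far less frequent
  ALTER SYSTEM SET wal_compression = on;    -- compress full-page images written to WAL
  SELECT pg_reload_conf();

  -- and the follow-up experiment, per GIN index (index name is illustrative):
  ALTER INDEX documents_tsv_idx SET (fastupdate = on);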
