Thread: Remove_temp_files_after_crash and significant recovery/startup time

Remove_temp_files_after_crash and significant recovery/startup time

From
"McCoy, Shawn"
Date:

I noticed that the new parameter remove_temp_files_after_crash is currently set to a default value of "true" in the version 14 release. It seems this was discussed in this thread [1], and it doesn't look to me like there's been a lot of stress testing of this feature.


In our fleet there have been cases where we have seen hundreds of thousands of temp files generated.  I found a case where we helped a customer that had a little over 2.2 million temp files.  Single-threaded cleanup of these takes a significant amount of time and delays recovery. In RDS, we mitigated this by moving the pgsql_tmp directory aside, starting the engine, and then separately removing the old temp files.
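For reference, here's a minimal sketch of that move-aside mitigation (the paths and data directory are illustrative assumptions, not the exact RDS procedure):

    # Sketch of the move-aside mitigation; PGDATA and all paths are
    # illustrative assumptions, not the exact RDS procedure.
    import os
    import shutil
    import subprocess

    PGDATA = "/var/lib/pgsql/data"                      # hypothetical data dir
    tmpdir = os.path.join(PGDATA, "base", "pgsql_tmp")  # default-tablespace temp dir
    aside = tmpdir + ".old"

    # 1. Move the temp directory aside so startup has nothing to scan.
    if os.path.isdir(tmpdir):
        os.rename(tmpdir, aside)

    # 2. Start the server; recovery no longer waits on temp-file cleanup.
    subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)

    # 3. Remove the old temp files separately, off the critical path.
    shutil.rmtree(aside, ignore_errors=True)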


After noticing the current plans to default this GUC to "on" in v14, I thought I'd raise the question of whether this should get a little more discussion or testing with higher numbers of temp files.


Regards,

Shawn McCoy

Database Engineer

Amazon Web Services


[1] https://www.postgresql.org/message-id/CAH503wDKdYzyq7U-QJqGn%3DGm6XmoK%2B6_6xTJ-Yn5WSvoHLY1Ww%40mail.gmail.com


Re: Remove_temp_files_after_crash and significant recovery/startup time

From
Tom Lane
Date:
"McCoy, Shawn" <shamccoy@amazon.com> writes:
> I noticed that the new parameter remove_temp_files_after_crash is currently set to a default value of "true" in the
> version 14 release. It seems this was discussed in this thread [1], and it doesn't look to me like there's been a lot of
> stress testing of this feature.

Probably not ...

> In our fleet there have been cases where we have seen hundreds of thousands of temp files generated.  I found a case
> where we helped a customer that had a little over 2.2 million temp files.  Single-threaded cleanup of these takes a
> significant amount of time and delays recovery. In RDS, we mitigated this by moving the pgsql_tmp directory aside,
> starting the engine, and then separately removing the old temp files.

TBH, I think the thing to be asking questions about is how come you had so
many temp files in the first place.  Sounds like something is misadjusted
somewhere.

            regards, tom lane



Re: Remove_temp_files_after_crash and significant recovery/startup time

From
Tomas Vondra
Date:
On 9/10/21 10:58 PM, McCoy, Shawn wrote:
> I noticed that the new parameter remove_temp_files_after_crash is 
> currently set to a default value of "true" in the version 14 release. It 
> seems this was discussed in this thread [1], and it doesn't look to me 
> like there's been a lot of stress testing of this feature.
> 

Not sure what we could learn from a stress test. IMHO it's fairly 
natural that if there are many temporary files and/or if deleting a file 
is expensive on a given filesystem, the cleanup may take time.

> In our fleet there have been cases where we have seen hundreds of 
> thousands of temp files generated.  I found a case where we helped a 
> customer that had a little over 2.2 million temp files.  Single-threaded 
> cleanup of these takes a significant amount of time and delays recovery. 
> In RDS, we mitigated this by moving the pgsql_tmp directory aside, 
> starting the engine, and then separately removing the old temp files.
> 
> After noticing the current plans to default this GUC to "on" in v14, 
> I thought I'd raise the question of whether this should get a little 
> more discussion or testing with higher numbers of temp files.
> 

I doubt we can learn anything new from such testing.

Of course, we can discuss the default for the GUC. I see it as a 
trade-off between the risk of running out of disk space and increased 
recovery time, and perhaps the decision to prioritize a lower risk of 
running out of disk space was not the right one ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove_temp_files_after_crash and significant recovery/startup time

From
"Euler Taveira"
Date:
On Fri, Sep 10, 2021, at 5:58 PM, McCoy, Shawn wrote:

I noticed that the new parameter remove_temp_files_after_crash is currently set to a default value of "true" in the version 14 release. It seems this was discussed in this thread [1], and it doesn't look to me like there's been a lot of stress testing of this feature.


In our fleet there have been cases where we have seen hundreds of thousands of temp files generated.  I found a case where we helped a customer that had a little over 2.2 million temp files.  Single-threaded cleanup of these takes a significant amount of time and delays recovery. In RDS, we mitigated this by moving the pgsql_tmp directory aside, starting the engine, and then separately removing the old temp files.

2.2 million temporary files? I'm wondering in what circumstances your system is
generating those temporary files. Low work_mem and thousands of connections?
Low work_mem and a huge analytic query? When I designed this feature I thought
about some extreme cases; that's why this behavior is controlled by a GUC. We
can probably add a sentence to the documentation that explains the recovery
delay caused by tens of thousands of temporary files.


After noticing the current plans to default this GUC to "on" in v14, I thought I'd raise the question of whether this should get a little more discussion or testing with higher numbers of temp files.


A backend crash is per se a rare condition (at least it should be). A crash
while there are millions of temporary files in your PGDATA is an even rarer
condition. I have seen several cases related to this issue, and none of them
generated millions of temporary files (at most a thousand). IMO the benefits
outweigh the issues, as I explained in [1]. Service continuity (for the vast
majority of cases) justifies turning it on by default.

If your Postgres instance is generating millions of temporary files, it seems
your setup needs some tuning.



--
Euler Taveira

Re: Remove_temp_files_after_crash and significant recovery/startup time

From
Jeremy Schneider
Date:
On 9/10/21 14:57, Tomas Vondra wrote:
> On 9/10/21 10:58 PM, McCoy, Shawn wrote:
>> I noticed that the new parameter remove_temp_files_after_crash is
>> currently set to a default value of "true" in the version 14 release.
>> It seems this was discussed in this thread [1], and it doesn't look to
>> me like there's been a lot of stress testing of this feature.
> 
> Not sure what we could learn from a stress test. IMHO it's fairly
> natural that if there are many temporary files and/or if deleting a file
> is expensive on a given filesystem, the cleanup may take time.

The thing that comes to mind for me is just getting a sense of what the
curve looks like for number of files versus startup time. If I can find
some time then I'll poke around and share numbers.
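In the meantime, a rough proxy is to time a single-threaded unlink pass
at increasing file counts (a sketch; absolute numbers will depend heavily
on the filesystem):

    # Rough benchmark: how long does a single-threaded unlink pass take
    # as the number of small files grows? Results are filesystem-dependent.
    import os
    import tempfile
    import time

    for count in (10_000, 100_000, 1_000_000):
        workdir = tempfile.mkdtemp(prefix="pgsql_tmp_bench_")
        for i in range(count):
            with open(os.path.join(workdir, f"pgsql_tmp{i}.0"), "wb") as f:
                f.write(b"x")  # tiny placeholder payload
        start = time.monotonic()
        for name in os.listdir(workdir):
            os.unlink(os.path.join(workdir, name))
        os.rmdir(workdir)
        print(f"{count} files removed in {time.monotonic() - start:.2f}s")

That should at least show whether the cost grows linearly or worse as
directory entries pile up.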

I remember a while ago working with a PostgreSQL user who had a major
outage crisis on their primary production database. They were having
some minor issues, and they decided to do a "quick" restart to see if it
would clear things out. The restart ended up taking something like a day
or two, and the business was down the whole time. Working with them to
figure out what was happening, we found out that their very-DDL-heavy
workload had combined with a stuck checkpointer process: no checkpoints
had been completed for over a week. The only choices were waiting for
WAL to replay or accepting data loss; we couldn't even get out of pain
with a restore from backup - since a restore still required replaying
all the same WAL.

There are certain core features in a database that really need to be
as reliable and robust as possible. IMO, for critical production
databases, quick-as-possible restarts are one of those.


>> In our fleet there have been cases where we have seen hundreds of
>> thousands of temp files generated.  I found a case where we helped a
>> customer that had a little over 2.2 million temp files.
>> Single-threaded cleanup of these takes a significant amount of time
>> and delays recovery. In RDS, we mitigated this by moving the
>> pgsql_tmp directory aside, starting the engine, and then separately
>> removing the old temp files.
>>
>> After noticing the current plans to default this GUC to "on" in v14,
>> I thought I'd raise the question of whether this should get a little
>> more discussion or testing with higher numbers of temp files.
>>
> 
> I doubt we can learn anything new from such testing.
> 
> Of course, we can discuss the default for the GUC. I see it as a
> trade-off between the risk of running out of disk space and increased
> recovery time, and perhaps the decision to prioritize a lower risk of
> running out of disk space was not the right one ...

I'm doing a little asking around among colleagues. I'm having trouble
finding cases where people went end-to-end and figured out exactly what
in the workload was causing the high number of temp files. However,
there seem to be a fair number of incidents with temp file counts in
the hundreds of thousands.

One thing that seems possible is that in some of these cases, the temp
files were accumulating across many engine crashes - those cases would
not be an issue once you started cleaning up on every restart. However,
I suspect there are still cases where high connection counts and some
erratic workload characteristic or bug cause accumulation without
multiple crashes. If I learn more, I'll pass it along.

Frankly, if the GUC defaults to off, then we're a lot less likely to
find out if there /are/ issues. Kinda like LLVM and parallel query... at
some point you just have to turn it on... even if you're not 100% sure
where all the sharp edges are yet... PostgreSQL meme: "I test in someone
else's production"

All of that said, FWIW, if a restart is taking too long then a user can
always turn the GUC off and cancel/retry the startup. So this is not the
same as a stuck checkpointer, because there's simple recourse.
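Roughly, that recourse looks like this (a sketch; the data directory
path is an assumption):

    # Sketch of the recourse above: interrupt the slow startup, disable
    # the GUC, and retry. PGDATA is an illustrative assumption.
    import subprocess

    PGDATA = "/var/lib/pgsql/data"   # hypothetical data directory

    # Cancel the in-progress startup.
    subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "immediate"],
                   check=True)

    # Disable temp-file cleanup during crash recovery; the last
    # occurrence of a parameter in postgresql.conf wins.
    with open(f"{PGDATA}/postgresql.conf", "a") as f:
        f.write("\nremove_temp_files_after_crash = off\n")

    # Retry; recovery now skips the temp-file scan.
    subprocess.run(["pg_ctl", "-D", PGDATA, "start"], check=True)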

For my part, I appreciate this discussion. I missed it if these points
were debated when the feature was first committed, and I can see
arguments both ways. It's not without precedent to have a new feature
turned off by default for its first major release version. But we're
talking about a corner-case situation, and it's not like users are
without recourse.

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services