Thread: Orphaned relations after crash/sigkill during CREATE TABLE
Postgres 12.4
I was directed in Slack to mention here that we're being impacted by Postgres leaving orphaned files in /base/<db> after a crash while a CREATE TABLE is being run in a transaction.
The issue is the same as the reproduction steps listed here [1], that is:
- Start a CREATE TABLE transaction for a large table
- Terminate the process via kill -9
- Files are left in /base that no longer have any filenode references, such that `pg_database_size()` grows very large while the total table+index size remains constant
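For concreteness, here's a minimal sketch of the pattern and how we observe the growth (table names are illustrative; the size queries are standard Postgres):

BEGIN;
CREATE TABLE report_copy AS SELECT * FROM source_table;
-- backend killed here (kill -9 / OOM) before COMMIT

-- after crash recovery, the database size keeps the orphaned space:
SELECT pg_size_pretty(pg_database_size(current_database()));

-- while the sum of table+index sizes stays constant:
SELECT pg_size_pretty(sum(pg_total_relation_size(oid)))
FROM pg_class
WHERE relkind IN ('r', 'i', 'm', 't');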
However, in our particular case we're using a managed/cloud Postgres server, and our `CREATE TABLE` transaction was being terminated by the OOM killer. Using a managed service, we don't have filesystem access to go and clear out these orphaned files. This caused our total database size to grow from 40GB of table+index data to 4TB on disk (still only 40GB of table+index data, with the other ~3.95TB being orphaned CREATE TABLE files).
I realize (per a blog post from Robert Haas [2] and from Slack conversation) that this is a known issue. I was directed here from Slack to mention that we were impacted by it and have no resolution, since we lack filesystem access and Postgres has no internal method for dealing with these orphaned relations. (Our only recourse currently is something like a pg_dump/pg_restore onto a clean instance in order to escape the orphaned files.)
-Jason

[1] https://github.com/bdrouvot/pg_orphaned#example-1
[2] https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html
On 8/18/20 12:35 PM, Jason Myers wrote:
> Postgres 12.4
>
> I was directed in Slack to mention here that we're being impacted by
> Postgres leaving orphaned files in /base/<db> after a crash while a
> CREATE TABLE is being run in a transaction.
>
> The issue is the same as the reproduction steps listed here [1], that is:
>
> - Start a CREATE TABLE transaction for a large table
> - Terminate the process via kill -9
> - Files are left in /base that no longer have any filenode references,
>   such that `pg_database_size()` grows very large while the total
>   table+index size remains constant

So from [1] you are using CREATE TABLE AS. Have you tried with:

BEGIN;
CREATE TABLE some_table AS SELECT some_data FROM other_table LIMIT 1 WITH NO DATA;
COMMIT;

The above gets you the table structure, but no data.

BEGIN;
INSERT INTO some_table SELECT * FROM other_table;
COMMIT;

The above populates the table. I have not tested it, but I'm going to assume that if you kill the above, the problem would not happen or would be fixable with DELETE FROM some_table or TRUNCATE some_table.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> Have you tried with:
>
> BEGIN;
> CREATE TABLE some_table AS SELECT some_data FROM other_table LIMIT 1 WITH NO DATA;
> COMMIT;
>
> The above gets you the table structure, but no data.
>
> BEGIN;
> INSERT INTO some_table SELECT * FROM other_table;
> COMMIT;
>
> The above populates the table.
Thanks -- we were indeed creating and populating the new table all in a single transaction.
I'll see if we can split this into two transactions so that the table structure is committed quickly. I think you're right that this would mostly sidestep the issue.
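Concretely, the plan is something like this (a sketch with illustrative names, using the CREATE TABLE ... AS ... WITH NO DATA form):

BEGIN;
-- structure only; commits quickly
CREATE TABLE some_table AS SELECT * FROM other_table WITH NO DATA;
COMMIT;

BEGIN;
-- bulk copy; if this one is killed, the empty table definition survives
-- and cleanup is a plain TRUNCATE some_table before retrying
INSERT INTO some_table SELECT * FROM other_table;
COMMIT;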
-Jason
p.s. Apologies if this is formatted wrong, this is my first mailing list post.
On 8/18/20 1:19 PM, Jason Myers wrote:
> Thanks -- we were indeed creating and populating the new table all in a
> single transaction.
>
> I'll see if we can split this into two transactions so that the table
> structure is committed quickly. I think you're right that this would
> mostly sidestep the issue.

Assuming the table you are pulling from is fairly static, it could also allow you to regulate the amount of data you transfer at any one time into the new table.

--
Adrian Klaver
adrian.klaver@aklaver.com
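To make that batching concrete, a sketch of populating in chunks (illustrative; this assumes other_table has a monotonically increasing integer id column):

BEGIN;
INSERT INTO some_table SELECT * FROM other_table WHERE id < 1000000;
COMMIT;

BEGIN;
INSERT INTO some_table SELECT * FROM other_table WHERE id >= 1000000 AND id < 2000000;
COMMIT;

-- ...and so on; each committed batch is durable, so a crash can only
-- affect the batch that is in flight.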
On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> So from [1] you are using CREATE TABLE AS. Have you tried with:
>
> BEGIN;
> CREATE TABLE some_table AS SELECT some_data FROM other_table LIMIT 1 WITH NO DATA;
> COMMIT;
>
> The above gets you the table structure, but no data.
>
> BEGIN;
> INSERT INTO some_table SELECT * FROM other_table;
> COMMIT;
>
> The above populates the table. I have not tested it, but I'm going to
> assume that if you kill the above, the problem would not happen or would
> be fixable with DELETE FROM some_table or TRUNCATE some_table.
I was able to implement this: the table is now created quickly in a first transaction and populated in a second transaction.
However, we were still seeing orphaned files after a crash, and I believe I've tracked it down to subsequent CREATE INDEX statements also leaving orphaned files (if they are running when the crash occurs).
Is that issue known as well? I don't believe I can use the same trick to sidestep that one...
-Jason
On 8/20/20 1:37 PM, Jason Myers wrote:
> I was able to implement this: the table is now created quickly in a
> first transaction and populated in a second transaction.
>
> However, we were still seeing orphaned files after a crash, and I believe
> I've tracked it down to subsequent CREATE INDEX statements also leaving
> orphaned files (if they are running when the crash occurs).

If the crashes are still being caused by the OOM killer, then it looks to me like you need a more capable cloud instance. Can you partition the tables to break the work into smaller units?

--
Adrian Klaver
adrian.klaver@aklaver.com
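For reference, a minimal sketch of the declarative partitioning Adrian suggests (available in PG 10+; the schema is illustrative):

CREATE TABLE some_table (
    id   bigint NOT NULL,
    data text
) PARTITION BY RANGE (id);

-- each partition can then be created and populated in its own short transaction
CREATE TABLE some_table_p0 PARTITION OF some_table FOR VALUES FROM (0) TO (1000000);
CREATE TABLE some_table_p1 PARTITION OF some_table FOR VALUES FROM (1000000) TO (2000000);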
Jason Myers <j.myers@brstrat.com> writes:
> However, we were still seeing orphaned files after a crash, and I believe
> I've tracked it down to subsequent CREATE INDEX statements also leaving
> orphaned files (if they are running when the crash occurs).
>
> Is that issue known as well? I don't believe I can use the same trick to
> sidestep that one...

Yeah, it's entirely intentional that we don't try to clean up orphaned disk files after a database crash. There's a long discussion of this and related topics in src/backend/access/transam/README. What that says about why not is that such files' contents might be useful for forensic analysis of the crash, and anyway "Orphan files are harmless --- at worst they waste a bit of disk space". A point not made in that text, but true anyway, is that it'd also be quite expensive to search a large database for orphaned files, so people would likely not want to pay that price on the way to getting their database back up.

There might be value in a user-invokable tool that runs in an existing non-crashed database and looks for orphan files, but I'm not aware that anyone has written one. (Race conditions against concurrent table creation would be a problem; but probably that can be finessed somehow, maybe by noting the file's creation time.)

In the meantime, I've got to say that routinely kill -9'ing database processes just doesn't seem like a very good idea. Yeah, we do our best to ensure that there won't be data loss, but you're really doubling down on a hard assumption that Postgres contains zero bugs when you operate that way. I'd suggest reconfiguring things to avoid the OOM-kill hazard; or if your cloud provider makes that effectively impossible, maybe you need another provider. But on most systems I'd think you could use ulimit or the like even if you don't have root privileges.

			regards, tom lane
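For illustration, a rough sketch of the kind of check such a tool might run from SQL -- an assumed approach, not an existing utility: it requires superuser (for pg_ls_dir), only inspects the current database's default-tablespace directory, and is subject to the race conditions noted above:

SELECT f AS possibly_orphaned_file
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())) AS f
WHERE f ~ '^[0-9]+$'   -- skip fork files (_fsm, _vm) and extra segments (.1, .2)
  AND f::oid NOT IN (SELECT pg_relation_filenode(oid)
                     FROM pg_class
                     WHERE pg_relation_filenode(oid) IS NOT NULL);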
On Thu, Aug 20, 2020 at 5:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jason Myers <j.myers@brstrat.com> writes:
>> However, we were still seeing orphaned files after a crash, and I
>> believe I've tracked it down to subsequent CREATE INDEX statements also
>> leaving orphaned files (if they are running when the crash occurs).
>>
>> Is that issue known as well? I don't believe I can use the same trick
>> to sidestep that one...
>
> Yeah, it's entirely intentional that we don't try to clean up orphaned
> disk files after a database crash. There's a long discussion of this and
> related topics in src/backend/access/transam/README. What that says about
> why not is that such files' contents might be useful for forensic
> analysis of the crash, and anyway "Orphan files are harmless --- at worst
> they waste a bit of disk space". A point not made in that text, but true
> anyway, is that it'd also be quite expensive to search a large database
> for orphaned files, so people would likely not want to pay that price on
> the way to getting their database back up.
>
> There might be value in a user-invokable tool that runs in an existing
> non-crashed database and looks for orphan files, but I'm not aware that
> anyone has written one. (Race conditions against concurrent table
> creation would be a problem; but probably that can be finessed somehow,
> maybe by noting the file's creation time.)
>
> In the meantime, I've got to say that routinely kill -9'ing database
> processes just doesn't seem like a very good idea. Yeah, we do our best
> to ensure that there won't be data loss, but you're really doubling down
> on a hard assumption that Postgres contains zero bugs when you operate
> that way. I'd suggest reconfiguring things to avoid the OOM-kill hazard;
> or if your cloud provider makes that effectively impossible, maybe you
> need another provider. But on most systems I'd think you could use ulimit
> or the like even if you don't have root privileges.
>
> regards, tom lane
Understood, thanks for the reply.
-Jason
On 8/20/20 14:46, Tom Lane wrote:
> There might be value in a user-invokable tool that runs in an existing
> non-crashed database and looks for orphan files, but I'm not aware that
> anyone has written one. (Race conditions against concurrent table
> creation would be a problem; but probably that can be finessed somehow,
> maybe by noting the file's creation time.)

Bertrand Drouvot just put out a tool to list potentially orphaned files a couple of months ago:

https://github.com/bdrouvot/pg_orphaned

Implemented as an extension; it doesn't remove the files, but it is useful nonetheless.

-Jeremy

--
http://about.me/jeremy_schneider
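For anyone finding this thread later, usage is roughly as below (sketched from the project README; the function name is taken from its docs and has not been verified here):

CREATE EXTENSION pg_orphaned;
SELECT * FROM pg_list_orphaned();  -- lists suspected orphaned files without deleting anything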