Thread: Deleting a table file does not raise an error when the table is touched afterwards, why?

Hi,

I need to understand something. Let's assume I have a table t5 with 1'000'000 rows:

(postgres@[local]:5432) [sample] > select count(*) from t5;
  count  
---------
 1000000
(1 row)

Time: 2363.834 ms
(postgres@[local]:5432) [sample] >

I get the file for that table:

postgres@pg_essentials_p1:/u02/pgdata/PG1/base/16422/ [PG1] oid2name -d sample -t t5
From database "sample":
  Filenode  Table Name
----------------------
     32809          t5
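
As a cross-check, pg_relation_filepath reports the same path from within psql:

(postgres@[local]:5432) [sample] > select pg_relation_filepath('t5');
 pg_relation_filepath
----------------------
 base/16422/32809
(1 row)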


Then I delete the file:

postgres@pg_essentials_p1:/u02/pgdata/PG1/base/16422/ [PG1] rm 32809

When doing the count(*) on the table again:

(postgres@[local]:5432) [sample] > select count(*) from t5;
  count  
---------
 1000000
(1 row)

No issue in the log. This is probably coming from the cache, isn't it? Is this intended and safe?

Then I restart the instance and do the select again:

2016-05-30 19:25:20.633 CEST - 9 - 2777 -  - @ FATAL:  could not open file "base/16422/32809": No such file or directory
2016-05-30 19:25:20.633 CEST - 10 - 2777 -  - @ CONTEXT:  writing block 8192 of relation base/16422/32809

(postgres@[local]:5432) [sample] > select count(*) from t5;
 count  
--------
 437920
(1 row)

Can someone please tell me the intention behind that? From my point of view this is dangerous. If nobody is monitoring the log (which sadly is often the case in reality), nobody will notice that only parts of the table are there. Wouldn't it be much safer to raise an error as soon as the table is touched?

PostgreSQL version:

(postgres@[local]:5432) [sample] > select version();
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------
version | PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit

Thanks in advance
Daniel



On Mon, May 30, 2016 at 11:35 AM, Daniel Westermann <daniel.westermann@dbi-services.com> wrote:
[snip]

I have not heard of pg_essentials, but obviously it is external to the PostgreSQL server.
PostgreSQL cannot tell if someone is intentionally messing with the file system. You have removed only the first file node with rm 32809.
First off, you should never do that. If you want to drop the table, then do DROP TABLE t5;
That will drop all the file nodes for that table.
You may as well ask "If I shoot myself in the head, why don't I feel any pain?".
You could also do rm -r *.* if you really want to screw the pooch. The O/S won't complain, but you will be very sorry!
--
Melvin Davidson
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.

Hi Daniel:

On Mon, May 30, 2016 at 5:35 PM, Daniel Westermann
<daniel.westermann@dbi-services.com> wrote:
> I get the file for that table:
...
> Then I delete the file:

Well, you corrupted the database and invoked undefined behaviour ( not
exactly, but postgres is not designed for this ).

> No issue in the log. This is probably coming from the cache, isn't it? Is
> this intended and safe?

It's probably not intended. It can come from the cache, or it can
arise from the fact that you are running a unix flavour. In unix ( at
the OS level, in the classical filesystems ) you do not delete a file,
you unlink it ( remove the pointer to it in the directory ); the file
is removed by the OS when nobody can reach it, which means nobody has
it open and no directory points to it ( so no one else can open it, it
is like reference counting ). ( In fact this behaviour is used on
purpose for temporary files: you open it, unlink it, and know that
when you exit, either normally or by crashing, the OS deletes it. )
Postgres has the file open, and probably does not bother checking
whether somebody removed it from under it in the directory, as there
is no correct behaviour in this case, so no point in checking.
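
A minimal shell sketch of that unlink behaviour, assuming Linux with
/proc ( the PID, descriptor number and paths are illustrative ):

$ echo hello > /tmp/demo
$ tail -f /tmp/demo &          # some process keeps the file open
[1] 4242
$ rm /tmp/demo                 # removes only the directory entry
$ ls -l /proc/4242/fd          # the open descriptor still works...
lr-x------ ... 3 -> /tmp/demo (deleted)
$ cat /proc/4242/fd/3          # ...and the data is still readable
hello
$ kill %1                      # last open handle gone: the kernel now
                               # really frees the blocks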

> Then I restart the instance and do the select again:
> 2016-05-30 19:25:20.633 CEST - 9 - 2777 -  - @ FATAL:  could not open file
> "base/16422/32809": No such file or directory

As expected.

> Can someone please tell me the intention behind that? From my point of view
> this is dangerous. If nobody is monitoring the log (which sadly is the case
> in reality) nobody will notice that only parts of the table are there.
> Wouldn't it be much more safe to raise an error as soon as the table is
> touched?

If you are going to implement idealised behaviour, prohibiting people
from deleting the file in the first place would be better.

Any user with minimal knowledge and enough privileges can put programs
in states from which they cannot recover; there is no point in
checking every corner case. In fact, if you can remove the file from
under the server's feet, you can probably also alter the running
server's memory: what do you think the correct behaviour would be for
a 'poke rand(),rand()' into the server process? It could keep a
triple-redundant copy of every page and try to vote and detect
corruption on every instruction, but that is pointless.

Francisco Olarte.


On 30 May 2016 at 16:35, Daniel Westermann <daniel.westermann@dbi-services.com> wrote:
... 
Then I delete the file:
...
No issue in the log. This is probably coming from the cache, isn't it? Is this intended and safe?

Postgres manages your data for you. What you're doing is not a supported use case, and I recommend not doing that in the future.
 
Can someone please tell me the intention behind that? From my point of view this is dangerous. If nobody is monitoring the log (which sadly is the case in reality) nobody will notice that only parts of the table are there. Wouldn't it be much more safe to raise an error as soon as the table is touched?

How would we know that an external agent had deleted the file? What action should we take if we did notice?

It's a very good thing that we remain flying even with multiple bullet holes in the wings.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Daniel Westermann <daniel.westermann@dbi-services.com> writes:
> Then I delete the file:
> postgres@pg_essentials_p1:/u02/pgdata/PG1/base/16422/ [PG1] rm 32809

There's a reason why the database directory is not readable/writable
by unprivileged users: it's to prevent them from doing dumb things
like that.  People who do have write access on the database are
assumed to know better.

> Wouldn't it be much more safe to raise an error as soon as the table is touched?

Well, yes, but it would impose huge amounts of overhead in order to
raise an error a bit sooner for a stupid user action.  The ideal
thing would be to prevent users from breaking their database in the
first place --- but there's not much we can do in that direction
beyond setting the directory permissions.

            regards, tom lane



On 30.05.2016 18:35, Daniel Westermann wrote:
[snip]


Hi. If you delete a file that an external process has open, that process will never notice it. Only after it closes its file handle will you run into issues like "file does not exist" and others.

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Mon, 30 May 2016 17:35:34 +0200 (CEST), Daniel Westermann
(daniel.westermann@dbi-services.com) wrote about "[GENERAL] Deleting a
table file does not raise an error when the table is touched
afterwards, why?" (in
<184509399.5590018.1464622534207.JavaMail.zimbra@dbi-services.com>):

[snip]
> Then I delete the file:
>
> postgres@pg_essentials_p1:/u02/pgdata/PG1/base/16422/ [PG1] rm
> 32809

Actually, you are not deleting the file. You are asking the filesystem
driver to delete it when it has stopped being used. The directory
entry is removed immediately though, so that no other process can open
it.

> When doing the count(*) on the table again:
>
> (postgres@[local]:5432) [sample] > select count(*) from t5; count
> --------- 1000000 (1 row)
>
> No issue in the log. This is probably coming from the cache, isn't
> it?

No, the file still exists because a PG back-end still has it open.

> Is this intended and safe?

It is standard UNIX behaviour. It is not safe because you are not
supposed to do things that way.
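
One way to see this, assuming Linux and a backend that has already
read the table in the current session ( the PID and descriptor number
are illustrative ):

(postgres@[local]:5432) [sample] > select pg_backend_pid();
 pg_backend_pid
----------------
           2777
(1 row)

postgres@pg_essentials_p1:~ [PG1] ls -l /proc/2777/fd | grep deleted
lrwx------ ... 42 -> /u02/pgdata/PG1/base/16422/32809 (deleted)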
--
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.noon@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*



[snip]

Thanks all for your answers. Maybe I should have provided more background information: we had an internal workshop today and one of the topics was backup/restore. One of the questions was what would happen if (for any reason) a file gets deleted, so we tested it. I am aware that this is not a common use case, but I still want to understand why PostgreSQL works the way described. From the answers I understand this:

- the file is not really deleted because PostgreSQL is still using it => correct?
- if the above is correct, why does PostgreSQL only write a partial file back to disk/WAL? For me this still seems dangerous, as potentially nobody will notice it
- PostgreSQL assumes that someone with write access to the files knows what she/he is doing. OK, but still, in the real world cases like this happen (for whatever reason)

Simon's answer:
- It's a very good thing that we remain flying even with multiple bullet holes in the wings.
Really? It depends on how you look at it, doesn't it? I'd prefer to get an error in this case; maybe I am wrong, but I'd rather be notified when a file is missing than get results

Tom's answer:
- Well, yes, but it would impose huge amounts of overhead in order to raise an error a bit sooner for a stupid user action.  The ideal thing would be to prevent users from breaking their database in the first place --- but there's not much we can do in that direction beyond setting the directory permissions.
Ok, makes sense. But "a bit sooner" than what? The count(*) just returns a result. From a user perspective I have no idea that the result is wrong

Thanks again
Daniel



On Mon, 30 May 2016 19:49:36 +0200 (CEST), Daniel Westermann
(daniel.westermann@dbi-services.com) wrote about "Re: [GENERAL]
Deleting a table file does not raise an error when the table is
touched afterwards, why?" (in
<1247360337.5599235.1464630576430.JavaMail.zimbra@dbi-services.com>):

[snip]
> Thanks all for your answers. Maybe I should have provided more
> background information: We had an internal workshop today and one
> of the topics was backup/restore. One of the questions was what
> will happen if (for any reason) a file gets deleted so we tested
> it. I am aware that this is not the common use case. But still I
> want to understand why PostgreSQL works the way described. From the
> answers I understand this:
>
> - the file is not really deleted because PostgreSQL is still using
> it => correct?

It is correct.

> - if the above is correct why does PostgreSQL only write a partial
> file back to disk/wal?

PG only writes modified pages to WAL, even if another process has
requested the deletion of the relation's data file. In fact, PG is
not even aware of the deletion request.
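
If you want to see what did reach the WAL, the pg_xlogdump tool shipped
with 9.3 and later can decode it. A sketch - the segment name is
illustrative and the output is abbreviated:

postgres@pg_essentials_p1:~ [PG1] pg_xlogdump /u02/pgdata/PG1/pg_xlog/000000010000000000000007 | grep 32809
rmgr: Heap  ... rel 1663/16422/32809 ...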

> For me this still seems dangerous as potentially nobody will notice
> it

It is intrinsically dangerous, which is why only the root and postgres
userids have write permission on the filesystems used by PG.

Indeed, if you use an alternative security model, even root will not
have write permission; only postgres will.

> - PostgreSQL assumes that someone with write access to the files
> knows what she/he is doing. ok, but still, in the real world cases
> like this happen (for whatever reason)

It is very unlikely to happen once; it should never happen twice. No
competent DBA would do that once, let alone twice; if a DBA does that
once, he/she ceases to be a DBA.

The issue you are raising here is the misuse of filesystem
permissions. There is nothing PG can do here beyond what it already
does: check that the permissions mask and ownership are as tight as
possible during start-up of each tablespace.
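
You can see that check in action: with loose permissions on the data
directory the server refuses to start. Roughly:

postgres@pg_essentials_p1:~ [PG1] chmod 770 /u02/pgdata/PG1
postgres@pg_essentials_p1:~ [PG1] pg_ctl start -D /u02/pgdata/PG1
FATAL:  data directory "/u02/pgdata/PG1" has group or world access
DETAIL:  Permissions should be u=rwx (0700).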

This reminds me of the old music hall joke: a patient goes to the
doctor, raises his arm and says "Doc, it hurts when I do this!", to
which the doctor replies "Well don't do that."

Basically, it behoves your support staff to manage the physical
filesystems correctly and not damage them.
--
Regards,

Dave  [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
david.w.noon@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*



Daniel Westermann <daniel.westermann@dbi-services.com> writes:
> - if the above is correct why does PostgreSQL only write a partial file back to disk/wal? For me this still seems
> dangerous as potentially nobody will notice it

In quiescent circumstances, Postgres wouldn't have written anything at
all, and the file would have disappeared completely at server shutdown,
and you would have gotten some sort of file-not-found error when you tried
the "count(*)" after restarting.  I hypothesize that you did an unclean
shutdown leading to replaying some amount of WAL at restart, and that WAL
included writing at least one block of the file (perhaps as a result of a
hint-bit update, or some other not-user-visible maintenance operation,
rather than anything you did explicitly).  The WAL replay code will
recreate the file if it doesn't exist on-disk --- this is important for
robustness.  Then you'd have a file that exists on-disk but is partially
filled with empty pages, which matches the observed behavior.  Depending
on various details you haven't provided, this might be indistinguishable
from a valid database state.
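
(One way to poke at such a state, assuming the pageinspect extension
is installed - a sketch, output abbreviated:

(postgres@[local]:5432) [sample] > create extension pageinspect;
(postgres@[local]:5432) [sample] > select lsn, lower, upper from page_header(get_raw_page('t5', 0));
 lsn | lower | upper
-----+-------+-------
 0/0 |     0 |     0
(1 row)

An all-zero header like that is what a freshly extended, or
replay-recreated, block looks like.)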

> - PostgreSQL assumes that someone with write access to the files knows what she/he is doing. ok, but still, in the
> real world cases like this happen (for whatever reason)

[ shrug... ] There's also an implied contract that you don't do "rm -rf /",
or shoot the disk drive full of holes with a .45, or various other
unrecoverable actions.  We're not really prepared to expend large amounts
of developer effort, or large amounts of runtime overhead, to detect such
cases.  (In particular, the fact that all-zero pages are a valid state is
unfortunate from this perspective, but it's more or less forced by
robustness concerns associated with table-extension behavior.  Most users
would not thank us for making table extension slower in order to issue a
more intelligible error for examples like this one.)

            regards, tom lane


On Mon, May 30, 2016 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[snip]


I suspect that page checksums might have detected the broken state, at least if any of the written pages were partial, since the non-overwritten zeros on a partially written page would have resulted in a different checksum.

[snip]

rant

I have to think that we can reasonably ascribe unexpected system state to causes other than human behavior.  In both of the other examples PostgreSQL would fail to start, so I'd say we have expected behavior in the face of those particular unexpected system states.

IMO too much attention is being paid to the act of recreating the problem.  But even if we presume that the only viable way to recreate this circumstance is to do so intentionally, we've documented a clever way for someone to mess with the system in a subtle manner.

Up until Tom's last email I got very little out of the discussion.  It doesn't fill me with confidence when such an important topic is taken too glibly.  I suspect a large number of uses of PostgreSQL are in situations where, if the application works, everything is assumed to be fine.  People know that random things happen to hardware and that software can have bugs.  That is what this thread describes - a potential situation that could happen due to non-human causes and that results in a somewhat silently mis-operating system.

There is still quite a bit of hand-waving here though - and I don't know whether being more precise really does an end-user enough good that it would be worth writing up in the user-facing docs.  Like all areas I'm sure this is open to improvement, but I'm sufficiently happy that an event of this precision is sufficiently unlikely to warrant the present behavior.

/rant

David J.

On Mon, May 30, 2016 at 3:32 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
​I have to think that we can reasonably ascribe unexpected system state to causes other than human behavior.  In both of the other examples PostgreSQL would fail to start so I'd say we have expected behavior in the face of those particular unexpected system states.

Alex Ignatov started a new thread on this topic as well...


I posted a link to this thread on his new one as well.

David J.


for completeness: same issue with data checksums enabled:

(postgres@[local]:5555) [postgres] > show data_checksums ;
 data_checksums
----------------
 on
(1 row)

Steps to reproduce:
create table t1 ( a int );
insert into t1 values ( generate_series(1,1000000));
select count(*) from t1;

=> rm the table files (one way to locate them is sketched after the count output below)

=> select count(*) still works

(postgres@[local]:5555) [postgres] > select count(*) from t1;
  count  
---------
 1000000
(1 row)
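
For the "rm the table files" step, a sketch - 16384 is the filenode
the log lines below report, and the data directory path is
illustrative:

postgres@pg_essentials_p1:$PGDATA/base/13056/ [PG1] ls 16384*
16384
postgres@pg_essentials_p1:$PGDATA/base/13056/ [PG1] rm 16384*   # main fork plus any _fsm/_vm forks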

=> then shutdown fast:

LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
LOG:  autovacuum launcher shutting down
LOG:  shutting down
FATAL:  could not open file "base/13056/16384": No such file or directory
CONTEXT:  writing block 2946 of relation base/13056/16384
LOG:  checkpointer process (PID 3004) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  abnormal database system shutdown

=> startup

LOG:  database system shutdown was interrupted; last known up at 2016-05-31 10:52:17 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/7070808
LOG:  record with zero length at 0/76EECD0
LOG:  redo done at 0/76EEC68
LOG:  last completed transaction was at log time 2016-05-31 10:51:27.689776+02
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

(postgres@[local]:5555) [postgres] > select count(*) from t1;
 count  
--------
 186450
(1 row)





Hi:

On Tue, May 31, 2016 at 8:58 AM, Daniel Westermann
<daniel.westermann@dbi-services.com> wrote:
> for completeness: same issue with data checksums enabled:
...
> => rm the table files
> => select count(*) still works

And? I would vote for not doing anything about this; after all, you
are working outside the 'envelope' here. It is like saying 'I commit
with fsync enabled, but then I zero the disk and the data is lost'. As
I said, it is like undefined behaviour in C. Currently *0 tends to
SIGSEGV on both reads and writes, but I've worked on OSs where it did
so only on writes, and on ones where it worked fine ( like CP/M ). For
you the correct behaviour may be to fail fast and lose a little speed;
for others the current one may be OK. Normally for these things you go
for the path with less code, as deleted code is bug-free.

Francisco Olarte.