[BUGS] BUG #14555: EBUSY error on read() on NFS - Mailing list pgsql-bugs

From ashwath.rao@altair.com
Subject [BUGS] BUG #14555: EBUSY error on read() on NFS
Msg-id 20170220030121.1265.79953@wrigleys.postgresql.org
Responses Re: [BUGS] BUG #14555: EBUSY error on read() on NFS  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14555
Logged by:          Ashwath Rao
Email address:      ashwath.rao@altair.com
PostgreSQL version: 9.3.6
Operating system:   SLES11SP4
Description:

We use PostgreSQL version 9.3.6 with our product PBS Professional. We're
seeing an issue at one of our customers that seems to be caused by the
filesystem hosting the datastore, but was only exposed by the PostgreSQL
update. The datastore sometimes does not work. When we try to dump the
datastore, we get messages very similar to those in pg_log:

# /opt/pbs/default/pgsql/bin/pg_dump -U <USER> -p 15007 pbs_datastore >
pbs_datastore_14022017.sql
Password: 
pg_dump: Dumping the contents of table "job_attr" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR:  could not read block 69600 in
file "base/16384/16555": Device or resource busy
pg_dump: The command was: COPY pbs.job_attr (ji_jobid, attr_name,
attr_resource, attr_value, attr_flags) TO stdout;


Once this happens, the errors seem to be confined to that one file, but EBUSY
is _really_ puzzling. It's not one of the errno values documented for the
read() system call.
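
For reference, "Device or resource busy" is the message glibc prints for
EBUSY. A minimal Python sketch for confirming that mapping on the affected
host follows; describe_errno is a hypothetical helper name used only for
illustration, not something from PostgreSQL or PBS:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (number): message' for an errno value on this host."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# EBUSY is what the server reported; EIO and EINTR are shown for comparison
# as errors more commonly associated with read().
for code in (errno.EBUSY, errno.EIO, errno.EINTR):
    print(describe_errno(code))
```

The exact message text depends on the platform's C library, so this only
confirms which errno the server's "%m"-style message corresponds to.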

Bizarrely, problems seem to be reported on a small number of blocks:
# grep ERR /panfs/e/PBS/datastore/pg_log/pbs_dataservice_log.Tue
2017-02-14 08:13:33 UTCERROR:  could not read block 69600 in file
"base/16384/16555": Device or resource busy
2017-02-14 08:15:59 UTCERROR:  could not read block 99298 in file
"base/16384/16555": Device or resource busy
2017-02-14 08:37:29 UTCERROR:  could not read block 9608 in file
"base/16384/16555": Device or resource busy

But it's not consistently the same block:
# /opt/pbs/default/pgsql/bin/pg_dump -U <USER> -p 15007 pbs_datastore >
pbs_datastore_17012017_new.sql
Password: 
pg_dump: Dumping the contents of table "job_attr" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR:  could not read block 105740 in
file "base/16384/16555": Device or resource busy
pg_dump: The command was: COPY pbs.job_attr (ji_jobid, attr_name,
attr_resource, attr_value, attr_flags) TO stdout;
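
To see where the failing blocks sit inside the file, the block numbers can be
converted to byte offsets. This is only a sketch: it assumes the default
8 kB PostgreSQL block size (we have not changed it, but the value is an
assumption here), and block_offset is a hypothetical helper name:

```python
BLOCK_SIZE = 8192        # assumed: PostgreSQL's default block size
FILE_SIZE = 1073741824   # size of base/16384/16555 as reported by dd


def block_offset(block: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of the first byte of a block within the file."""
    return block * block_size


# Block numbers taken from the error messages in this report.
failing_blocks = [69600, 99298, 9608, 105740]

for block in failing_blocks:
    start = block_offset(block)
    end = start + BLOCK_SIZE - 1
    inside = end < FILE_SIZE
    print(f"block {block}: bytes {start}..{end} (inside file: {inside})")
```

Under that assumption, all four blocks lie within the 1 GB file and are
spread across it rather than clustered in one region.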

Once the datastore is in this state, even the usual error recovery methods
don't work reliably:
# $PBS_EXEC/pgsql/bin/psql -U <USER> -p 15007 -d pbs_datastore
Password for user crayadm: 
psql (9.3.6)
Type "help" for help.
 
pbs_datastore=# set search_path to pbs;
SET
pbs_datastore=# set zero_damaged_pages=on;
SET
pbs_datastore=# vacuum full;
ERROR:  could not read block 99298 in file "base/16384/16555": Device or
resource busy


The file can be read *sequentially* without any issue, though:

dd if=/panfs/e/PBS/datastore/base/16384/16555 of=/tmp/16555 
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 10.1065 s, 106 MB/s
xcepbs00:~ # echo $?
0


Last time we had this, just making a tarball and untarring it fixed
everything! In other words, the files appeared to be readable sequentially
without any issue, but the I/O pattern used by PostgreSQL seemed to cause
the trouble.
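
The dd run reads the file front to back in 512-byte records, whereas the
server fetches individual blocks at arbitrary offsets. A rough sketch for
reproducing that kind of access outside PostgreSQL follows. It assumes the
default 8 kB block size and that a seek followed by a read is close enough
to what the server does; probe_blocks is a hypothetical helper, not part of
any PostgreSQL tool:

```python
import errno
import os
import sys

BLOCK_SIZE = 8192  # assumed: PostgreSQL's default block size


def probe_blocks(path, blocks, block_size=BLOCK_SIZE):
    """Seek to each block and read it once.

    Returns a dict mapping block number to None on success, "SHORT_READ"
    if fewer than block_size bytes came back, or the errno name on failure.
    """
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for block in blocks:
            try:
                os.lseek(fd, block * block_size, os.SEEK_SET)
                data = os.read(fd, block_size)
                results[block] = None if len(data) == block_size else "SHORT_READ"
            except OSError as exc:
                # Record the symbolic errno name, e.g. "EBUSY" or "EIO".
                results[block] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    return results


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python probe_blocks.py <relation file> <block> [<block> ...]
    path = sys.argv[1]
    blocks = [int(arg) for arg in sys.argv[2:]]
    for block, status in probe_blocks(path, blocks).items():
        print(f"block {block}: {status or 'ok'}")
```

Running it against /panfs/e/PBS/datastore/base/16384/16555 with the block
numbers from the log (69600, 99298, 9608, 105740) would show whether the
error can be triggered by non-sequential reads alone.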

We are trying to find out what can actually return EBUSY. Is this on a read()
call, or could it be another call? Does the message tell us anything about
_where_ in the code the failure occurred?


