Thread: [HACKERS] wait events for disk I/O

[HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:
Hi All,

Attached is a patch which extends the existing wait event infrastructure to
implement wait events for disk I/O. Basically, pg_stat_activity's wait
event information now shows data about disk I/O as well as IPC primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO.
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just before
performing the file system operation and clear it just after (see the
sketch below).
- Pass down an appropriate wait event from the callers of those functions.
- Also set and clear a wait event around standalone calls to read(),
write(), and fsync() in other parts of the system.
- Added documentation for all newly added wait events.
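
The wrapped calls end up looking roughly like this (a simplified sketch of
the fd.c pattern; the retry loop and error handling are omitted):

    /* Simplified sketch of the fd.c pattern; not the exact patch text. */
    int
    FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
    {
        int         returnCode;

        returnCode = FileAccess(file);
        if (returnCode < 0)
            return returnCode;

        /* Advertise the wait event just before the system call ... */
        pgstat_report_wait_start(wait_event_info);
        returnCode = read(VfdCache[file].fd, buffer, amount);
        /* ... and clear it immediately afterwards. */
        pgstat_report_wait_end();

        return returnCode;
    }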

Open issue:
- Might have missed a few standalone calls to read(), write(), etc. which
need to pass wait_event_info.

Thanks to my colleague Robert Haas for his help with the design.

Please let me know your thoughts, and thanks for reading.

Thanks,
Rushabh Lathia
Attachment

Re: [HACKERS] wait events for disk I/O

From
Michael Paquier
Date:
On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attached is a patch which extends the existing wait event infrastructure
> to implement wait events for disk I/O. Basically, pg_stat_activity's
> wait event information now shows data about disk I/O as well as IPC primitives.
>
> Implementation details:
>
> - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO.
> - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
> FileWriteback, FileSync, and FileTruncate. Set this wait event just before
> performing the file system operation and clear it just after.
> - Pass down an appropriate wait event from the callers of those functions.
> - Also set and clear a wait event around standalone calls to read(),
> write(), and fsync() in other parts of the system.
> - Added documentation for all newly added wait events.

Looks neat, those are unlikely to overlap with other wait events.

> Open issue:
> - Might have missed a few standalone calls to read(), write(), etc. which
> need to pass wait_event_info.

It may be an idea to use LD_PRELOAD with custom versions of read(),
write(), and fsync(), look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.
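
A minimal sketch of such a shim might look like the following. This is
untested and assumes the shim is compiled against the backend headers so
that MyProc resolves against the running postgres binary:

    /* Hypothetical LD_PRELOAD shim -- untested sketch, for illustration. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    #include "postgres.h"
    #include "storage/proc.h"       /* MyProc, wait_event_info */

    ssize_t
    read(int fd, void *buf, size_t count)
    {
        static ssize_t (*real_read) (int, void *, size_t) = NULL;

        if (real_read == NULL)
            real_read = (ssize_t (*) (int, void *, size_t))
                dlsym(RTLD_NEXT, "read");

        /* Complain about call sites that did not advertise a wait event. */
        if (MyProc != NULL && MyProc->wait_event_info == 0)
            write(STDERR_FILENO, "read() without wait event\n", 26);

        return real_read(fd, buf, count);
    }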

> Thanks to my colleague Robert Haas for his help with the design.
> Please let me know your thoughts, and thanks for reading.

Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?
-- 
Michael



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:


On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attached is a patch which extends the existing wait event infrastructure
> to implement wait events for disk I/O. Basically, pg_stat_activity's
> wait event information now shows data about disk I/O as well as IPC primitives.
>
> Implementation details:
>
> - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO.
> - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
> FileWriteback, FileSync, and FileTruncate. Set this wait event just before
> performing the file system operation and clear it just after.
> - Pass down an appropriate wait event from the callers of those functions.
> - Also set and clear a wait event around standalone calls to read(),
> write(), and fsync() in other parts of the system.
> - Added documentation for all newly added wait events.

Looks neat, those are unlikely to overlap with other wait events.

Thanks.
 

> Open issue:
> - Might have missed a few standalone calls to read(), write(), etc. which
> need to pass wait_event_info.

It may be an idea to use LD_PRELOAD with custom versions of read(),
write(), and fsync(), look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.


Yes, maybe I will try this.
 
> Thanks to my colleague Robert Haas for his help with the design.
> Please let me know your thoughts, and thanks for reading.

Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?

I thought about that, but eventually stuck with this approach as it looks
neater and more uniform with the other wait event implementations.
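
For reference, such a wrapper might have looked like this (a hypothetical
sketch; the name and signature are illustrative only, and this is not what
the patch does):

    /* Hypothetical wrapper -- illustrative only, not part of the patch. */
    static inline ssize_t
    pg_read_event(int fd, void *buf, size_t count, uint32 wait_event_info)
    {
        ssize_t     rc;

        pgstat_report_wait_start(wait_event_info);
        rc = read(fd, buf, count);
        pgstat_report_wait_end();

        return rc;
    }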

 
--
Michael



Thanks, 
Rushabh Lathia

Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:

My colleague Rahila reported a compilation issue with
the patch. The issue only appeared when doing a clean
build on the branch.

This is fixed in the latest version of the patch.

Thanks,
 




--
Rushabh Lathia
Attachment

Re: [HACKERS] wait events for disk I/O

From
Amit Kapila
Date:
On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>
> My colleague Rahila reported a compilation issue with
> the patch. The issue only appeared when doing a clean
> build on the branch.
>
> This is fixed in the latest version of the patch.
>

Few assorted comments:

1.
+        <row>
+         <entry><literal>WriteBuffile</></entry>
+         <entry>Waiting during
buffile read operation.</entry>
+        </row>

The operation name and the definition don't match.


2.
+FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
 {
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
  int returnCode;
@@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount)
  if (returnCode < 0)
  return returnCode;

+ pgstat_report_wait_start(wait_event_info);
  returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
    POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();

AFAIK, this call is non-blocking and will just initiate a read, so do
you think we should record a wait event for such a call?

3.
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len,
+ WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK);
  if (written != len)
  ereport(ERROR,
  (errcode_for_file_access(),
@@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state)
  hash_seq_init(&seq_status, state->rs_logical_mappings);
  while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
  {
- if (FileSync(src->vfd) != 0)
+ if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0)


Do we want to consider recording a wait event for both write and sync?
It seems to me OS-level writes are relatively cheap and sync calls are
expensive, so shouldn't we just log sync calls?  I see similar usage at
multiple places in the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:


On Sat, Mar 4, 2017 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>
> My colleague Rahila reported a compilation issue with
> the patch. The issue only appeared when doing a clean
> build on the branch.
>
> This is fixed in the latest version of the patch.
>

Few assorted comments:

1.
+        <row>
+         <entry><literal>WriteBuffile</></entry>
+         <entry>Waiting during
buffile read operation.</entry>
+        </row>

The operation name and the definition don't match.


Will fix this.
 

2.
+FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
 {
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
  int returnCode;
@@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount)
  if (returnCode < 0)
  return returnCode;

+ pgstat_report_wait_start(wait_event_info);
  returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
    POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();

AFAIK, this call is non-blocking and will just initiate a read, so do
you think we should record a wait event for such a call?


Yes, the prefetch call is non-blocking and just initiates a read. But
recording the prefetch in the wait events will give more information
about the system.

3.
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len,
+ WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK);
  if (written != len)
  ereport(ERROR,
  (errcode_for_file_access(),
@@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state)
  hash_seq_init(&seq_status, state->rs_logical_mappings);
  while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
  {
- if (FileSync(src->vfd) != 0)
+ if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0)


Do we want to consider recording a wait event for both write and sync?
It seems to me OS-level writes are relatively cheap and sync calls are
expensive, so shouldn't we just log sync calls?  I see similar usage at
multiple places in the patch.


Yes, I thought of adding a wait event only for the sync, but then decided to
record the wait event for both write and sync. I understand that OS-level
writes are cheap, but they still have some cost attached. I also thought that
a monitoring tool developed using these wait events would capture more useful
data if we try to collect as much information as we can. Or maybe not.

I am open to other opinions/suggestions.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



--
Rushabh Lathia

Re: [HACKERS] wait events for disk I/O

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Yes, I thought of adding a wait event only for the sync, but then decided to
> record the wait event for both write and sync. I understand that OS-level
> writes are cheap, but they still have some cost attached. I also thought that
> a monitoring tool developed using these wait events would capture more useful
> data if we try to collect as much information as we can. Or maybe not.
>
> I am open to other opinions/suggestions.

Writes are NOT always fast.  I've seen cases of write() blocking for
LONG periods of time on systems that are in the process of failing, or
just busy.  So I think we certainly want to advertise a wait event for
those.

Likewise, I agree that the prefetch call probably SHOULDN'T block, but
just because it shouldn't doesn't mean it never will.

I think somebody should try a pgbench run with this patch applied,
using a scale factor greater than shared_buffers, and generate a wait
event profile, just to see if these are showing up and how often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] wait events for disk I/O

From
Michael Paquier
Date:
On Tue, Mar 7, 2017 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
>> Yes, I thought of adding a wait event only for the sync, but then decided to
>> record the wait event for both write and sync. I understand that OS-level
>> writes are cheap, but they still have some cost attached. I also thought that
>> a monitoring tool developed using these wait events would capture more useful
>> data if we try to collect as much information as we can. Or maybe not.
>>
>> I am open to other opinions/suggestions.
>
> Writes are NOT always fast.  I've seen cases of write() blocking for
> LONG periods of time on systems that are in the process of failing, or
> just busy.  So I think we certainly want to advertise a wait event for
> those.

+1. I see that quite often, particularly on overloaded VMs. Having to
debug such systems until you notice that what is slow is just the
underlying system is no fun. So being able to surface an event state in
memory saying that things are stuck on I/O, using just the PG interface,
can really help diagnostics without having to look at storage-level logs
for a first impression.
-- 
Michael



Re: [HACKERS] wait events for disk I/O

From
Amit Kapila
Date:
On Tue, Mar 7, 2017 at 5:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
>> Yes, I thought of adding a wait event only for the sync, but then decided to
>> record the wait event for both write and sync. I understand that OS-level
>> writes are cheap, but they still have some cost attached. I also thought that
>> a monitoring tool developed using these wait events would capture more useful
>> data if we try to collect as much information as we can. Or maybe not.
>>
>> I am open to other opinions/suggestions.
>
> Writes are NOT always fast.  I've seen cases of write() blocking for
> LONG periods of time on systems that are in the process of failing, or
> just busy.  So I think we certainly want to advertise a wait event for
> those.
>

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

> Likewise, I agree that the prefetch call probably SHOULDN'T block, but
> just because it shouldn't doesn't mean it never will.
>
> I think somebody should try a pgbench run with this patch applied,
> using a scale factor greater than shared_buffers, and generate a wait
> event profile, just to see if these are showing up and how often.
>

Yeah, that makes sense to me and we should try for both read-write and
read-only tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] wait events for disk I/O

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sure, if you think both Writes and Reads at OS level can have some
> chance of blocking in obscure cases, then we should add a wait event
> for them.

I think writes have a chance of blocking even in cases that are not very
obscure at all.

>> I think somebody should try a pgbench run with this patch applied,
>> using a scale factor greater than shared_buffers, and generate a wait
>> event profile, just to see if these are showing up and how often.
>
> Yeah, that makes sense to me and we should try for both read-write and
> read-only tests.

wfm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] wait events for disk I/O

From
Amit Kapila
Date:
On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sure, if you think both Writes and Reads at OS level can have some
>> chance of blocking in obscure cases, then we should add a wait event
>> for them.
>
> I think writes have a chance of blocking even in cases that are not
> very obscure at all.
>

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events of which only a few are helpful in most real-world scenarios.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] wait events for disk I/O

From
Robert Haas
Date:
On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking even in cases that are not
>> very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events of which only a few are helpful in most real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable.  Yeah, some of those are going to be a lot more
common than others, but so what?  We add wait events so that we can
find out what's going on.  I don't want to sometimes know when a
backend is blocked on an I/O.  I want to ALWAYS know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:


On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking even in cases that are not
>> very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events of which only a few are helpful in most real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable.  Yeah, some of those are going to be a lot more
common than others, but so what?  We add wait events so that we can
find out what's going on.  I don't want to sometimes know when a
backend is blocked on an I/O.  I want to ALWAYS know.


Yes, I agree with Robert. What we want and what we don't want is
difficult to judge. Deciding that something is not useful information,
and later ending up in a situation where we have to re-think and add the
missing events, is not good. Having more information about the system,
especially for monitoring purposes, is always good.

I am attaching another version of the patch, as I found a silly mistake
in the earlier version: I missed initializing the first value of the
WaitEventIO enum. Also, the earlier version no longer applied cleanly to
the current sources.
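
For reference, the fix amounts to anchoring the first enum member to
PG_WAIT_IO so that the events carry the I/O classification (a sketch;
the member names here are illustrative):

    /* Sketch of the corrected enum in pgstat.h; only a few members shown. */
    typedef enum
    {
        WAIT_EVENT_READ_DATA_BLOCK = PG_WAIT_IO,    /* the missing initializer */
        WAIT_EVENT_WRITE_DATA_BLOCK,
        WAIT_EVENT_SYNC_DATA_BLOCK
        /* ... remaining WaitEventIO members ... */
    } WaitEventIO;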



--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] wait events for disk I/O

From
Rajkumar Raghuwanshi
Date:
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
I am attaching another version of the patch, as I found a silly mistake
in the earlier version: I missed initializing the first value of the
WaitEventIO enum. Also, the earlier version no longer applied cleanly to
the current sources.

I applied the attached patch, set shared_buffers to 128kB, and ran pgbench. I am able to see the distinct IO events below.

--with ./pgbench -i -s 500 postgres
 application_name | wait_event_type | wait_event      | query
 pgbench          | IO              | ExtendDataBlock | copy pgbench_account
 pgbench          | IO              | WriteXLog       | copy pgbench_account
 pgbench          | IO              | WriteDataBlock  | copy pgbench_account
 pgbench          | IO              | ReadDataBlock   | vacuum analyze pgben
 pgbench          | IO              | ReadBuffile     | alter table pgbench_

--with ./pgbench -T 600 postgres (readwrite)
 application_name | wait_event_type | wait_event      | query
 pgbench          | IO              | ReadDataBlock   | UPDATE pgbench_accou
 pgbench          | IO              | WriteDataBlock  | UPDATE pgbench_telle
                  | IO              | SyncDataBlock   |
 pgbench          | IO              | SyncDataBlock   | UPDATE pgbench_telle
                  | IO              | SyncDataBlock   | autovacuum: VACUUM A
 pgbench          | IO              | WriteXLog       | END;
 pgbench          | IO              | ExtendDataBlock | copy pgbench_account

--with ./pgbench -T 600 -S postgres (select only)
 application_name | wait_event_type | wait_event      | query
 pgbench          | IO              | ReadDataBlock   | SELECT abalance FROM

Attached is an Excel sheet with all IO event values.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Attachment

Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:
Thanks, Rajkumar, for performing tests on this patch.

Yes, I also noticed similar results in my testing. Additionally, I
sometimes noticed the ReadSLRUPage event on my system.

I also ran the REINDEX DATABASE command and noticed the IO events below:

SyncImmedRelation
WriteDataBlock
WriteBuffile
WriteXLog
ReadDataBlock








--
Rushabh Lathia

Re: [HACKERS] wait events for disk I/O

From
Rajkumar Raghuwanshi
Date:
On Thu, Mar 9, 2017 at 10:54 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Thanks, Rajkumar, for performing tests on this patch.
>
> Yes, I also noticed similar results in my testing. Additionally, I
> sometimes noticed the ReadSLRUPage event on my system.
>
> I also ran the REINDEX DATABASE command and noticed the IO events below:
>
> SyncImmedRelation
> WriteDataBlock
> WriteBuffile
> WriteXLog
> ReadDataBlock

I tried generating some more events by running pgbench on a
master/slave configuration, and was able to see three more events:
WriteInitXLogFile
SyncInitXLogFile
ReadXLog



Re: [HACKERS] wait events for disk I/O

From
Rahila Syed
Date:
Hello,

I applied and tested this patch on the latest sources and it works fine.

Following are some comments:

>+   /* Wait event for SNRU */
>+   WAIT_EVENT_READ_SLRU_PAGE,
Typo in the comment.

>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback(), which can flush more than one block, so should WAIT_EVENT_FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?

Should calls to write() in the following functions be tracked too?

qtext_store() - This is related to pg_stat_statements.

dsm_impl_mmap() - This is in relation to creating dsm segments.

write_auto_conf_file() - This is called when updated configuration
parameters are written to a temp file.

Thank you,
Rahila Syed



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:
Thanks, Rahila, for reviewing this patch.

On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
Hello,

I applied and tested this patch on the latest sources and it works fine.

Following are some comments:

>+   /* Wait event for SNRU */
>+   WAIT_EVENT_READ_SLRU_PAGE,
Typo in the comment.


Fixed.
 
>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback(), which can flush more than one block, so should WAIT_EVENT_FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?


Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.
 
Should calls to write() in the following functions be tracked too?
 qtext_store() - This is related to pg_stat_statements.
 

I am not quite sure about this, as it is for statement stats. Also, apart
from the place you found, there are many other fwrite() calls in
pg_stat_statements, and I intentionally haven't added an event there, as
these are very small stat writes and it doesn't look like we should add
events for those calls.

 
dsm_impl_mmap() - This is in relation to creating dsm segments.


Added a new event here. The particular write call actually zero-fills the DSM file.
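
The write being instrumented is the zero-fill loop, roughly like this (a
simplified sketch of dsm_impl_mmap(); the event name is illustrative, and
fd and request_size come from the surrounding function):

    /* Simplified sketch of the zero-fill loop in dsm_impl_mmap(). */
    char        zbuffer[8192] = {0};
    uint64      remaining = request_size;
    bool        success = true;

    while (success && remaining > 0)
    {
        Size        goal = Min(remaining, sizeof(zbuffer));

        pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
        if (write(fd, zbuffer, goal) == (ssize_t) goal)
            remaining -= goal;
        else
            success = false;
        pgstat_report_wait_end();
    }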
 
write_auto_conf_file() - This is called when updated configuration
parameters are written to a temp file.


write_auto_conf_file() is called during ALTER SYSTEM, and the write
happens only when someone explicitly runs ALTER SYSTEM. This is an
administrator action, so it doesn't seem necessary to add a separate
wait event for it.

PFA the latest patch with the other fixes.
 






Regards,
Rushabh Lathia
www.EnterpriseDB.com
The Enterprise PostgreSQL Company
Attachment

Re: [HACKERS] wait events for disk I/O

From
Rahila Syed
Date:
Thank you for the updated patch.

I have applied and tested it on the latest sources and the patch looks good to me.

>I am not quite sure about this, as it is for statement stats. Also, apart
>from the place you found, there are many other fwrite() calls in
>pg_stat_statements, and I intentionally haven't added an event there, as
>these are very small stat writes and it doesn't look like we should add
>events for those calls.
I agree that this writes a small amount of data, although tracking it can
be useful too in scenarios where pg_stat_statements is lagging due to an
I/O bottleneck. I will leave this decision to the committer.

Thank you,
Rahila Syed


Re: [HACKERS] wait events for disk I/O

From
Robert Haas
Date:
On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Thank you for the updated patch.
>
> I have applied and tested it on the latest sources and the patch looks
> good to me.

The documentation puts the new wait events in a pretty random order.
I think they should be alphabetized, like we do with the IPC events.
I also suggest we change the naming scheme so that the kind of thing
being operated on is first and this is followed by the operation name.
This will let us keep related entries next to each other after
alphabetizing.  So with that principle in mind:

- instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite,
DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch,
DataFileTruncate.  using file instead of block avoids singular/plural
confusion.
- instead of RelationSync and RelationImmedSync I propose
DataFileSync and DataFileImmediateSync; these are md.c operations like
the previous set, so why name it differently?
- instead of WriteRewriteDataBlock and SyncRewriteDataBlock and
TruncateLogicalMappingRewrite, which aren't consistent with each other
even though they are related, I propose LogicalRewriteWrite,
LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer
to the names of the functions that contain those wait points
- ReadBuffile and WriteBuffile seem OK, but I propose BufFileRead and
BufFileWrite, again reversing the order and also tweaking the
capitalization
- in keeping with our new policy of referring to xlog as wal in user
visible interfaces, I propose WALRead, WALCopyRead, WALWrite,
WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync,
WALBootstrapSync, WALSyncMethodAssign
- for control file ops, ControlFileRead, ControlFileWrite,
ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate
- ReadApplyLogicalMapping and friends seem to have to do with the
reorderbuffer code, so maybe ReorderBufferRead etc.
- there seems to be some discrepancy between the documentation and
pgstat_get_wait_io for the snapbuild stuff.  maybe SnapBuildWrite,
SnapBuildSync, SnapBuildRead.
- SLRURead, SLRUWrite, etc.
- TimelineHistoryRead, etc.
- the syslogger changes should be dropped, since the syslogger is not
and should not be connected to shared memory
- the replslot terminology seems like a case of odd capitalization and
overeager abbreviation.  why not ReplicationSlotRead,
ReplicationSlotWrite, etc?  similarly RelationMapRead,
RelationMapWrite, etc?
- CopyFileRead, CopyFileWrite
- LockFileCreateRead, etc.
- AddToDataDirLockFileRead is a little long and incomprehensible;
maybe LockFileUpdateRead etc.
- DSMWriteZeroBytes, maybe?

Of course the constants should be renamed to match.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:
Thanks Robert for the review.

On Thu, Mar 16, 2017 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Thank you for the updated patch.
>
> I have applied and tested it on the latest sources and the patch looks
> good to me.

The documentation puts the new wait events in a pretty random order.
I think they should be alphabetized, like we do with the IPC events.

Done.
 
I also suggest we change the naming scheme so that the kind of thing
being operated on is first and this is followed by the operation name.
This will let us keep related entries next to each other after
alphabetizing.  So with that principle in mind:


Yes, the above naming scheme is clearer than the one I chose.

- instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite,
DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch,
DataFileTruncate.  using file instead of block avoids singular/plural
confusion.
- instead of RelationSync and RelationImmedSync I propose
DataFileSync and DataFileImmediateSync; these are md.c operations like
the previous set, so why name it differently?

Yes, you are right, DataFileSync and DataFileImmediateSync make more
sense.
 
- instead of WriteRewriteDataBlock and SyncRewriteDataBlock and
TruncateLogicalMappingRewrite, which aren't consistent with each other
even though they are related, I propose LogicalRewriteWrite,
LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer
to the names of the functions that contain those wait points
- ReadBuffile and WriteBuffile seem OK, but I propose BufFileRead and
BufFileWrite, again reversing the order and also tweaking the
capitalization
- in keeping with our new policy of referring to xlog as wal in user
visible interfaces, I propose WALRead, WALCopyRead, WALWrite,
WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync,
WALBootstrapSync, WALSyncMethodAssign
- for control file ops, ControlFileRead, ControlFileWrite,
ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate

- ReadApplyLogicalMapping and friends seem to have to do with the
reorderbuffer code, so maybe ReorderBufferRead etc.
- there seems to be some discrepancy between the documentation and
pgstat_get_wait_io for the snapbuild stuff.  maybe SnapBuildWrite,
SnapBuildSync, SnapBuildRead.
- SLRURead, SLRUWrite, etc.
- TimelineHistoryRead, etc.
- the syslogger changes should be dropped, since the syslogger is not
and should not be connected to shared memory

Ok removed.
 
- the replslot terminology seems like a case of odd capitalization and
overeager abbreviation.  why not ReplicationSlotRead,
ReplicationSlotWrite, etc?  similarly RelationMapRead,
RelationMapWrite, etc?
- CopyFileRead, CopyFileWrite
- LockFileCreateRead, etc.
- AddToDataDirLockFileRead is a little long and incomprehensible;
maybe LockFileUpdateRead etc.

How about LockFileAddToDataDirRead? Even though it's a little long, it
gives a clear understanding of what's going on.
 
- DSMWriteZeroBytes, maybe?


DSMFillZeroWrite? Basically, I want to keep the file I/O operation at the
end of the event name.
 
Of course the constants should be renamed to match.

I tried to cover all the suggestions in the attached latest patch.


Thanks,
Rushabh Lathia
Attachment

Re: [HACKERS] wait events for disk I/O

From
Robert Haas
Date:
On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I tried to cover all the suggestions in the attached latest patch.

Committed.  I reworded the documentation entries, renamed a few of the
wait events to make things more consistent, put all three lists in
rigorous alphabetical order, and I fixed a couple of places where an
error return from a system call could lead to returning without
clearing the wait event.
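
The last fix amounts to making sure the event is cleared before any error
return, roughly like this (an illustrative sketch, not the exact committed
diff):

    /* Clear the wait event before the error return, not after it. */
    pgstat_report_wait_start(wait_event_info);
    returnCode = ftruncate(VfdCache[file].fd, offset);
    pgstat_report_wait_end();

    if (returnCode < 0)
        return returnCode;      /* wait event already cleared */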

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] wait events for disk I/O

From
Rushabh Lathia
Date:


On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I tried to cover all the suggestions in the attached latest patch.

Committed.  I reworded the documentation entries, renamed a few of the
wait events to make things more consistent, put all three lists in
rigorous alphabetical order, and I fixed a couple of places where an
error return from a system call could lead to returning without
clearing the wait event.


Thanks Robert.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Rushabh Lathia

Re: [HACKERS] wait events for disk I/O

From
Rajkumar Raghuwanshi
Date:
On Sat, Mar 18, 2017 at 10:52 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>
>
> On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>> > I tried to cover all the suggestions in the attached latest patch.
>>
>> Committed.  I reworded the documentation entries, renamed a few of the
>> wait events to make things more consistent, put all three lists in
>> rigorous alphabetical order, and I fixed a couple of places where an
>> error return from a system call could lead to returning without
>> clearing the wait event.
>>

Thanks. I ran pgbench with shared_buffers set to 128kB and was able to
see the wait IO events below:

DataFileRead
DataFileWrite
WALInitSync
WALInitWrite
WALWrite
DataFileSync
WALRead

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation