Thread: [HACKERS] wait events for disk I/O
Hi All,
Attached is the patch, which extend the existing wait event infrastructure to
implement the wait events for the disk I/O. Basically pg_stat_activity's wait
event information to show data about disk I/O as well as IPC primitives.
Implementation details:
- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just before
performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.
Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.
Thanks to my colleague Robert Haas for his help in design.
Please let me know your thought, and thanks for reading.
Thanks,
Attached is the patch, which extend the existing wait event infrastructure to
implement the wait events for the disk I/O. Basically pg_stat_activity's wait
event information to show data about disk I/O as well as IPC primitives.
Implementation details:
- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just before
performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.
Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.
Thanks to my colleague Robert Haas for his help in design.
Please let me know your thought, and thanks for reading.
Thanks,
Rushabh Lathia
Attachment
On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Attached is the patch, which extend the existing wait event infrastructure > to implement the wait events for the disk I/O. Basically pg_stat_activity's > wait event information to show data about disk I/O as well as IPC primitives. > > Implementation details: > > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch, > FileWriteback, FileSync, and FileTruncate. Set this wait event just before > performing the file system operation and clear it just after. > - Pass down an appropriate wait event from caller of any of those > functions. > - Also set and clear a wait event around standalone calls to read(), > write(), fsync() in other parts of the system. > - Added documentation for all newly added wait event. Looks neat, those are unlikely to overlap with other wait events. > Open issue: > - Might missed few standalone calls to read(), write(), etc which need > to pass the wait_event_info. It may be an idea to use LD_PRELOAD with custom versions of read(), write() and fsync(), and look at the paths where no flags are set in MyProc->wait_event_info, and log information when that happens. > Thanks to my colleague Robert Haas for his help in design. > Please let me know your thought, and thanks for reading. Did you consider a wrapper of the type pg_read_event() or pg_write_event(), etc.? -- Michael
On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attached is the patch, which extend the existing wait event infrastructure
> to implement the wait events for the disk I/O. Basically pg_stat_activity's
> wait event information to show data about disk I/O as well as IPC primitives.
>
> Implementation details:
>
> - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
> - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
> FileWriteback, FileSync, and FileTruncate. Set this wait event just before
> performing the file system operation and clear it just after.
> - Pass down an appropriate wait event from caller of any of those
> functions.
> - Also set and clear a wait event around standalone calls to read(),
> write(), fsync() in other parts of the system.
> - Added documentation for all newly added wait event.
Looks neat, those are unlikely to overlap with other wait events.
Thanks.
> Open issue:
> - Might missed few standalone calls to read(), write(), etc which need
> to pass the wait_event_info.
It may be an idea to use LD_PRELOAD with custom versions of read(),
write() and fsync(), and look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.
Yes, may be I will try this.
> Thanks to my colleague Robert Haas for his help in design.
> Please let me know your thought, and thanks for reading.
Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?
I thought on that, but eventually stick to this approach as it looks
more neat and uniform with other wait event implementation.
--
Michael
Rushabh Lathia
On Tue, Jan 31, 2017 at 11:09 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attached is the patch, which extend the existing wait event infrastructure
> to implement the wait events for the disk I/O. Basically pg_stat_activity's
> wait event information to show data about disk I/O as well as IPC primitives.
>
> Implementation details:
>
> - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
> - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
> FileWriteback, FileSync, and FileTruncate. Set this wait event just before
> performing the file system operation and clear it just after.
> - Pass down an appropriate wait event from caller of any of those
> functions.
> - Also set and clear a wait event around standalone calls to read(),
> write(), fsync() in other parts of the system.
> - Added documentation for all newly added wait event.
Looks neat, those are unlikely to overlap with other wait events.Thanks.
> Open issue:
> - Might missed few standalone calls to read(), write(), etc which need
> to pass the wait_event_info.
It may be an idea to use LD_PRELOAD with custom versions of read(),
write() and fsync(), and look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.Yes, may be I will try this.> Thanks to my colleague Robert Haas for his help in design.
> Please let me know your thought, and thanks for reading.
Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?I thought on that, but eventually stick to this approach as it looksmore neat and uniform with other wait event implementation.--
MichaelThanks,Rushabh Lathia
--
Rushabh Lathia
Attachment
On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > My colleague Rahila reported compilation issue with > the patch. Issue was only coming with we do the clean > build on the branch. > > Fixed the same into latest version of patch. > Few assorted comments: 1. + <row> + <entry><literal>WriteBuffile</></entry> + <entry>Waiting during buffile read operation.</entry> + </row> Operation name and definition are not matching. 2. +FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info){#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)int returnCode; @@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount) if (returnCode < 0) return returnCode; + pgstat_report_wait_start(wait_event_info); returnCode = posix_fadvise(VfdCache[file].fd, offset, amount, POSIX_FADV_WILLNEED); + pgstat_report_wait_end(); AFAIK, this call is non-blocking and will just initiate a read, so do you think we should record wait event for such a call. 3. - written = FileWrite(src->vfd, waldata_start, len); + written = FileWrite(src->vfd, waldata_start, len, + WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK); if (written != len) ereport(ERROR, (errcode_for_file_access(), @@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state) hash_seq_init(&seq_status, state->rs_logical_mappings);while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL) { - if (FileSync(src->vfd) != 0) + if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0) Do we want to consider recording wait event for both write and sync? It seems to me OS level writes are relatively cheap and sync calls are expensive, so shouldn't we just log for sync calls? I could see the similar usage at multiple places in the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 4, 2017 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>
> My colleague Rahila reported compilation issue with
> the patch. Issue was only coming with we do the clean
> build on the branch.
>
> Fixed the same into latest version of patch.
>
Few assorted comments:
1.
+ <row>
+ <entry><literal>WriteBuffile</></entry>
+ <entry>Waiting during
buffile read operation.</entry>
+ </row>
Operation name and definition are not matching.
Will fix this.
2.
+FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
{
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
int returnCode;
@@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount)
if (returnCode < 0)
return returnCode;
+ pgstat_report_wait_start(wait_event_info);
returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();
AFAIK, this call is non-blocking and will just initiate a read, so do
you think we should record wait event for such a call.
Yes, prefecth call is a non-blocking and will just initiate a read. But this info
about the prefetch into wait events will give more info about the system.
3.
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len,
+ WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
@@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state)
hash_seq_init(&seq_status, state->rs_logical_mappings);
while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
{
- if (FileSync(src->vfd) != 0)
+ if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0)
Do we want to consider recording wait event for both write and sync?
It seems to me OS level writes are relatively cheap and sync calls are
expensive, so shouldn't we just log for sync calls? I could see the
similar usage at multiple places in the patch.
Yes, I thought of adding wait event only for the sync but then recording the
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.
I am open for other opinion/suggestions.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Rushabh Lathia
On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Yes, I thought of adding wait event only for the sync but then recording the > wait event for both write and sync. I understand that OS level writes are > cheap but it still have some cost attached to that. Also I thought for the > monitoring tool being develop using this wait events, will have more useful > capture data if we try to collect as much info as we can. Or may be not. > > I am open for other opinion/suggestions. Writes are NOT always fast. I've seen cases of write() blocking for LONG periods of time on systems that are in the process of failing, or just busy. So I think we certainly want to advertise a wait event for those. Likewise, I agree that the prefetch call probably SHOULDN'T block, but just because it shouldn't doesn't mean it never will. I think somebody should try a pgbench run with this patch applied, using a scale factor greater than shared_buffers, and generate a wait event profile, just to see if these are showing up and how often. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 7, 2017 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> Yes, I thought of adding wait event only for the sync but then recording the >> wait event for both write and sync. I understand that OS level writes are >> cheap but it still have some cost attached to that. Also I thought for the >> monitoring tool being develop using this wait events, will have more useful >> capture data if we try to collect as much info as we can. Or may be not. >> >> I am open for other opinion/suggestions. > > Writes are NOT always fast. I've seen cases of write() blocking for > LONG periods of time on systems that are in the process of failing, or > just busy. So I think we certainly want to advertise a wait event for > those. +1. I see that quite often really happens, and a lot particularly on overloaded VMs. Having to debug such systems until you notice that what is slow is just the underlying system is no fun. So being able to show up an event state in memory where we can say that things are stuck on I/O just with PG interface can really help in diagnostics without having to look at storage-level logs for a first impression. -- Michael
On Tue, Mar 7, 2017 at 5:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> Yes, I thought of adding wait event only for the sync but then recording the >> wait event for both write and sync. I understand that OS level writes are >> cheap but it still have some cost attached to that. Also I thought for the >> monitoring tool being develop using this wait events, will have more useful >> capture data if we try to collect as much info as we can. Or may be not. >> >> I am open for other opinion/suggestions. > > Writes are NOT always fast. I've seen cases of write() blocking for > LONG periods of time on systems that are in the process of failing, or > just busy. So I think we certainly want to advertise a wait event for > those. > Sure, if you think both Writes and Reads at OS level can have some chance of blocking in obscure cases, then we should add a wait event for them. > Likewise, I agree that the prefetch call probably SHOULDN'T block, but > just because it shouldn't doesn't mean it never will. > > I think somebody should try a pgbench run with this patch applied, > using a scale factor greater than shared_buffers, and generate a wait > event profile, just to see if these are showing up and how often. > Yeah, that makes sense to me and we should try for both read-write and read-only tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Sure, if you think both Writes and Reads at OS level can have some > chance of blocking in obscure cases, then we should add a wait event > for them. I think writes have a chance of blocking in cases even in cases that are not very obscure at all. >> I think somebody should try a pgbench run with this patch applied, >> using a scale factor greater than shared_buffers, and generate a wait >> event profile, just to see if these are showing up and how often. > > Yeah, that makes sense to me and we should try for both read-write and > read-only tests. wfm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Sure, if you think both Writes and Reads at OS level can have some >> chance of blocking in obscure cases, then we should add a wait event >> for them. > > I think writes have a chance of blocking in cases even in cases that > are not very obscure at all. > Point taken for writes, but I think in general we should have some criteria based on which we can decide whether to have a wait event for a particular call. It should not happen that we have tons of wait events and out of which, only a few are helpful in most of the cases in real-world scenarios. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Sure, if you think both Writes and Reads at OS level can have some >>> chance of blocking in obscure cases, then we should add a wait event >>> for them. >> >> I think writes have a chance of blocking in cases even in cases that >> are not very obscure at all. > > Point taken for writes, but I think in general we should have some > criteria based on which we can decide whether to have a wait event for > a particular call. It should not happen that we have tons of wait > events and out of which, only a few are helpful in most of the cases > in real-world scenarios. Well, the problem is that if you pick and choose which wait events to add based on what you think will be common, you're actually kind of hosing yourself. Because now when something uncommon happens, suddenly you don't get any wait event data and you can't tell what's happening. I think the number of new wait events added by Rushabh's patch is wholly reasonable. Yeah, some of those are going to be a lot more common than others, but so what? We add wait events so that we can find out what's going on. I don't want to sometimes know when a backend is blocked on an I/O. I want to ALWAYS know. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking in cases even in cases that
>> are not very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events and out of which, only a few are helpful in most of the cases
> in real-world scenarios.
Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.
Yes, I agree with Robert. Knowing what we want and what we don't
want is difficult to judge. Something which we might think its not useful
information, and later of end up into situation where we re-think about
adding those missing stuff is not good. Having more information about
the system, specially for monitoring purpose is always good.
I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
--
Attachment
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
I have applied attached patch, set shared_buffers to 128kB and ran pgbench, I am able to see below distinct IO events.
--with ./pgbench -i -s 500 postgres
application_name wait_event_type wait_event query
pgbench IO ExtendDataBlock copy pgbench_account
pgbench IO WriteXLog copy pgbench_account
pgbench IO WriteDataBlock copy pgbench_account
pgbench IO ReadDataBlock vacuum analyze pgben
pgbench IO ReadBuffile alter table pgbench_
--with ./pgbench -T 600 postgres (readwrite)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock UPDATE pgbench_accou
pgbench IO WriteDataBlock UPDATE pgbench_telle
IO SyncDataBlock
pgbench IO SyncDataBlock UPDATE pgbench_telle
IO SyncDataBlock autovacuum: VACUUM A
pgbench IO WriteXLog END;
pgbench IO ExtendDataBlock copy pgbench_account
--with ./pgbench -T 600 -S postgres (select only)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock SELECT abalance FROM
I am attaching another version of the patch, as I found stupid mistakein the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
I have applied attached patch, set shared_buffers to 128kB and ran pgbench, I am able to see below distinct IO events.
--with ./pgbench -i -s 500 postgres
application_name wait_event_type wait_event query
pgbench IO ExtendDataBlock copy pgbench_account
pgbench IO WriteXLog copy pgbench_account
pgbench IO WriteDataBlock copy pgbench_account
pgbench IO ReadDataBlock vacuum analyze pgben
pgbench IO ReadBuffile alter table pgbench_
--with ./pgbench -T 600 postgres (readwrite)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock UPDATE pgbench_accou
pgbench IO WriteDataBlock UPDATE pgbench_telle
IO SyncDataBlock
pgbench IO SyncDataBlock UPDATE pgbench_telle
IO SyncDataBlock autovacuum: VACUUM A
pgbench IO WriteXLog END;
pgbench IO ExtendDataBlock copy pgbench_account
--with ./pgbench -T 600 -S postgres (select only)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock SELECT abalance FROM
Attached excel with all IO event values.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB CorporationAttachment
Thanks Rajkumar for performing tests on this patch.
Yes, I also noticed similar results in my testing. Additionally sometime I alsonoticed ReadSLRUPage event on my system.
I also run the reindex database command and I notices below IO events.
SyncImmedRelation,
WriteDataBlock
WriteBuffile,
WriteXLog,
ReadDataBlock
On Wed, Mar 8, 2017 at 6:41 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:I am attaching another version of the patch, as I found stupid mistakein the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
I have applied attached patch, set shared_buffers to 128kB and ran pgbench, I am able to see below distinct IO events.
--with ./pgbench -i -s 500 postgres
application_name wait_event_type wait_event query
pgbench IO ExtendDataBlock copy pgbench_account
pgbench IO WriteXLog copy pgbench_account
pgbench IO WriteDataBlock copy pgbench_account
pgbench IO ReadDataBlock vacuum analyze pgben
pgbench IO ReadBuffile alter table pgbench_
--with ./pgbench -T 600 postgres (readwrite)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock UPDATE pgbench_accou
pgbench IO WriteDataBlock UPDATE pgbench_telle
IO SyncDataBlock
pgbench IO SyncDataBlock UPDATE pgbench_telle
IO SyncDataBlock autovacuum: VACUUM A
pgbench IO WriteXLog END;
pgbench IO ExtendDataBlock copy pgbench_account
--with ./pgbench -T 600 -S postgres (select only)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock SELECT abalance FROMAttached excel with all IO event values.Thanks & Regards,Rajkumar RaghuwanshiQMG, EnterpriseDB Corporation
--
Rushabh Lathia
On Thu, Mar 9, 2017 at 10:54 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Thanks Rajkumar for performing tests on this patch. > > Yes, I also noticed similar results in my testing. Additionally sometime I > also > noticed ReadSLRUPage event on my system. > > I also run the reindex database command and I notices below IO events. > > SyncImmedRelation, > WriteDataBlock > WriteBuffile, > WriteXLog, > ReadDataBlock I have tried for generating some more events, by running pgbench on master/slave configuration, able to see three more events. WriteInitXLogFile SyncInitXLogFile ReadXLog
Hello,
I applied and tested this patch on latest sources and it works fine.>+ /* Wait event for SNRU */
>+ WAIT_EVENT_READ_SLRU_PAGE,
>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?
Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements
dsm_impl_mmap() - This is in relation to creating dsm segments.
write_auto_conf_file()- This is called when updated configuration parameters are
qtext_store() - This is related to pg_stat_statements
dsm_impl_mmap() - This is in relation to creating dsm segments.
write_auto_conf_file()- This is called when updated configuration parameters are
written to a temp file.
Thank you,
Rahila Syed
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking in cases even in cases that
>> are not very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events and out of which, only a few are helpful in most of the cases
> in real-world scenarios.
Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.Yes, I agree with Robert. Knowing what we want and what we don'twant is difficult to judge. Something which we might think its not usefulinformation, and later of end up into situation where we re-think aboutadding those missing stuff is not good. Having more information aboutthe system, specially for monitoring purpose is always good.I am attaching another version of the patch, as I found stupid mistakein the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
--Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks Rahila for reviewing this patch.
On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:
Typo in the comment.Following are some comments,Hello,I applied and tested this patch on latest sources and it works fine.
>+ /* Wait event for SNRU */
>+ WAIT_EVENT_READ_SLRU_PAGE,
Fixed.
>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?
Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.
Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements
I am not quite sure about this, as this is for stat statements. Also part from the
place you found there are many other fwrite() call into pg_stat_statements, and
I intentionally haven't added event here as it is very small write about stat, and
doesn't look like we should add for those call.
dsm_impl_mmap() - This is in relation to creating dsm segments.
Added new event here. Actually particular write call is zero-filling the DSM file.
write_auto_conf_file()- This is called when updated configuration parameters arewritten to a temp file.
write_auto_conf_file() is getting called during the ALTER SYSTEM call. Here write
happen only when someone explicitly run the ALTER SYSTEM call. This is
happen only when someone explicitly run the ALTER SYSTEM call. This is
administrator call and so doesn't seem like necessary to add separate wait event
for this.
PFA latest patch with other fixes.
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:--On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking in cases even in cases that
>> are not very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events and out of which, only a few are helpful in most of the cases
> in real-world scenarios.
Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.Yes, I agree with Robert. Knowing what we want and what we don'twant is difficult to judge. Something which we might think its not usefulinformation, and later of end up into situation where we re-think aboutadding those missing stuff is not good. Having more information aboutthe system, specially for monitoring purpose is always good.I am attaching another version of the patch, as I found stupid mistakein the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
--Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Regards,
Rushabh Lathia
Attachment
Thank you for the updated patch.
I have applied and tested it on latest sources and the patch looks good to me.>I am not quite sure about this, as this is for stat statements. Also part from the
>place you found there are many other fwrite() call into pg_stat_statements, and
>I intentionally haven't added event here as it is very small write about stat, and
>doesn't look like we should add for those call.
I agree that this writes less amount of data. Although tracking this canbe useful too in scenarios where pg_stat_statements is lagging due to I/O bottleneck.
I will leave this decision to the committer.
Thank you,
Rahila Syed
On Wed, Mar 15, 2017 at 1:03 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
Thanks Rahila for reviewing this patch.On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:Typo in the comment.Following are some comments,Hello,I applied and tested this patch on latest sources and it works fine.
>+ /* Wait event for SNRU */
>+ WAIT_EVENT_READ_SLRU_PAGE,Fixed.
>FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.
Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements
I am not quite sure about this, as this is for stat statements. Also part from theplace you found there are many other fwrite() call into pg_stat_statements, andI intentionally haven't added event here as it is very small write about stat, anddoesn't look like we should add for those call.dsm_impl_mmap() - This is in relation to creating dsm segments.Added new event here. Actually particular write call is zero-filling the DSM file.
write_auto_conf_file()- This is called when updated configuration parameters arewritten to a temp file. write_auto_conf_file() is getting called during the ALTER SYSTEM call. Here write
happen only when someone explicitly run the ALTER SYSTEM call. This isadministrator call and so doesn't seem like necessary to add separate wait eventfor this.PFA latest patch with other fixes.
On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:--On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Sure, if you think both Writes and Reads at OS level can have some
>>> chance of blocking in obscure cases, then we should add a wait event
>>> for them.
>>
>> I think writes have a chance of blocking in cases even in cases that
>> are not very obscure at all.
>
> Point taken for writes, but I think in general we should have some
> criteria based on which we can decide whether to have a wait event for
> a particular call. It should not happen that we have tons of wait
> events and out of which, only a few are helpful in most of the cases
> in real-world scenarios.
Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.Yes, I agree with Robert. Knowing what we want and what we don'twant is difficult to judge. Something which we might think its not usefulinformation, and later of end up into situation where we re-think aboutadding those missing stuff is not good. Having more information aboutthe system, specially for monitoring purpose is always good.I am attaching another version of the patch, as I found stupid mistakein the earlier version of patch, where I missed to initialize initial value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.
--Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers Regards,Rushabh Lathia
On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasyed90@gmail.com> wrote: > Thank you for the updated patch. > > I have applied and tested it on latest sources and the patch looks good to > me. The documentation puts the new wait events in a pretty random order. I think they should be alphabetized, like we do with the IPC events. I also suggest we change the naming scheme so that the kind of thing being operated on is first and this is followed by the operation name. This will let us keep related entries next to each other after alphabetizing. So with that principle in mind: - instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite, DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch, DataFileTruncate. using file instead of block avoids singular/plural confusion. - instead of RelationSync and RelationImmedSync I proposed DataFileSync and DataFileImmediateSync; these are md.c operations like the previous set, so why name it differently? - instead of WriteRewriteDataBlock and SyncRewriteDataBlock and TruncateLogicalMappingRewrite, which aren't consistent with each other even though they are related, I propose LogicalRewriteWrite, LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer to the names of the functions that contain those wait points - for ReadBuffile and WriteBuffile seem OK, I propose BufFileRead and BufFileWrite, again reversing the order and also tweaking the capitalization - in keeping with our new policy of referring to xlog as wal in user visible interfaces, I propose WALRead, WALCopyRead, WALWrite, WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync, WALBootstrapSync, WALSyncMethodAssign - for control file ops, ControlFileRead, ControlFileWrite, ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate - ReadApplyLogicalMapping and friends seem to have to do with the reorderbuffer code, so maybe ReorderBufferRead etc. - there seems to be some discrepancy between the documentation and pgstat_get_wait_io for the snapbuild stuff. maybe SnapBuildWrite, SnapBuildSync, SnapBuildRead. - SLRURead, SLRUWrite, etc. - TimelineHistoryRead, etc. - the syslogger changes should be dropped, since the syslogger is not and should not be connected to shared memory - the replslot terminology seems like a case of odd capitalization and overeager abbreviation. why not ReplicationSlotRead, ReplicationSlotWrite, etc? similarly RelationMapRead, RelationMapWrite, etc? - CopyFileRead, CopyFileWrite - LockFileCreateRead, etc. - AddToDataDirLockFileRead is a little long and incomprehensible; maybe LockFileUpdateRead etc. - DSMWriteZeroBytes, maybe? Of course the constants should be renamed to match. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thanks Robert for the review.
On Thu, Mar 16, 2017 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
DSMFillZeroWrite? Basically want to keep the file IP operation at the end of
On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:
> Thank you for the updated patch.
>
> I have applied and tested it on latest sources and the patch looks good to
> me.
The documentation puts the new wait events in a pretty random order.
I think they should be alphabetized, like we do with the IPC events.
Done.
I also suggest we change the naming scheme so that the kind of thing
being operated on is first and this is followed by the operation name.
This will let us keep related entries next to each other after
alphabetizing. So with that principle in mind:
Yes above naming scheme is more clear then the one i choose.
- instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite,
DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch,
DataFileTruncate. using file instead of block avoids singular/plural
confusion.
- instead of RelationSync and RelationImmedSync I proposed
DataFileSync and DataFileImmediateSync; these are md.c operations like
the previous set, so why name it differently?
Yes, you are right, DataFileSync and DataFileImmediateSync make more
sense.
- instead of WriteRewriteDataBlock and SyncRewriteDataBlock and
TruncateLogicalMappingRewrite, which aren't consistent with each other
even though they are related, I propose LogicalRewriteWrite,
LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer
to the names of the functions that contain those wait points
- for ReadBuffile and WriteBuffile seem OK, I propose BufFileRead and
BufFileWrite, again reversing the order and also tweaking the
capitalization
- in keeping with our new policy of referring to xlog as wal in user
visible interfaces, I propose WALRead, WALCopyRead, WALWrite,
WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync,
WALBootstrapSync, WALSyncMethodAssign
- for control file ops, ControlFileRead, ControlFileWrite,
ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate
- ReadApplyLogicalMapping and friends seem to have to do with the
reorderbuffer code, so maybe ReorderBufferRead etc.
- there seems to be some discrepancy between the documentation and
pgstat_get_wait_io for the snapbuild stuff. maybe SnapBuildWrite,
SnapBuildSync, SnapBuildRead.
- SLRURead, SLRUWrite, etc.
- TimelineHistoryRead, etc.
- the syslogger changes should be dropped, since the syslogger is not
and should not be connected to shared memory
Ok removed.
- the replslot terminology seems like a case of odd capitalization and
overeager abbreviation. why not ReplicationSlotRead,
ReplicationSlotWrite, etc? similarly RelationMapRead,
RelationMapWrite, etc?
- CopyFileRead, CopyFileWrite
- LockFileCreateRead, etc.
- AddToDataDirLockFileRead is a little long and incomprehensible;
maybe LockFileUpdateRead etc.
How about LockFileAddToDataDirRead? even though its little long but it
gives clear understanding about what's going on.
- DSMWriteZeroBytes, maybe?
DSMFillZeroWrite? Basically want to keep the file IP operation at the end of
the event name.
Of course the constants should be renamed to match.
I tried to cover all the suggestion in the attached latest patch.
Thanks,
Rushabh Lathia
Rushabh Lathia
Attachment
On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I tried to cover all the suggestion in the attached latest patch. Committed. I reworded the documentation entries, renamed a few of the wait events to make things more consistent, put all three lists in rigorous alphabetical order, and I fixed a couple of places where an error return from a system call could lead to returning without clearing the wait event. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I tried to cover all the suggestion in the attached latest patch.
Committed. I reworded the documentation entries, renamed a few of the
wait events to make things more consistent, put all three lists in
rigorous alphabetical order, and I fixed a couple of places where an
error return from a system call could lead to returning without
clearing the wait event.
Thanks Robert.
Rushabh Lathia
On Sat, Mar 18, 2017 at 10:52 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > > On Sat, Mar 18, 2017 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Fri, Mar 17, 2017 at 10:01 AM, Rushabh Lathia >> <rushabh.lathia@gmail.com> wrote: >> > I tried to cover all the suggestion in the attached latest patch. >> >> Committed. I reworded the documentation entries, renamed a few of the >> wait events to make things more consistent, put all three lists in >> rigorous alphabetical order, and I fixed a couple of places where an >> error return from a system call could lead to returning without >> clearing the wait event. >> Thanks, I ran pgbench with shared_buffers set to 128kB, able to see below wait IO events. DataFileRead DataFileWrite WALInitSync WALInitWrite WALWrite DataFileSync WALRead Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation