Many DataFileRead - IO waits - Mailing list pgsql-performance

From Ben Snaidero
Subject Many DataFileRead - IO waits
Date
Msg-id CAEPE5bNNkFhoPRufSTrPZVA-W+bbu+LgXUQ1OzD9VwMowV5=VA@mail.gmail.com
Responses Re: Many DataFileRead - IO waits  (Michael Lewis <mlewis@entrata.com>)
Re: Many DataFileRead - IO waits  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-performance
I'm running into a strange issue that just popped up on a few servers in my environment, and I was wondering if the community had any insight into what could be causing it.

First, a bit of background. I am running Postgres 10.11 on Windows (but have seen a similar issue on a server running 11.6).
Windows Version:
Major  Minor  Build  Revision
-----  -----  -----  --------
10     0      14393  0  

I have the following query that was running in ~2 ms on average and suddenly jumped to ~25 ms on average. This query is called millions of times per day, and there were cases of it taking 20-30 seconds. Below is the explain analyze of one such example.
When this issue occurs, the server is under some CPU pressure, but even so I would not expect it to get as slow as shown below: we are using SSDs, and none of the Windows disk counters (IOPS, queue length) show any value that would be of concern.
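For anyone wanting to watch for the same regression, the per-statement averages can be tracked with pg_stat_statements (which is loaded on these servers, per the call stacks below). This is only an illustrative query, not the exact monitoring we run; column names are the PostgreSQL 10 ones, and the LIKE pattern on `tabledata` is a placeholder:

```sql
-- Sketch: per-statement timing and read volume from pg_stat_statements
-- (mean_time/max_time are in milliseconds in PostgreSQL 10).
SELECT calls,
       mean_time,          -- average execution time, ms
       max_time,           -- worst case, ms
       shared_blks_read    -- blocks read outside shared_buffers (DataFileRead territory)
FROM pg_stat_statements
WHERE query LIKE '%FROM tabledata WHERE timestamp_date%'
ORDER BY mean_time DESC;
```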

explain (analyze,buffers) SELECT tabledata.uuid_id,tabledata.int_id,tabledata.timestamp_date,tabledata.int_otherid,tabledata.float_value,tabledata.int_otherid2,tabledata.int_otherid3,tabledata.int_rowver
FROM tabledata WHERE timestamp_date <= $1 AND int_otherid3 IN ($2,$3,$4,$5,$6,$7) AND tabledata.int_id=$8 ORDER BY timestamp_date DESC LIMIT 1

QUERY PLAN                                                                      
 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.71..139.67 rows=1 width=84) (actual time=17719.076..17719.077 rows=1 loops=1)
       Buffers: shared hit=12102 read=13259 written=111
       ->  Index Scan Backward using ix_tabledata_intid_timestampdate_intotherid3_intotherid2 on tabledata  (cost=0.71..2112828.54 rows=15204 width=84) (actual time=17719.056..17719.057 rows=1 loops=1)
             Index Cond: ((int_id = 8149) AND (timestamp_date <= '2020-02-24 03:05:00.013'::timestamp without time zone))
             Filter: (int_otherid3 = ANY ('{3ad2b707-a068-42e8-b0f2-6c8570953760,4e1b1bfa-34e1-48df-8cf8-2b59caf076e2,00d394dd-c2f4-4f3a-a8d4-dc208dafa686,baa904a6-8302-4fa3-b8ae-8adce8fe4306,3c99d61b-21a1-42ea-92a8-3cc88d79f3f1,befe0f8b-5911-47b3-bfae-faa9f8b09d08}'::uuid[]))
             Rows Removed by Filter: 91686
             Buffers: shared hit=12102 read=13259 written=111
 Planning time: 203.153 ms
 Execution time: 17719.200 ms
(9 rows)

If I look at pg_stat_activity while the query is running, all of the calls to this query have the same wait event:
wait_event - DataFileRead
wait_event_type - IO
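A quick way to see this across all backends (a generic diagnostic query against PostgreSQL 10's pg_stat_activity, not anything specific to our setup) is:

```sql
-- Count active backends by wait event; during the incident this is
-- dominated by wait_event_type = 'IO', wait_event = 'DataFileRead'.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```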

We took a PerfView trace during the issue, and below is the call stack from a process running this query; two call paths are shown.
---------------------------------------------------------------

Name
 ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiObtainSystemCacheView
||+ ntoskrnl!MmMapViewInSystemCache
|| + ntoskrnl!CcGetVacbMiss
||  + ntoskrnl!CcGetVirtualAddress
||   + ntoskrnl!CcMapAndCopyFromCache
||    + ntoskrnl!CcCopyReadEx
||     + ntfs!NtfsCopyReadA
||     |+ fltmgr!FltpPerformFastIoCall
||     | + fltmgr!FltpPassThroughFastIo
||     |  + fltmgr!FltpFastIoRead
||     |   + ntoskrnl!NtReadFile
||     |    + ntdll!NtReadFile
||     |     + kernelbase!ReadFile
||     |      + msvcr120!_read_nolock
||     |       + msvcr120!_read
||     |        + postgres!PathNameOpenFile
||     |         + postgres!??mdclose
||     |          + postgres!ScheduleBufferTagForWriteback
||     |           + postgres!InitBufTable
||     |            + postgres!??PrefetchBuffer
||     |            |+ postgres!index_getnext_tid
||     |            | + postgres!index_fetch_heap
||     |            |  + postgres!ExecIndexEvalRuntimeKeys
||     |            |   + postgres!ExecAssignScanProjectionInfoWithVarno
||     |            |    + postgres!tupledesc_match
||     |            |     + postgres!recompute_limits
||     |            |      + postgres!CheckValidRowMarkRel
||     |            |       + postgres!list_length
||     |            |        + pg_stat_statements!pgss_ExecutorRun
||     |            |         + postgres!PortalRunFetch
||     |            |          + postgres!PortalStart
||     |            |           + postgres!exec_bind_message
||     |            |            + postgres!PostgresMain
||     |            |             + postgres!BackendInitialize
||     |            |              + postgres!ClosePostmasterPorts
||     |            |               + postgres!main
||     |            |                + postgres!_onexit
||     |            |                 + kernel32!BaseThreadInitThunk
||     |            |                  + ntdll!RtlUserThreadStart


Name
 ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiReleaseSystemCacheView
| + ntoskrnl!MmUnmapViewInSystemCache
|  + ntoskrnl!CcUnmapVacb
|   + ntoskrnl!CcUnmapVacbArray
|    + ntoskrnl!CcGetVirtualAddress
|     + ntoskrnl!CcMapAndCopyFromCache
|      + ntoskrnl!CcCopyReadEx
|       + ntfs!NtfsCopyReadA
|       |+ fltmgr!FltpPerformFastIoCall
|       | + fltmgr!FltpPassThroughFastIo
|       |  + fltmgr!FltpFastIoRead
|       |   + ntoskrnl!NtReadFile
|       |    + ntdll!NtReadFile
|       |    |+ kernelbase!ReadFile
|       |    | + msvcr120!_read_nolock
|       |    |  + msvcr120!_read
|       |    |   + postgres!PathNameOpenFile
|       |    |    + postgres!??mdclose
|       |    |     + postgres!ScheduleBufferTagForWriteback
|       |    |      + postgres!InitBufTable
|       |    |       + postgres!??PrefetchBuffer
|       |    |       |+ postgres!index_getnext_tid
|       |    |       | + postgres!index_fetch_heap
|       |    |       |  + postgres!ExecIndexEvalRuntimeKeys
|       |    |       |   + postgres!ExecAssignScanProjectionInfoWithVarno
|       |    |       |    + postgres!tupledesc_match
|       |    |       |     + postgres!recompute_limits
|       |    |       |      + postgres!CheckValidRowMarkRel
|       |    |       |       + postgres!list_length
|       |    |       |        + pg_stat_statements!pgss_ExecutorRun
|       |    |       |         + postgres!PortalRunFetch
|       |    |       |          + postgres!PortalStart
|       |    |       |           + postgres!exec_bind_message
|       |    |       |            + postgres!PostgresMain
|       |    |       |             + postgres!BackendInitialize
|       |    |       |              + postgres!ClosePostmasterPorts
|       |    |       |               + postgres!main
|       |    |       |                + postgres!_onexit
|       |    |       |                 + kernel32!BaseThreadInitThunk
|       |    |       |                  + ntdll!RtlUserThreadStart



If I do a top-down view (i.e., starting from when the process started, where did we spend the most time?) I get:
Name
 ROOT
+ Process64 postgres (16668) Args:  "--forkbackend" "43216"
 + Thread (16672) CPU=9399ms
 |+ ntdll!RtlUserThreadStart
 ||+ kernel32!BaseThreadInitThunk
 || + postgres!_onexit
 ||  + postgres!main
 ||   + postgres!ClosePostmasterPorts
 ||    + postgres!BackendInitialize
 ||     + postgres!PostgresMain
 ||      + postgres!exec_bind_message
 ||       + postgres!PortalStart
 ||        + postgres!PortalRunFetch
 ||         + pg_stat_statements!pgss_ExecutorRun
 ||          + postgres!list_length
 ||           + postgres!CheckValidRowMarkRel
 ||            + postgres!recompute_limits
 ||             + postgres!tupledesc_match
 ||              + postgres!ExecAssignScanProjectionInfoWithVarno
 ||              |+ postgres!ExecIndexEvalRuntimeKeys
 ||              ||+ postgres!index_fetch_heap
 ||              || + postgres!index_getnext_tid
 ||              || |+ postgres!??PrefetchBuffer
 ||              || ||+ postgres!InitBufTable
 ||              || |||+ postgres!ScheduleBufferTagForWriteback
 ||              || ||||+ postgres!??mdclose
 ||              || |||||+ postgres!PathNameOpenFile
 ||              || ||||||+ msvcr120!_read
 ||              || |||||| + msvcr120!_read_nolock
 ||              || |||||| |+ kernelbase!ReadFile
 ||              || |||||| ||+ ntdll!NtReadFile
 ||              || |||||| || + ntoskrnl!NtReadFile
 ||              || |||||| || |+ fltmgr!FltpFastIoRead
 ||              || |||||| || ||+ fltmgr!FltpPassThroughFastIo
 ||              || |||||| || |||+ fltmgr!FltpPerformFastIoCall
 ||              || |||||| || ||||+ ntfs!NtfsCopyReadA
 ||              || |||||| || |||| + ntoskrnl!CcCopyReadEx
 ||              || |||||| || |||| |+ ntoskrnl!CcMapAndCopyFromCache
 ||              || |||||| || |||| | + ntoskrnl!CcGetVirtualAddress
 ||              || |||||| || |||| | |+ ntoskrnl!CcUnmapVacbArray
 ||              || |||||| || |||| | ||+ ntoskrnl!CcUnmapVacb
 ||              || |||||| || |||| | |||+ ntoskrnl!MmUnmapViewInSystemCache
 ||              || |||||| || |||| | ||| + ntoskrnl!ExAcquireSpinLockExclusive
 ||              || |||||| || |||| | ||| |+ ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
 
 
Also, from this same PerfView trace, the following looks to be the checkpointer process waiting on the same lock:
 
 Name
 ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!ExAcquireSpinLockExclusive
 + ntoskrnl!MiAcquireProperVm
  + ntoskrnl!MiTrimSharedPageFromViews
   + ntoskrnl!MiTrimSection
    + ntoskrnl!MmTrimSection
     + ntoskrnl!CcCoherencyFlushAndPurgeCache
      + ntfs!NtfsFlushUserStream
       + ntfs!NtfsPerformOptimisticFlush
       |+ ntfs!NtfsCommonFlushBuffers
       | + ntfs!NtfsCommonFlushBuffersCallout
       |  + ntoskrnl!KeExpandKernelStackAndCalloutInternal
       |   + ntfs!NtfsCommonFlushBuffersOnNewStack
       |    + ntfs!NtfsFsdFlushBuffers
       |     + fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted
       |      + fltmgr!FltpDispatch
       |       + ntoskrnl!IopSynchronousServiceTail
       |        + ntoskrnl!NtFlushBuffersFileEx
       |         + ntoskrnl!NtFlushBuffersFile
       |          + ntdll!NtFlushBuffersFile
       |          |+ kernelbase!FlushFileBuffers
       |          | + msvcr120!_commit
       |          |  + postgres!FileClose
       |          |   + postgres!mdtruncate
       |          |    + postgres!??ReleaseBuffer
       |          |     + postgres!CreateCheckPoint
       |          |      + postgres!CheckpointerMain
       |          |       + postgres!AuxiliaryProcessMain
       |          |        + postgres!MaxLivePostmasterChildren
       |          |         + postgres!main
       |          |          + postgres!_onexit
       |          |           + kernel32!BaseThreadInitThunk
       |          |            + ntdll!RtlUserThreadStart


To get by, we increased shared_buffers from 500MB to 50GB on this server (and to 10GB on another server), but in my opinion this is just masking the issue. Has anyone in the community seen contention on this lock before, or does anyone have any other insights as to why we would suddenly run into this issue?
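For reference, the workaround can be applied roughly like this (a sketch; shared_buffers only takes effect after a server restart, which ALTER SYSTEM does not perform):

```sql
-- Workaround only: enlarge PostgreSQL's own buffer pool so fewer reads go
-- through the Windows system-cache path shown in the stacks above.
ALTER SYSTEM SET shared_buffers = '50GB';
-- Restart the PostgreSQL service, then verify the new value:
SHOW shared_buffers;
```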


Ben Snaidero
Geotab
Senior Database Specialist
