Re: Disk wait problem... may not be hardware... - Mailing list pgsql-general

From pf@pfortin.com
Subject Re: Disk wait problem... may not be hardware...
Date
Msg-id 20231026220325.2204a3f6.pfortin@pfortin.com
Whole thread Raw
In response to Re: Disk wait problem...  (pf@pfortin.com)
Responses Re: Disk wait problem... may not be hardware...
Re: Disk wait problem... may not be hardware...
List pgsql-general
On Thu, 26 Oct 2023 15:50:16 -0400 pf@pfortin.com wrote:

>Hi Peter,
>
>All of the following is based on using SQL_workbench/J (WB)  (Mageia Linux
>has not provided a viable pgAdmin4 package); WB is setup to autoload the
>table row count and 10 rows. I'm sticking to one set of files where they
>are all roughly 33.6M rows.
>
>I've been doing a lot of digging and found more interesting info...
>
>At first, I thought the disk was in disk wait the entire time of the
>delay; this _appeared_ to be the case since the system journal logged
>everything about each query, with every query looking the same; no errors.
>
>I was going to provide some logs; but suffice it to say that the logs,
>combined with running iostat every second has been enlightening...
>
>Every time I switch from one table to another, sometimes the results
>display in:
>* less than a half-second
>* one or more seconds
>* up to 5+ minutes.
>So far, none has failed to eventually return; just l..o..n..g delays...
>
>This morning, I've been using iostat:
>$ while true; do echo -n "$(date +%H:%M:%S)  "; iostat /dev/nvme1n1 |
>grep nvme1n1; sleep 1; done
>
>When the response is fast; I see iostat report ~225000 bytes read. On the
>other hand when the query takes a long time to respond, I see the read
>bytes spread over time...  a little every second...  So while htop
>appears to report a steady D (disk wait) status, I now suspect the short
>transfers are fast and most of the time, htop is going to catch the D
>state.
>
>It definitely appears to be a problem with the 4TB NVMe SSD drive which
>utilities say it still has 95% life left[1].  The drive is reporting:
>Error Information Log Entries:      1,177
>which is growing, so I'll be digging into that.  It also reports:
>
>Error Information (NVMe Log 0x01, 16 of 63 entries)
>Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
>  0       1177     0  0xc002  0x4004      -            0     1     -  Invalid Field in Command
>  1       1176     0  0xb001  0x4004      -            0     1     -  Invalid Field in Command
>  2       1175     0  0x801a  0x4004      -            0     1     -  Invalid Field in Command
>...
>This list is not growing, so I'll look into it later than sooner.
>
>A few moments ago, I decided to switch to the backup which resides on a
>17TB mechanical platter drive which is running great.
>
>[1] SSD drives wear out with usage; think of it as a vehicle tire which
>slowly wears out.
>
>Also found that there is a firmware update available from:
>https://www.seagate.com/support/downloads/
>
>No idea if this will solve the problem; but...
>
>Thanks for your suggestions.
>
>Thought I saw there may be instrumentation within PG which I can access;
>suggestions welcome if so...
>
>Cheers,
>Pierre
>
>
>On Thu, 26 Oct 2023 11:50:00 +0200 Peter J. Holzer wrote:
>
>>On 2023-10-23 16:31:30 -0700, Adrian Klaver wrote:  
>>> On 10/23/23 14:55, pf@pfortin.com wrote:    
>>> > Next, I made a query list for 30 tables, turned on timing and pasted the
>>> > queries.  Pasting results inline:
>>> > ncsbe=# select count(*) from ncvhis_2012_12_06;
>>> > select count(*) from ncvhis_2016_12_03;
>>> > select count(*) from ncvhis_2020_08_01;
>>> > select count(*) from ncvhis_2020_10_31;    
>>[...]  
>>> > Time: 72404.786 ms (01:12.405)
>>> > Time: 301.246 ms
>>> > Time: 409.974 ms
>>> > Time: 363.836 ms
>>> > Time: 351.167 ms
>>> > Time: 348.378 ms
>>> > Time: 348.712 ms
>>> > Time: 351.901 ms
>>> > Time: 487.837 ms
>>> > Time: 40987.826 ms (00:40.988)
>>> > Time: 76964.281 ms (01:16.964)
>>> > Time: 483.329 ms
>>> > Time: 18919.267 ms (00:18.919)
>>> > Time: 50896.978 ms (00:50.897)
>>> > Time: 25784.000 ms (00:25.784)
>>> > Time: 45407.599 ms (00:45.408)
>>> > Time: 75171.609 ms (01:15.172)
>>> > Time: 88871.004 ms (01:28.871)
>>> > Time: 128565.127 ms (02:08.565)
>>> > Time: 210248.222 ms (03:30.248)
>>> > Time: 246609.561 ms (04:06.610)   4 minutes!  ;p
>>> > Time: 166406.283 ms (02:46.406)
>>> > Time: 291483.538 ms (04:51.484)  nearly 5 minutes ;p ;p
>>> > Time: 2269.961 ms (00:02.270)
>>> > Time: 413.219 ms
>>> > Time: 433.817 ms
>>> > Time: 395.844 ms
>>> > Time: 7251.637 ms (00:07.252)
>>> > Time: 384.328 ms
>>> > Time: 384.887 ms
>>> > 
>>> > I don't see a pattern in the above; matches the randomness I saw using
>>> > WB...    
>>
>>It would be interesting to see whether there's a difference in plan
>>between the slow and the fast queries.
>>
>>If your experiment above is repeatable, you can just prefix every
>>query with 
>>    explain (analyze, buffers)
>>
>>
>>explain (analyze, buffers) select count(*) from ncvhis_2016_12_03;
>>explain (analyze, buffers) select count(*) from ncvhis_2020_08_01;
>>explain (analyze, buffers) select count(*) from ncvhis_2020_10_31;
>>...
>>
>>If it is hard to reproduce (happens only sometimes), the best way to
>>catch is probably to enable auto_explain
>><https://www.postgresql.org/docs/current/auto-explain.html>, possibly
>>with the auto_explain.log_analyze option and check the logs after it
>>happened. (auto_explain is generally a good idea, IMHO).
>>
>>  
>>> > Before the above finished, I issued this command on another konsole...
>>> > 
>>> > $ while true; do ls -l > /tmp/ll; date; done    
>>
>>This is unlikely to generate noticeable disk waits. The current
>>directory will be in the cache after the first ls and the writes happen
>>asynchroneously.
>>
>>        hp

Now, I'm not sure what's wrong...  I switched to the backup DB and still
seeing disk wait issues...  

Production DB is on PCIe 4TB NVMe SSD Samsung Firecuda 530
(No difference after updating firmware)

Backup DB is on SATA 17TB Seagate ST18000NM000J-2T

I can't think of a common hardware bus/other that would only affect
PostgreSQL disk accesses.

Are there any extra PG low level logs that can be turned on?  I've only
found logging in the journal; that shows nothing unusual, except when
examining the timestamps which show varying query times -- that's the
only clue to slow queries other than the disk wait (D) status and small
transfers seen via iostat... 

Pierre



pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Question regarding the new SQL standard
Next
From: Alexander Rumyantsev
Date:
Subject: Correct way of using complex expressions as partitioning key