[SOLVED?] Re: Disk wait problem... not hardware... - Mailing list pgsql-general
From | pf@pfortin.com |
---|---|
Subject | [SOLVED?] Re: Disk wait problem... not hardware... |
Date | |
Msg-id | 20231029092146.53351f7e@pfortin.com |
In response to | Re: Disk wait problem... may not be hardware... (Jim Mlodgenski <jimmy76@gmail.com>) |
Responses | Re: [SOLVED?] Re: Disk wait problem... not hardware... |
List | pgsql-general |
On Sat, 28 Oct 2023 18:34:50 -0400 Jim Mlodgenski wrote:

Looking like a GOLD star for Jim...

> On Fri, Oct 27, 2023 at 7:46 PM <pf@pfortin.com> wrote:
>
>> Memory: 125.5 GiB of RAM
>>
> It looks like you have a large amount of memory allocated to the server.
>
> But your plans are doing reads instead of pulling things from shared
> buffers.

Assuming you are referring to "Buffers: shared hit=248 read=25022"?

>> explain (analyze, buffers) select count(*) from ncvhis_2016_12_03;
>>  Finalize Aggregate  (cost=404669.65..404669.66 rows=1 width=8) (actual time=844.158..847.309 rows=1 loops=1)
>>    Buffers: shared hit=248 read=25022
>>    ->  Gather  (cost=404669.43..404669.65 rows=2 width=8) (actual time=844.133..847.301 rows=3 loops=1)
>>          Workers Planned: 2
>>          Workers Launched: 2
>>          Buffers: shared hit=248 read=25022
>>          ->  Partial Aggregate  (cost=403669.43..403669.45 rows=1 width=8) (actual time=838.772..838.772 rows=1 loops=3)
>>                Buffers: shared hit=248 read=25022
>>                ->  Parallel Index Only Scan using ncvhis_2016_12_03_voted_party_cd_idx on ncvhis_2016_12_03  (cost=0.44..372735.05 rows=12373755 width=0) (actual time=18.277..592.473 rows=9900389 loops=3)
>>                      Heap Fetches: 0
>>                      Buffers: shared hit=248 read=25022
>>  Planning Time: 0.069 ms
>>  JIT:
>>    Functions: 8
>>    Options: Inlining false, Optimization false, Expressions true, Deforming true
>>    Timing: Generation 0.284 ms, Inlining 0.000 ms, Optimization 0.268 ms, Emission 3.590 ms, Total 4.143 ms
>>  Execution Time: 847.498 ms

>> data/postgresql.conf:
>> max_connections = 100
>> shared_buffers = 128MB

I semi-misquoted; it was running on 128MB when first reported, then on
512MB when I sent this message.

> It looks like you are running with the stock config for shared_buffers.
> With only 128MB dedicated for shared_buffers and such a big database,
> you'll be thrashing the cache pretty hard. With 125GB on the server, try
> upping shared_buffers to something more like 32GB.

WOW! Some tables now have a barely noticeable delay; on the whole, it's
looking MUCH better.

Testing with a just-created copy[1] on the NVMe, it's better; but it's
slowly going back to longer delays. It looks like count(*) may have
easily-missed side effects (more below).

My team members were suggesting a possible motherboard issue, since the
problem happened on both the NVMe M.2 SSD and on a backup SATA-3 platter
drive. Your suggestion arrived just in time, as I was about to wipe the
SSD and re-import everything fresh.

Another thing that bothers me: when I copied the DB with

  cp -a ./var ./var2    # same drive, same partition

replicating it onto free space on the same SSD, the reads were slow
(18-20 KB/sec) while the writes were quick, large blocks; so I was
suspecting something with the SSD... System buffers..?

How autoload row count (or count(*)) can hurt:

The queries which triggered this issue are all the same (WB built-in):

  SELECT count(*) FROM table;    # WB = SQL-workbench/J

These are all static tables. Does PG maintain a table row count so as to
avoid having to count each time?

WB is set up to:
* autoload table row count
* autoload table data (restricted with LIMIT)

Walking through these 8M & 33M row tables (3.8GB & 5.2GB respectively)
must be filling the shared_buffers and thrashing even 32GB...
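On the row-count question: as far as I can tell, PostgreSQL does not keep
an exact live row count per table, but the planner's estimate is stored in
pg_class.reltuples and refreshed by VACUUM/ANALYZE. A minimal sketch of
reading it, using the table from the plan above (an approximation only, and
only as fresh as the last ANALYZE):

  -- Approximate row count from planner statistics; no table scan.
  -- Returns -1 on PostgreSQL 14+ if the table has never been
  -- vacuumed or analyzed.
  SELECT reltuples::bigint AS approx_rows
  FROM pg_class
  WHERE oid = 'ncvhis_2016_12_03'::regclass;

This wouldn't change what WB's "autoload table row count" runs (that is
still the count(*) shown above), but it avoids the full scan for ad-hoc
checks on static tables.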
Ah-ha! Turning off "autoload table row count" has "resolved" this issue.
I can now scroll through all the tables at high speed (hold down the arrow
key to walk through all the tables). Not very useful; but walking through
the tables with only 128MB shared_buffers would have been the ultimate
buffer thrashing... ;p  Even walking through the tables just to collect
their row counts (whether manually or programmatically) would thrash the
cache the same way.

That said, the 32GB shared_buffers may now be masking NVMe drive issues
(one way to check is sketched after this message)...

Thanks!!!,
Pierre

Time and complexity to find the cause of an issue is inversely
proportional to the simplicity of the issue...
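A sketch of one way to check whether the bigger shared_buffers is doing the
work or blocks are still being requested from the drive: the cumulative I/O
counters in pg_statio_user_tables. Note that heap_blks_read only means a
block was not found in shared_buffers; it may still have been served from
the OS page cache rather than the NVMe itself.

  -- Per-table shared_buffers hit ratio since the last stats reset
  -- (pg_stat_reset() starts a fresh baseline).
  SELECT relname,
         heap_blks_read,
         heap_blks_hit,
         round(100.0 * heap_blks_hit
               / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
  FROM pg_statio_user_tables
  ORDER BY heap_blks_read DESC
  LIMIT 10;

A consistently low hit_pct on the large tables would suggest the cache is
still being thrashed; a high hit_pct with lingering delays would point more
toward the storage path.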