Re: Dynamic gathering the values for seq_page_cost/xxx_cost - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Dynamic gathering the values for seq_page_cost/xxx_cost
Date
Msg-id CAKU4AWpSvKu+Oeo8rpDfCSPCzgSK2GB8MVm5cGwkHQ37eO8-Og@mail.gmail.com
In response to Re: Dynamic gathering the values for seq_page_cost/xxx_cost  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Dynamic gathering the values for seq_page_cost/xxx_cost  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Re: Dynamic gathering the values for seq_page_cost/xxx_cost  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
Hi Tomas:
   Thanks for checking.  

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>I recently tried something in this direction and the result looks
>promising based on my limited test.
>
>Since the unit of the other xxx_cost values is "seq_page_cost", how to detect
>seq_page_cost is important. In the cost model, the IO cost of a seqscan is
>rel->pages * seq_page_cost; it doesn't consider any cache (file system cache
>or shared buffer cache), but it does assume the OS will prefetch the IO. So
>to detect seq_page_cost, I enabled prefetch but avoided the file system
>cache. I tested this by 1) dropping the file system cache, and 2) opening
>the test file without O_DIRECT so that prefetch can work.
>
>To detect the random page read cost, I read the file with pread at random
>offsets. Since the random offsets may repeat during the test, dropping the
>file system cache only at the beginning doesn't work, so I open the file
>with the O_DIRECT option.
>
>I also measured the cost of reading a page from the file system cache;
>during my test it is about 10% of a sequential read.
>
>After I get the basic numbers about the hardware capability, I let the user
>provide a cache hit ratio (this is a place where we can further improve, if
>this is the right direction).
>
>Here is the test result on my hardware.
>
>fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
>
>cache hit ratio: 1.000000 random_page_cost 1.000000
>cache hit ratio: 0.900000 random_page_cost 5.073692
>cache hit ratio: 0.500000 random_page_cost 7.957589
>cache hit ratio: 0.100000 random_page_cost 8.551591
>cache hit ratio: 0.000000 random_page_cost 8.633049
>
>
>Then I tested the suggested value with the 10GB TPCH workload. I compared
>the plans under 2 different settings of random_page_cost: 1) 4, the default
>value, and 2) 8.6, the value for cache hit ratio = 0. 11 out of the 22
>queries generated a different plan. Then I dropped the cache (both the file
>system cache and shared_buffers) before running each query and ran the 11
>queries under the 2 settings. The execution times are below.
>
>
>|     | random_page_cost=4 | random_page_cost=8.6 |
>|-----+--------------------+----------------------|
>| Q1  |           1425.964 |             1121.928 |
>| Q2  |           2553.072 |             2567.450 |
>| Q5  |           4397.514 |             1475.343 |
>| Q6  |          12576.985 |             4622.503 |
>| Q7  |           3459.777 |             2987.241 |
>| Q8  |           8360.995 |             8415.311 |
>| Q9  |           4661.842 |             2930.370 |
>| Q11 |           4885.289 |             2348.541 |
>| Q13 |           2610.937 |             1497.776 |
>| Q20 |          13218.122 |            10985.738 |
>| Q21 |            264.639 |              262.350 |
>
>
>The attached main.c is the program I used to detect random_page_cost.
>result.tar.gz is the test result; you can run a git log first to see the
>differences in plans and execution stats.
>
>Any feedback is welcome. Thanks!
>

>That seems pretty neat. What kind of hardware have you done these tests on?

The following is my hardware info.

I have 12 SSDs behind an MR9271-8i RAID controller which has a 1GB buffer. [1]

root#  lshw -short -C disk
H/W path            Device          Class          Description
==============================================================
/0/100/2/0/2.0.0    /dev/sda        disk           2398GB MR9271-8i
/0/100/2/0/2.1.0    /dev/sdb        disk           5597GB MR9271-8i <-- my data location


/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0

Memory Size      : 1024MB
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0


root# /opt/MegaRAID/MegaCli/MegaCli64  -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

>It's probably worth testing on various other storage systems to see how
>that applies to those.

Yes, I can test more once I get new hardware; that is still in progress.
However, I can currently only get a physical machine with SSD or a virtual
machine with SSD; other storage types are hard for me to get right now.


>Have you tried existing I/O testing tools, e.g. fio? If your idea is to
>propose some built-in tool (similar to pg_test_fsync) then we probably
>should not rely on external tools, but I wonder if we're getting the
>same numbers.

Thanks for this hint; I found some more interesting things during the comparison.

I defined the fio jobs as below.

random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread


Even though it is direct IO, the device cache still plays an important part;
it gets filled while preparing the test data file. I invalidated the device
cache by writing a new dummy file, and after that the average latency is 148 us.


seq.job

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For the seq read, we need buffered IO so that prefetch works; however, we also
need to bypass the file system cache and the device cache. fio has no control
over such caches, so I did the following:

1). Run fio to generate the test file.
2). Invalidate the device cache with dd if=/dev/zero of=a_dummy_file bs=1048576 count=1024.
3). Drop the file system cache.
4). Run fio again.

The final avg latency is ~12 us.
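
Step 3) is the usual "sync; echo 3 > /proc/sys/vm/drop_caches". For reference,
a minimal C sketch of the same thing (assuming Linux and root; this is not
taken from the attached programs):

/* Drop the Linux file system (page) cache, equivalent to
 * "sync; echo 3 > /proc/sys/vm/drop_caches".  Needs root. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int drop_fs_cache(void)
{
    sync();                         /* flush dirty pages first */

    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0) { perror("open /proc/sys/vm/drop_caches"); return -1; }
    if (write(fd, "3", 1) != 1) { perror("write"); close(fd); return -1; }

    close(fd);
    return 0;
}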

This is a 1.5x ~ 2x difference from my previous result (seq_read_lat = 8.570290us,
random_page_lat = 73.987732us).

Here is a comparison after some changes to my detection program:

|                                                 | seq_read_lat (us) | random_read_lat (us) |
|-------------------------------------------------+-------------------+----------------------|
| FIO                                             |                12 |                  148 |
| Previous main.c                                 |               8.5 |                   74 |
| invalidate_device_cache before each testing     |                 9 |                  150 |
| prepare the test data file with O_DIRECT option |                15 |                  150 |


In invalidate_device_cache, I just create another 1GB data file and read it
(see the invalidate_device_cache function); this is similar to the previous fio setup.
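
Roughly, such a function only needs to push about a controller-cache-sized
amount of other data through the device. A minimal sketch of the idea
(assumptions: a 1GB scratch file and POSIX_FADV_DONTNEED to make the read-back
actually hit the device; the function in the attached main.c may differ in
details):

/* Push ~1GB of other data through the device so that the RAID card's 1GB
 * cache no longer holds the test file's blocks. */
#define _XOPEN_SOURCE 600       /* posix_fadvise */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)   /* 1MB per write/read */
#define NCHUNKS 1024            /* 1024 x 1MB = 1GB, the controller cache size */

static void invalidate_device_cache(const char *scratch_path)
{
    char *buf = calloc(1, CHUNK);
    int fd = open(scratch_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || buf == NULL) { perror("invalidate_device_cache"); exit(1); }

    /* Write 1GB and force it down to the device. */
    for (int i = 0; i < NCHUNKS; i++)
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); exit(1); }
    fsync(fd);

    /* Drop these pages from the OS cache so the read-back reaches the device. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    lseek(fd, 0, SEEK_SET);
    for (int i = 0; i < NCHUNKS; i++)
        if (read(fd, buf, CHUNK) != CHUNK) { perror("read"); exit(1); }

    close(fd);
    free(buf);
}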

"Prepare the test data file with O_DIRECT option" means: in the past I
prepared the test file with buffered IO, and before testing I invalidated the
device cache and the file system cache, but the file prepared with buffered IO
still got better performance, and I have no idea why. Since I don't want any
cache involved, I now use the O_DIRECT option; the seq_read_lat changed from
9us to 15us. I still can't explain the ~25% difference from the fio result
(12 us vs 9 us).
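
For reference, the core of the random read measurement is essentially a timed
loop of aligned preads at random offsets on an O_DIRECT file descriptor. A
simplified sketch (8kB blocks, clock_gettime timing; the attached main.c is
the real thing and differs in details):

/* Minimal sketch: measure random 8kB read latency with O_DIRECT. */
#define _GNU_SOURCE             /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192
#define NREADS 10000

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "test_file";   /* pre-built data file */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    off_t nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    /* O_DIRECT requires aligned buffers, offsets and lengths. */
    void *buf;
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0) return 1;

    srandom(time(NULL));

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NREADS; i++)
    {
        off_t off = (random() % nblocks) * BLCKSZ;          /* random 8kB-aligned offset */
        if (pread(fd, buf, BLCKSZ, off) != BLCKSZ) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("random_page_lat = %fus\n", us / NREADS);

    free(buf);
    close(fd);
    return 0;
}

The same loop without O_DIRECT and with sequential offsets (after dropping the
caches) gives the seq_read_lat number.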

In the end, the computed random_page_cost does not change very much.

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383

This result looks quite different from the common advice of "use 1.1 ~ 1.5 for SSD".
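
For reference, the mapping from the three measured latencies to the
random_page_cost values printed above is simply the expected cost of a random
page read divided by the expected cost of a sequential page read at the given
cache hit ratio, where a cache hit costs fs_cache_lat:

/* How the random_page_cost values above are derived from the measured
 * latencies: expected random read time over expected sequential read time,
 * where a cache hit costs fs_cache_lat. seq_page_cost is the unit (1.0). */
#include <stdio.h>

static double
calc_random_page_cost(double fs_cache_lat, double seq_read_lat,
                      double random_page_lat, double hit_ratio)
{
    double random_lat = hit_ratio * fs_cache_lat + (1 - hit_ratio) * random_page_lat;
    double seq_lat    = hit_ratio * fs_cache_lat + (1 - hit_ratio) * seq_read_lat;

    return random_lat / seq_lat;
}

int main(void)
{
    double hits[] = {1.0, 0.9, 0.5, 0.1, 0.0};

    for (int i = 0; i < 5; i++)
        printf("cache hit ratio: %f random_page_cost %f\n",
               hits[i],
               calc_random_page_cost(0.569031, 18.901749, 148.650589, hits[i]));
    return 0;
}

With cache hit ratio = 0 this degenerates to random_page_lat / seq_read_lat,
i.e. 148.65 / 18.90 ~= 7.86.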

Attached is the modified detection program.

--
Best Regards
Andy Fan