
From Andy Fan
Subject Re: Dynamic gathering the values for seq_page_cost/xxx_cost
Date
Msg-id CAKU4AWpSvKu+Oeo8rpDfCSPCzgSK2GB8MVm5cGwkHQ37eO8-Og@mail.gmail.com
In response to Re: Dynamic gathering the values for seq_page_cost/xxx_cost  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hi Tomas:
   Thanks for checking.  

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>I recently tried something in this direction and the result looks
>promising based on my limited test.
>
>Since the unit of every xxx_cost is "seq_page_cost", how to detect
>seq_page_cost is important. In the cost model, the IO cost of a seqscan is
>rel->pages * seq_page_cost; it doesn't consider any cache (file system
>cache or shared buffer cache). However, it assumes the OS will prefetch the
>IO. So to detect seq_page_cost, I enabled the prefetch but avoided the file
>system cache. I tested this by 1) dropping the file system cache, and
>2) opening the test file without O_DIRECT so that the prefetch can work.
>
>To detect the random page read cost, I read the file with pread at random
>offsets. Since the random offsets may repeat during the test, dropping the
>file system cache only at the beginning is not enough, so I open the file
>with the O_DIRECT option.
>
>I also measured the cost of reading a page from the file system cache;
>during my test it is about 10% of a sequential read.
>
>After I get the basic numbers about the hardware capability, I let the user
>provide a cache hit ratio (this is a place where we can improve further if
>this is the right direction).
>
>Here is the test result on my hardware.
>
>fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
>
>cache hit ratio: 1.000000 random_page_cost 1.000000
>cache hit ratio: 0.900000 random_page_cost 5.073692
>cache hit ratio: 0.500000 random_page_cost 7.957589
>cache hit ratio: 0.100000 random_page_cost 8.551591
>cache hit ratio: 0.000000 random_page_cost 8.633049
>
>
>Then I tested the suggested value with the 10GB TPCH workload. I compared
>the plans under 2 different settings of random_page_cost: 1) 4, the default
>value; 2) 8.6, the cache hit ratio = 0 one. 11 out of the 22 queries
>generated a different plan. Finally I dropped the cache (both the file
>system cache and shared_buffers) before running each query and ran the 11
>queries under the 2 different settings. The execution times are below.
>
>
>|     | random_page_cost=4 | random_page_cost=8.6 |
>|-----+--------------------+----------------------|
>| Q1  |           1425.964 |             1121.928 |
>| Q2  |           2553.072 |             2567.450 |
>| Q5  |           4397.514 |             1475.343 |
>| Q6  |          12576.985 |             4622.503 |
>| Q7  |           3459.777 |             2987.241 |
>| Q8  |           8360.995 |             8415.311 |
>| Q9  |           4661.842 |             2930.370 |
>| Q11 |           4885.289 |             2348.541 |
>| Q13 |           2610.937 |             1497.776 |
>| Q20 |          13218.122 |            10985.738 |
>| Q21 |            264.639 |              262.350 |
>
>
>The attached main.c is the program I used to detect the random_page_cost.
>result.tar.gz is the test result; you can run a git log first to see the
>differences in plans or execution stats.
>
>Any feedback is welcome. Thanks!
>

That seems pretty neat. What kind of hardware have you done these tests
on?

The following is my hardware info.

I have 12 SSDs behind an MR9271-8i RAID controller, which has a 1GB buffer. [1]

root#  lshw -short -C disk
H/W path            Device          Class          Description
==============================================================
/0/100/2/0/2.0.0    /dev/sda        disk           2398GB MR9271-8i
/0/100/2/0/2.1.0    /dev/sdb        disk           5597GB MR9271-8i <-- my data location


/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0

Memory Size      : 1024MB
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0


root# /opt/MegaRAID/MegaCli/MegaCli64  -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

It's probably worth testing on various other storage systems to see
how that applies to those.

Yes, I can test more once I get new hardware; that is still in progress.
However, I can only get a physical machine with SSD or a virtual machine
with SSD; other types are hard for me to get right now.


Have you tried existing I/O testing tools, e.g. fio? If your idea is to
propose some built-in tool (similar to pg_test_fsync) then we probably
should not rely on external tools, but I wonder if we're getting the
same numbers.

Thanks for this hint; I found more interesting stuff during the comparison.

I defined the fio jobs as below.

random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread


Even though it is direct IO, the device cache still plays an important
part. The device cache is filled during the stage that prepares the test
data file, so I invalidate the device cache by writing a new dummy file.
With that, the avg latency is 148 us.
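
As a minimal standalone sketch of such an O_DIRECT random-read probe
(illustrative only; the full program is attached, and the file path and
read count here are just examples):

#define _GNU_SOURCE                     /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192
#define NREADS 4096                     /* number of random reads to time */

int
main(void)
{
    /* example 1GB test file, prepared beforehand */
    int         fd = open("/u01/yizhi/data/fio/testfile", O_RDONLY | O_DIRECT);
    void       *buf;
    off_t       nblocks;
    struct timespec t0, t1;

    if (fd < 0 || posix_memalign(&buf, 4096, BLCKSZ) != 0)
        return 1;
    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NREADS; i++)
    {
        off_t       blkno = rand() % nblocks;   /* random 8kB-aligned block */

        if (pread(fd, buf, BLCKSZ, blkno * BLCKSZ) != BLCKSZ)
            return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("random_page_lat = %fus\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_nsec - t0.tv_nsec) / 1e3) / NREADS);
    close(fd);
    return 0;
}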


seq.job

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For the seq read, we need buffered IO so that prefetch works; however, we
also need to bypass the file system cache and the device cache. fio has no
control over these caches, so I did:

1). Run fio to generate the test file.
2). Invalidate the device cache with dd if=/dev/zero of=a_dummy_file bs=1048576 count=1024.
3). Drop the file system cache (see the sketch after this list).
4). Run fio again.
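
Step 3 can also be done from the test program itself; a minimal sketch,
assuming Linux and root privileges (just the programmatic form of
"sync; echo 3 > /proc/sys/vm/drop_caches"):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Drop the Linux page cache: the equivalent of
 * "sync; echo 3 > /proc/sys/vm/drop_caches".  Needs root. */
static int
drop_fs_cache(void)
{
    int         fd;
    int         ok;

    sync();                     /* flush dirty pages first */
    fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0)
        return -1;
    ok = (write(fd, "3", 1) == 1);
    close(fd);
    return ok ? 0 : -1;
}

int
main(void)
{
    if (drop_fs_cache() != 0)
        perror("drop_fs_cache");
    return 0;
}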

The final avg latency is ~12 us.

This is a 1.5x ~ 2x difference from my previous result (seq_read_lat =
8.570290us, random_page_lat = 73.987732us).

Here are the results with some changes to my detection program.

|                                                 | seq_read_lat (us) | random_read_lat (us) |
|-------------------------------------------------+-------------------+----------------------|
| FIO                                             |                12 |                  148 |
| Previous main.c                                 |               8.5 |                   74 |
| invalidate_device_cache before each testing     |                 9 |                  150 |
| prepare the test data file with O_DIRECT option |                15 |                  150 |


In invalidate_device_cache, I just create another 1GB data file and read it
(see the invalidate_device_cache function); this is similar to the previous
fio setup.
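
The idea, as a minimal standalone sketch (the path and sizes here are just
examples): write a fresh 1GB dummy file, fsync it, and read it back once, so
the controller's 1GB cache ends up filled with unrelated data.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)   /* 1MB per IO */
#define NCHUNKS 1024            /* 1GB total, matching the controller cache */

/* Fill the device cache with unrelated data by writing and re-reading
 * a 1GB dummy file. */
static int
invalidate_device_cache(const char *dummy_path)
{
    char       *buf = malloc(CHUNK);
    int         fd;

    if (buf == NULL)
        return -1;
    memset(buf, 'x', CHUNK);

    fd = open(dummy_path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
    {
        free(buf);
        return -1;
    }

    for (int i = 0; i < NCHUNKS; i++)       /* write 1GB of dummy data */
        if (write(fd, buf, CHUNK) != CHUNK)
            break;
    fsync(fd);                              /* push it down to the device */

    lseek(fd, 0, SEEK_SET);
    for (int i = 0; i < NCHUNKS; i++)       /* and read it back once */
        if (read(fd, buf, CHUNK) != CHUNK)
            break;

    close(fd);
    free(buf);
    return 0;
}

int
main(void)
{
    return invalidate_device_cache("/u01/yizhi/data/fio/dummy_file");
}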

"Prepare the test data file with O_DIRECT option" means that in the past I
prepared the test file with buffered IO, and before testing I invalidated
the device cache and the file system cache; but the file prepared with
buffered IO still got better performance, and I have no idea why. Since I
don't want any cache, I used the O_DIRECT option in the end. The
seq_read_lat changed from 9us to 15us.
I still can't explain the 25% difference from the FIO result (12 us vs 9 us).

In the end, the resulting random_page_cost does not change very much.

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383
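
In case it helps reading these numbers: they follow a simple weighted-latency
ratio, shown below as a minimal standalone sketch of the calculation (a
simplified restatement, not the attached code itself):

#include <stdio.h>

/* random_page_cost as the ratio of expected random-read latency to expected
 * sequential-read latency, with the cache hit ratio applied to both sides;
 * seq_page_cost stays at 1.0.  This reproduces the output above. */
static double
estimate_random_page_cost(double hit_ratio, double fs_cache_lat,
                          double seq_read_lat, double random_page_lat)
{
    double      rand_lat = hit_ratio * fs_cache_lat +
                           (1.0 - hit_ratio) * random_page_lat;
    double      seq_lat = hit_ratio * fs_cache_lat +
                          (1.0 - hit_ratio) * seq_read_lat;

    return rand_lat / seq_lat;
}

int
main(void)
{
    double      ratios[] = {1.0, 0.9, 0.5, 0.1, 0.0};

    for (int i = 0; i < 5; i++)
        printf("cache hit ratio: %f random_page_cost %f\n", ratios[i],
               estimate_random_page_cost(ratios[i],
                                         0.569031, 18.901749, 148.650589));
    return 0;
}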

This result looks quite different from the usual advice that "we should use
1.1 ~ 1.5 for SSD".

Attached is the modified detection program.
 

--
Best Regards
Andy Fan
