Thread: Dynamic gathering the values for seq_page_cost/xxx_cost

Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:
The optimizer cost model usually needs 2 inputs,  one is used to represent data distribution and the other one is used to represent the capacity of the hardware, like cpu/io let's call this one as system stats.

In Oracle database, the system stats can be gathered with dbms_stats.gather_system_stats [1] on the running hardware,  In postgresql,  the value is set on based on experience (user can change the value as well, but is should be hard to decide which values they should use).  The pg way is not perfect in theory(In practice, it may be good enough or not).  for example,  HDD & SSD have different capacity regards to seq_scan_cost/random_page_cost,   cpu cost may also different on different hardware as well. 

I run into a paper [2] which did some research on dynamic gathering the values for xxx_cost, looks it is interesting.  However it doesn't provide the code for others to do more research.  before I dive into this,  It would be great to hear some suggestion from experts.  

so what do you think about this method and have we have some discussion about this before and the result? 


Thanks

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Tomas Vondra
Date:
On Tue, Nov 26, 2019 at 08:59:22AM +0800, Andy Fan wrote:
>The optimizer cost model usually needs 2 inputs,  one is used to represent
>data distribution and the other one is used to represent the capacity of
>the hardware, like cpu/io let's call this one as system stats.
>
>In Oracle database, the system stats can be gathered with
>dbms_stats.gather_system_stats [1] on the running hardware,  In
>postgresql,  the value is set on based on experience (user can change the
>value as well, but is should be hard to decide which values they should
>use).  The pg way is not perfect in theory(In practice, it may be good
>enough or not).  for example,  HDD & SSD have different capacity regards to
>seq_scan_cost/random_page_cost,   cpu cost may also different on different
>hardware as well.
>
>I run into a paper [2] which did some research on dynamic gathering the
>values for xxx_cost, looks it is interesting.  However it doesn't provide
>the code for others to do more research.  before I dive into this,  It
>would be great to hear some suggestion from experts.
>
>so what do you think about this method and have we have some discussion
>about this before and the result?
>

IMHO it would be great to have a tool that helps with tuning those
parameters, particularly random_page_cost. I'm not sure how feasible it
is, though, but if you're willing to do some initial experiments and
research, I think it's worth looking into.

It's going to be challenging, though, because even random_page_cost=4
mismatches the "raw" characteristics on any existing hardware. On old
drives the sequential/random difference is way worse, on SSDs it's about
right. But then again, we know random_page_cost=1.5 or so works mostly
fine on SSDs, and that's much lower than just raw numbers.

So it's clearly one thing to measure HW capabilities, and it's another
thing to conclude what the parameters should be ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:


On Thu, Nov 28, 2019 at 12:48 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Nov 26, 2019 at 08:59:22AM +0800, Andy Fan wrote:
>The optimizer cost model usually needs 2 inputs,  one is used to represent
>data distribution and the other one is used to represent the capacity of
>the hardware, like cpu/io let's call this one as system stats.
>
>In Oracle database, the system stats can be gathered with
>dbms_stats.gather_system_stats [1] on the running hardware,  In
>postgresql,  the value is set on based on experience (user can change the
>value as well, but is should be hard to decide which values they should
>use).  The pg way is not perfect in theory(In practice, it may be good
>enough or not).  for example,  HDD & SSD have different capacity regards to
>seq_scan_cost/random_page_cost,   cpu cost may also different on different
>hardware as well.
>
>I run into a paper [2] which did some research on dynamic gathering the
>values for xxx_cost, looks it is interesting.  However it doesn't provide
>the code for others to do more research.  before I dive into this,  It
>would be great to hear some suggestion from experts.
>
>so what do you think about this method and have we have some discussion
>about this before and the result?
>

IMHO it would be great to have a tool that helps with tuning those
parameters, particularly random_page_cost. I'm not sure how feasible it
is, though, but if you're willing to do some initial experiments and
research, I think it's worth looking into.

It's going to be challenging, though, because even random_page_cost=4
mismatches the "raw" characteristics on any existing hardware. On old
drives the sequential/random difference is way worse, on SSDs it's about
right. But then again, we know random_page_cost=1.5 or so works mostly
fine on SSDs, and that's much lower than just raw numbers.

So it's clearly one thing to measure HW capabilities, and it's another
thing to conclude what the parameters should be ...


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I recently tried something in this direction and the result looks
promising based on my limited test.

Since the unit of a xxx_cost is "seq_page_cost", then how to detect
seq_page_cost is important. In the cost model, the IO cost of a seqscan is
rel->pages * seq_page_cost, it doesn't consider any cache (file system cache or
shared buffer cache).  However, it assumes the OS will prefetch the IO. So to
detect the seq_page_cost, I enabled the prefetch but avoided the file system
cache. I tested this with 1). drop the cache on the file system. 2). Open the test 
file without O_DIRECT so that the prefetch can work.

To detect the random page read, I read it with pread with a random offset.
Since the random offsets may be the same as each other during the test, 
so even dropping the file system cache at the beginning doesn't work. so
I open it with the O_DIRECT option. 

I also measure the cost of reading a page from a file system cache, during
my test, it is about 10% of a seq scan read.

After I get the basic numbers about the hardware capability, I let the user
provide a cache hit ratio (This is a place where we can further improve if this
is a right direction).

Here is the test result on my hardware.

fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 5.073692
cache hit ratio: 0.500000 random_page_cost 7.957589
cache hit ratio: 0.100000 random_page_cost 8.551591
cache hit ratio: 0.000000 random_page_cost 8.633049


Then I tested the suggested value with the 10GB TPCH
workload. I compared the plans with 2 different settings random_page_cost =
1). 4 is the default value) 2). 8.6  the cache hint ratio = 0 one.  Then 11 out of the 22
queries generated a different plan.  At last I drop the cache (including both
file system cache and shared_buffer) before run each query and run the 11 queries
under the 2 different settings. The execution time is below.


|     | random_page_cost=4 | random_page_cost=8.6 |
|-----+--------------------+----------------------|
| Q1  |           1425.964 |             1121.928 |
| Q2  |           2553.072 |             2567.450 |
| Q5  |           4397.514 |             1475.343 |
| Q6  |          12576.985 |             4622.503 |
| Q7  |           3459.777 |             2987.241 |
| Q8  |           8360.995 |             8415.311 |
| Q9  |           4661.842 |             2930.370 |
| Q11 |           4885.289 |             2348.541 |
| Q13 |           2610.937 |             1497.776 |
| Q20 |          13218.122 |            10985.738 |
| Q21 |            264.639 |              262.350 |


The attached main.c is the program I used to detect the
random_page_cost. result.tar.gz is the test result, you can run a git log first
to see the difference on plan or execution stat.

Any feedback is welcome. Thanks!


--
Best Regards
Andy Fan
Attachment

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Tomas Vondra
Date:
On Fri, Sep 18, 2020 at 09:28:10PM +0800, Andy Fan wrote:
>On Thu, Nov 28, 2019 at 12:48 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Nov 26, 2019 at 08:59:22AM +0800, Andy Fan wrote:
>> >The optimizer cost model usually needs 2 inputs,  one is used to represent
>> >data distribution and the other one is used to represent the capacity of
>> >the hardware, like cpu/io let's call this one as system stats.
>> >
>> >In Oracle database, the system stats can be gathered with
>> >dbms_stats.gather_system_stats [1] on the running hardware,  In
>> >postgresql,  the value is set on based on experience (user can change the
>> >value as well, but is should be hard to decide which values they should
>> >use).  The pg way is not perfect in theory(In practice, it may be good
>> >enough or not).  for example,  HDD & SSD have different capacity regards
>> to
>> >seq_scan_cost/random_page_cost,   cpu cost may also different on different
>> >hardware as well.
>> >
>> >I run into a paper [2] which did some research on dynamic gathering the
>> >values for xxx_cost, looks it is interesting.  However it doesn't provide
>> >the code for others to do more research.  before I dive into this,  It
>> >would be great to hear some suggestion from experts.
>> >
>> >so what do you think about this method and have we have some discussion
>> >about this before and the result?
>> >
>>
>> IMHO it would be great to have a tool that helps with tuning those
>> parameters, particularly random_page_cost. I'm not sure how feasible it
>> is, though, but if you're willing to do some initial experiments and
>> research, I think it's worth looking into.
>>
>> It's going to be challenging, though, because even random_page_cost=4
>> mismatches the "raw" characteristics on any existing hardware. On old
>> drives the sequential/random difference is way worse, on SSDs it's about
>> right. But then again, we know random_page_cost=1.5 or so works mostly
>> fine on SSDs, and that's much lower than just raw numbers.
>>
>> So it's clearly one thing to measure HW capabilities, and it's another
>> thing to conclude what the parameters should be ...
>>
>>
>> regards
>>
>> --
>> Tomas Vondra                  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>
>I recently tried something in this direction and the result looks
>promising based on my limited test.
>
>Since the unit of a xxx_cost is "seq_page_cost", then how to detect
>seq_page_cost is important. In the cost model, the IO cost of a seqscan is
>rel->pages * seq_page_cost, it doesn't consider any cache (file system
>cache or
>shared buffer cache).  However, it assumes the OS will prefetch the IO. So
>to
>detect the seq_page_cost, I enabled the prefetch but avoided the file system
>cache. I tested this with 1). drop the cache on the file system. 2). Open
>the test
>file without O_DIRECT so that the prefetch can work.
>
>To detect the random page read, I read it with pread with a random offset.
>Since the random offsets may be the same as each other during the test,
>so even dropping the file system cache at the beginning doesn't work. so
>I open it with the O_DIRECT option.
>
>I also measure the cost of reading a page from a file system cache, during
>my test, it is about 10% of a seq scan read.
>
>After I get the basic numbers about the hardware capability, I let the user
>provide a cache hit ratio (This is a place where we can further improve if
>this
>is a right direction).
>
>Here is the test result on my hardware.
>
>fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat =
>73.987732us
>
>cache hit ratio: 1.000000 random_page_cost 1.000000
>cache hit ratio: 0.900000 random_page_cost 5.073692
>cache hit ratio: 0.500000 random_page_cost 7.957589
>cache hit ratio: 0.100000 random_page_cost 8.551591
>cache hit ratio: 0.000000 random_page_cost 8.633049
>
>
>Then I tested the suggested value with the 10GB TPCH
>workload. I compared the plans with 2 different settings random_page_cost =
>1). 4 is the default value) 2). 8.6  the cache hint ratio = 0 one.  Then 11
>out of the 22
>queries generated a different plan.  At last I drop the cache (including
>both
>file system cache and shared_buffer) before run each query and run the 11
>queries
>under the 2 different settings. The execution time is below.
>
>
>|     | random_page_cost=4 | random_page_cost=8.6 |
>|-----+--------------------+----------------------|
>| Q1  |           1425.964 |             1121.928 |
>| Q2  |           2553.072 |             2567.450 |
>| Q5  |           4397.514 |             1475.343 |
>| Q6  |          12576.985 |             4622.503 |
>| Q7  |           3459.777 |             2987.241 |
>| Q8  |           8360.995 |             8415.311 |
>| Q9  |           4661.842 |             2930.370 |
>| Q11 |           4885.289 |             2348.541 |
>| Q13 |           2610.937 |             1497.776 |
>| Q20 |          13218.122 |            10985.738 |
>| Q21 |            264.639 |              262.350 |
>
>
>The attached main.c is the program I used to detect the
>random_page_cost. result.tar.gz is the test result, you can run a git log
>first
>to see the difference on plan or execution stat.
>
>Any feedback is welcome. Thanks!
>

That seems pretty neat. What kind of hardware have you done these tests
on? It's probably worth testing on various other storage systems to see
how that applies to those.

Have you tried existing I/O testing tools, e.g. fio? If your idea is to
propose some built-in tool (similar to pg_test_fsync) then we probably
should not rely on external tools, but I wonder if we're getting the
same numbers.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:
Hi Tomas:
   Thanks for checking.  

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>I recently tried something in this direction and the result looks
>promising based on my limited test.
>
>Since the unit of a xxx_cost is "seq_page_cost", then how to detect
>seq_page_cost is important. In the cost model, the IO cost of a seqscan is
>rel->pages * seq_page_cost, it doesn't consider any cache (file system
>cache or
>shared buffer cache).  However, it assumes the OS will prefetch the IO. So
>to
>detect the seq_page_cost, I enabled the prefetch but avoided the file system
>cache. I tested this with 1). drop the cache on the file system. 2). Open
>the test
>file without O_DIRECT so that the prefetch can work.
>
>To detect the random page read, I read it with pread with a random offset.
>Since the random offsets may be the same as each other during the test,
>so even dropping the file system cache at the beginning doesn't work. so
>I open it with the O_DIRECT option.
>
>I also measure the cost of reading a page from a file system cache, during
>my test, it is about 10% of a seq scan read.
>
>After I get the basic numbers about the hardware capability, I let the user
>provide a cache hit ratio (This is a place where we can further improve if
>this
>is a right direction).
>
>Here is the test result on my hardware.
>
>fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat =
>73.987732us
>
>cache hit ratio: 1.000000 random_page_cost 1.000000
>cache hit ratio: 0.900000 random_page_cost 5.073692
>cache hit ratio: 0.500000 random_page_cost 7.957589
>cache hit ratio: 0.100000 random_page_cost 8.551591
>cache hit ratio: 0.000000 random_page_cost 8.633049
>
>
>Then I tested the suggested value with the 10GB TPCH
>workload. I compared the plans with 2 different settings random_page_cost =
>1). 4 is the default value) 2). 8.6  the cache hint ratio = 0 one.  Then 11
>out of the 22
>queries generated a different plan.  At last I drop the cache (including
>both
>file system cache and shared_buffer) before run each query and run the 11
>queries
>under the 2 different settings. The execution time is below.
>
>
>|     | random_page_cost=4 | random_page_cost=8.6 |
>|-----+--------------------+----------------------|
>| Q1  |           1425.964 |             1121.928 |
>| Q2  |           2553.072 |             2567.450 |
>| Q5  |           4397.514 |             1475.343 |
>| Q6  |          12576.985 |             4622.503 |
>| Q7  |           3459.777 |             2987.241 |
>| Q8  |           8360.995 |             8415.311 |
>| Q9  |           4661.842 |             2930.370 |
>| Q11 |           4885.289 |             2348.541 |
>| Q13 |           2610.937 |             1497.776 |
>| Q20 |          13218.122 |            10985.738 |
>| Q21 |            264.639 |              262.350 |
>
>
>The attached main.c is the program I used to detect the
>random_page_cost. result.tar.gz is the test result, you can run a git log
>first
>to see the difference on plan or execution stat.
>
>Any feedback is welcome. Thanks!
>

That seems pretty neat. What kind of hardware have you done these tests
on?

The following is my hardware info.

I have 12 SSD behind the MR9271-8i RAID Controller which has a 1GB buffer. [1]

root#  lshw -short -C disk
H/W path            Device          Class          Description
==============================================================
/0/100/2/0/2.0.0    /dev/sda        disk           2398GB MR9271-8i
/0/100/2/0/2.1.0    /dev/sdb        disk           5597GB MR9271-8i <-- my data location


/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0

Memory Size      : 1024MB
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad
BBU
...
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 12
  Critical Disks  : 0
  Failed Disks    : 0


root# /opt/MegaRAID/MegaCli/MegaCli64  -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

It's probably worth testing on various other storage systems to see
how that applies to those.

Yes, I can test more on new hardware once I get it. Now it is still in progress. 
However I can only get a physical machine with SSD or  Virtual machine with
SSD, other types are hard for me right now. 


Have you tried existing I/O testing tools, e.g. fio? If your idea is to
propose some built-in tool (similar to pg_test_fsync) then we probably
should not rely on external tools, but I wonder if we're getting the
same numbers.

Thanks for this hint, I found more interesting stuff during the comparison. 

I define the FIO jobs as below.

random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread


Even it is direct IO, the device cache still plays an important
part. The device cache is filled in preparing the test data file stage.
I invalidate the device cache by writing a new dummy file. At last the avg
latency time is 148 us.


seq.job

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For seq read, We need buffered IO for perfetch, however, we need to bypass the file
system cache and device cache. fio have no control of such caches, so I did:

1). Run fio to generate the test file.
2). Invalidate device cache first with dd if=/dev/zero of=a_dummy_file bs=1048576 count=1024
3). drop the file system cache.
4). Run the fio again.

The final avg latency is ~12 us.

This is 1.5 ~ 2 X difference with my previous result. (seq_read_lat = 8.570290us, random_page_lat =
73.987732us)

Here are some changes for my detection program.

|                                                 | seq_read_lat (us) | random_read_lat (us) |
| FIO                                             |                12 |                  148 |
| Previous main.c                                 |               8.5 |                   74 |
| invalidate_device_cache before each testing     |                 9 |                  150 |
| prepare the test data file with O_DIRECT option |                15 |                  150 |


In invalidate_device_cache, I just create another 1GB data file and read
it. (see invalidate_device_cache function) this is similar as the previous fio setup.

prepare test data file with O_DIRECT option means in the past, I prepare the test
file with buffer IO. and before testing, I do invalidate device cache, file
system cache. but the buffered prepared file still get better performance, I
have no idea of it. Since I don't want any cache.  I use O_DIRECT
option at last. The seq_read_lat changed from 9us to 15us. 
I still can't find out the 25% difference with the FIO result. (12 us vs 9 us).

At last, the random_page_cost happens to not change very much.

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383

This result looks much different from "we should use 1.1 ~ 1.5 for SSD".

The attached is the modified detection program.
 

--
Best Regards
Andy Fan
Attachment

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Ashutosh Bapat
Date:
On Mon, Sep 21, 2020 at 9:11 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Here are some changes for my detection program.
>
> |                                                 | seq_read_lat (us) | random_read_lat (us) |
> | FIO                                             |                12 |                  148 |
> | Previous main.c                                 |               8.5 |                   74 |
> | invalidate_device_cache before each testing     |                 9 |                  150 |
> | prepare the test data file with O_DIRECT option |                15 |                  150 |
>
> In invalidate_device_cache, I just create another 1GB data file and read
> it. (see invalidate_device_cache function) this is similar as the previous fio setup.
>
> prepare test data file with O_DIRECT option means in the past, I prepare the test
> file with buffer IO. and before testing, I do invalidate device cache, file
> system cache. but the buffered prepared file still get better performance, I
> have no idea of it. Since I don't want any cache.  I use O_DIRECT
> option at last. The seq_read_lat changed from 9us to 15us.
> I still can't find out the 25% difference with the FIO result. (12 us vs 9 us).
>
> At last, the random_page_cost happens to not change very much.
>
> /u/y/g/fdirect> sudo ./main
> fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us
>
> cache hit ratio: 1.000000 random_page_cost 1.000000
> cache hit ratio: 0.900000 random_page_cost 6.401019
> cache hit ratio: 0.500000 random_page_cost 7.663772
> cache hit ratio: 0.100000 random_page_cost 7.841498
> cache hit ratio: 0.000000 random_page_cost 7.864383
>
> This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
>

Very interesting. Thanks for working on this. In an earlier email you
mentioned that TPCH plans changed to efficient ones when you changed
random_page_cost = =8.6 from 4 and seq_page_cost was set to 1. IIUC,
setting random_page_cost to seq_page_cost to the same ratio as that
between the corresponding latencies improved the plans. How about
trying this with that ratio set to the one obtained from the latencies
provided by FIO? Do we see any better plans?

page cost is one thing, but there are CPU costs also involved in costs
of expression evaluation. Should those be changed accordingly to the
CPU latency?

-- 
Best Wishes,
Ashutosh Bapat



Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:
Thanks Ashutosh for coming:)

On Mon, Sep 21, 2020 at 9:03 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Sep 21, 2020 at 9:11 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Here are some changes for my detection program.
>
> |                                                 | seq_read_lat (us) | random_read_lat (us) |
> | FIO                                             |                12 |                  148 |
> | Previous main.c                                 |               8.5 |                   74 |
> | invalidate_device_cache before each testing     |                 9 |                  150 |
> | prepare the test data file with O_DIRECT option |                15 |                  150 |
>
> In invalidate_device_cache, I just create another 1GB data file and read
> it. (see invalidate_device_cache function) this is similar as the previous fio setup.
>
> prepare test data file with O_DIRECT option means in the past, I prepare the test
> file with buffer IO. and before testing, I do invalidate device cache, file
> system cache. but the buffered prepared file still get better performance, I
> have no idea of it. Since I don't want any cache.  I use O_DIRECT
> option at last. The seq_read_lat changed from 9us to 15us.
> I still can't find out the 25% difference with the FIO result. (12 us vs 9 us).
>
> At last, the random_page_cost happens to not change very much.
>
> /u/y/g/fdirect> sudo ./main
> fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us
>
> cache hit ratio: 1.000000 random_page_cost 1.000000
> cache hit ratio: 0.900000 random_page_cost 6.401019
> cache hit ratio: 0.500000 random_page_cost 7.663772
> cache hit ratio: 0.100000 random_page_cost 7.841498
> cache hit ratio: 0.000000 random_page_cost 7.864383
>
> This result looks much different from "we should use 1.1 ~ 1.5 for SSD".
>

Very interesting. Thanks for working on this. In an earlier email you
mentioned that TPCH plans changed to efficient ones when you changed
random_page_cost = =8.6 from 4 and seq_page_cost was set to 1. IIUC,
setting random_page_cost to seq_page_cost to the same ratio as that
between the corresponding latencies improved the plans.

Yes. 

How about
trying this with that ratio set to the one obtained from the latencies
provided by FIO? Do we see any better plans?

My tools set the random_page_cost to 8.6, but based on the fio data, it should be
set to 12.3 on the same hardware.  and I do see the better plan as well with 12.3.
Looks too smooth to believe it is true..

The attached result_fio_mytool.tar.gz is my test result.   You can use git show HEAD^^
is the original plan with 8.6.  git show HEAD^  show the plan changes after we changed
the random_page_cost. git show HEAD shows the run time statistics changes for these queries. 
I also uploaded the test tool [1] for this for your double check. 


|     |      8.6 |     12.3 |

|-----+----------+----------|

| Q2  | 2557.064 | 2444.995 |

| Q4  | 3544.606 | 3148.884 |

| Q7  | 2965.820 | 2240.185 |

| Q14 | 4988.747 | 4931.375 |

 

page cost is one thing, but there are CPU costs also involved in costs
of expression evaluation. Should those be changed accordingly to the
CPU latency?

Yes, we need that as well.  At the beginning of this thread, I treat all of them equally. 
In the first reply of Tomas,  he mentioned random_page_cost specially. After
~10 months, I tested TPCH on a hardware and then found random_page_cost
is set too incorrectly, after fixing it,  the result looks much better.  So I'd like to work
on this special thing first. 

 
--
Best Regards
Andy Fan
Attachment

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:

It's probably worth testing on various other storage systems to see
how that applies to those.

Yes, I can test more on new hardware once I get it. Now it is still in progress. 
However I can only get a physical machine with SSD or  Virtual machine with
SSD, other types are hard for me right now. 

 
Here is a result on a different hardware.   The test method is still not changed.[1] 

Hardware Info:

Virtual Machine with 61GB memory.
Linux Kernel: 5.4.0-31-generic  Ubuntu

# lshw -short -C disk
H/W path        Device     Class          Description
=====================================================
/0/100/4/0      /dev/vda   disk           42GB Virtual I/O device
/0/100/5/0      /dev/vdb   disk           42GB Virtual I/O device

The disk on the physical machine is claimed as SSD.

This time the FIO and my tools can generate the exact same result.

fs_cache_lat = 0.957756us, seq_read_lat = 70.780327us, random_page_lat = 438.837257us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 5.635470
cache hit ratio: 0.500000 random_page_cost 6.130565
cache hit ratio: 0.100000 random_page_cost 6.192183
cache hit ratio: 0.000000 random_page_cost 6.199989

|         | seq_read_lat(us) | random_read_lat(us) |
| FIO     |               70 |                 437 |
| MY Tool |               70 |                 438 |


The following query plans have changed because we change random_page_cost to 4
to 6.2, the Execution time also changed.

|     | random_page_cost=4 | random_page_cost=6.2 |
|-----+--------------------+----------------------|
| Q1  |               2561 |             2528.272 |
| Q10 |           4675.749 |             4684.225 |
| Q13 |          18858.048 |            18565.929 |
| Q2  |            329.279 |              308.723 |
| Q5  |          46248.132 |             7900.173 |
| Q6  |          52526.462 |            47639.503 |
| Q7  |          27348.900 |            25829.221 |


Q5 improved by 5.8 times and Q6 & Q7 improved by ~10%.

-- 
Best Regards
Andy Fan

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Ashutosh Bapat
Date:
On Tue, Sep 22, 2020 at 10:57 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> My tools set the random_page_cost to 8.6, but based on the fio data, it should be
> set to 12.3 on the same hardware.  and I do see the better plan as well with 12.3.
> Looks too smooth to believe it is true..
>
> The attached result_fio_mytool.tar.gz is my test result.   You can use git show HEAD^^
> is the original plan with 8.6.  git show HEAD^  show the plan changes after we changed
> the random_page_cost. git show HEAD shows the run time statistics changes for these queries.
> I also uploaded the test tool [1] for this for your double check.

The scripts seem to start and stop the server, drop caches for every
query. That's where you are seeing that setting random_page_cost to
fio based ratio provides better plans. But in practice, these costs
need to be set on a server where the queries are run concurrently and
repeatedly. That's where the caching behaviour plays an important
role. Can we write a tool which can recommend costs for that scenario?
How do the fio based cost perform when the queries are run repeatedly?

-- 
Best Wishes,
Ashutosh Bapat



Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:


On Fri, Sep 25, 2020 at 5:15 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Tue, Sep 22, 2020 at 10:57 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> My tools set the random_page_cost to 8.6, but based on the fio data, it should be
> set to 12.3 on the same hardware.  and I do see the better plan as well with 12.3.
> Looks too smooth to believe it is true..
>
> The attached result_fio_mytool.tar.gz is my test result.   You can use git show HEAD^^
> is the original plan with 8.6.  git show HEAD^  show the plan changes after we changed
> the random_page_cost. git show HEAD shows the run time statistics changes for these queries.
> I also uploaded the test tool [1] for this for your double check.

The scripts seem to start and stop the server, drop caches for every
query. That's where you are seeing that setting random_page_cost to
fio based ratio provides better plans. But in practice, these costs
need to be set on a server where the queries are run concurrently and
repeatedly. That's where the caching behaviour plays an important 
role. Can we write a tool which can recommend costs for that scenario?
 
I totally agree with this. Actually the first thing I did is to define a
proper IO workload. At the very beginning, I used DIRECT_IO for both seq read
and random read on my SSD, and then found the result is pretty bad per testing
(random_page_cost = ~1.6).  then I realized postgresql relies on the prefetch
which is disabled by DIRECT_IO. After I fixed this, I tested again with the above
scenario (cache hit ratio = 0) to verify my IO model. Per testing, it looks good. 
I am also thinking if the random_page_cost = 1.1 doesn't provide a good result
on my SSD because it ignores the prefects of seq read.

After I am OK with my IO model, I test with the way you see above. but 
I also detect the latency for file system cache hit, which is handled by 
get_fs_cache_latency_us in my code (I ignored the shared buffer hits for now).
and allows user to provides a cache_hit_ratio, the final random_page_cost 
= (real_random_lat) / real_seq_lat, where 
real_xxx_lat = cache_hit_ratio * fs_cache_lat + (1 - cache_hit_ratio) * xxx_lat. 
See function cal_real_lat and cal_random_page_cost.

As for the testing with cache considered, I found how to estimate cache hit
ratio is hard or how to control a hit ratio to test is hard. Recently I am thinking
a method that we can get a page_reads, shared_buffer_hit from pg_kernel
and the real io (without the file system cache hit) at os level (just as what 
iotop/pidstat do). then we can know the shared_buffer hit ratio and file system
cache hit ratio (assume it will be stable after a long run). and then do a testing.
However this would be another branch of manual work and I still have not got
it done until now.

I'd not like to share too many details, but "lucky" many cases I have haven't file
system cache, that makes things a bit easier. What I am doing right now is to 
calculate the random_page_cost with the above algorithm with only shared_buffer
considered.  and test the real benefits with real workload to see how it works. 
If it works well, I think the only thing left is to handle file system cache.

The testing is time consuming since I have to cooperate with many site engineers, 
so any improvement on the design will be much helpful. 
 
How do the fio based cost perform when the queries are run repeatedly?


That probably is not good since I have 280G+ file system cache and I have to
prepare much more than 280G data size for testing. 


--
Best Regards
Andy Fan

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Julien Rouhaud
Date:
On Sat, Sep 26, 2020 at 8:17 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> As for the testing with cache considered, I found how to estimate cache hit
> ratio is hard or how to control a hit ratio to test is hard. Recently I am thinking
> a method that we can get a page_reads, shared_buffer_hit from pg_kernel
> and the real io (without the file system cache hit) at os level (just as what
> iotop/pidstat do). then we can know the shared_buffer hit ratio and file system
> cache hit ratio (assume it will be stable after a long run). and then do a testing.
> However this would be another branch of manual work and I still have not got
> it done until now.

FWIW pg_stat_kcache [1] extension accumulates per (database, user,
queryid) physical reads and writes, so you can easily compute a
shared_buffers / IO cache / disk hit ratio.

[1] https://github.com/powa-team/pg_stat_kcache



Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From
Andy Fan
Date:


On Sat, Sep 26, 2020 at 1:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Sat, Sep 26, 2020 at 8:17 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> As for the testing with cache considered, I found how to estimate cache hit
> ratio is hard or how to control a hit ratio to test is hard. Recently I am thinking
> a method that we can get a page_reads, shared_buffer_hit from pg_kernel
> and the real io (without the file system cache hit) at os level (just as what
> iotop/pidstat do). then we can know the shared_buffer hit ratio and file system
> cache hit ratio (assume it will be stable after a long run). and then do a testing.
> However this would be another branch of manual work and I still have not got
> it done until now.

FWIW pg_stat_kcache [1] extension accumulates per (database, user,
queryid) physical reads and writes, so you can easily compute a
shared_buffers / IO cache / disk hit ratio.

[1] https://github.com/powa-team/pg_stat_kcache

WOW, this would be a good extension for this purpose.  Thanks for sharing it. 

--
Best Regards
Andy Fan