Thread: pgstat wait timeout (RE: contrib/cache_scan)

pgstat wait timeout (RE: contrib/cache_scan)

From

Kouhei Kaigai

Date:

12 March 2014, 13:55:22

It is another topic from the main thread,

I noticed the following message under the test cases that
takes heavy INSERT workload; provided by Haribabu.

[kaigai@iwashi ~]$ createdb mytest
[kaigai@iwashi ~]$ psql -af ~/cache_scan.sql mytest
\timing
Timing is on.
--cache scan select 5 million
create table test(f1 int, f2 char(70), f3 float, f4 char(100));
CREATE TABLE
Time: 22.373 ms
truncate table test;
TRUNCATE TABLE
Time: 17.705 ms
insert into test values (generate_series(1,5000000), 'fujitsu', 1.1, 'Australia software tech pvt ltd');
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout  :

Once I got above messages, write performance is dramatically
degraded, even though I didn't take detailed investigation.

I could reproduce it on the latest master branch without my
enhancement, so I guess it is not a problem something special
to me.
One other strangeness is, right now, this problem is only
happen on my virtual machine environment - VMware ESXi 5.5.0.
I couldn't reproduce the problem on my physical environment
(Fedora20, core i5-4570S).
Any ideas?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Wednesday, March 12, 2014 3:26 PM
> To: Haribabu Kommi; Kohei KaiGai
> Cc: Tom Lane; PgHacker; Robert Haas
> Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
> table scan?)
>
> Thanks for your efforts!
> >                                          Head          patched
> > Diff
> > Select -  500K                772ms        2659ms        -200%
> > Insert - 400K                   3429ms     1948ms          43% (I am
> > not sure how it improved in this case)
> > delete - 200K                 2066ms     3978ms        -92%
> > update - 200K                3915ms      5899ms        -50%
> >
> > This patch shown how the custom scan can be used very well but coming
> > to patch as It is having some performance problem which needs to be
> > investigated.
> >
> > I attached the test script file used for the performance test.
> >
> First of all, it seems to me your test case has too small data set that
> allows to hold all the data in memory - briefly 500K of 200bytes record
> will consume about 100MB. Your configuration allocates 512MB of
> shared_buffer, and about 3GB of OS-level page cache is available.
> (Note that Linux uses free memory as disk cache adaptively.)
>
> This cache is designed to hide latency of disk accesses, so this test case
> does not fit its intention.
> (Also, the primary purpose of this module is a demonstration for
> heap_page_prune_hook to hook vacuuming, so simple code was preferred than
> complicated implementation but better performance.)
>
> I could reproduce the overall trend, no cache scan is faster than cached
> scan if buffer is in memory. Probably, it comes from the cost to walk down
> T-tree index using ctid per reference.
> Performance penalty around UPDATE and DELETE likely come from trigger
> invocation per row.
> I could observe performance gain on INSERT a little bit.
> It's strange for me, also. :-(
>
> On the other hand, the discussion around custom-plan interface effects this
> module because it uses this API as foundation.
> Please wait for a few days to rebase the cache_scan module onto the newer
> custom-plan interface; that I submitted just a moment before.
>
> Also, is it really necessary to tune the performance stuff in this example
> module of the heap_page_prune_hook?
> Even though I have a few ideas to improve the cache performance, like
> insertion of multiple rows at once or local chunk copy instead of t-tree
> walk down, I'm not sure whether it is productive in the current v9.4
> timeframe. ;-(
>
> Thanks,
> --
> NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei
> <kaigai@ak.jp.nec.com>
>
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Haribabu
> > Kommi
> > Sent: Wednesday, March 12, 2014 1:14 PM
> > To: Kohei KaiGai
> > Cc: Kaigai Kouhei(海外 浩平); Tom Lane; PgHacker; Robert Haas
> > Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for
> > cache-only table scan?)
> >
> > On Thu, Mar 6, 2014 at 10:15 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> > > 2014-03-06 18:17 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
> > >> I will update you later regarding the performance test results.
> > >>
> >
> > I ran the performance test on the cache scan patch and below are the
> readings.
> >
> > Configuration:
> >
> > Shared_buffers - 512MB
> > cache_scan.num_blocks - 600
> > checkpoint_segments - 255
> >
> > Machine:
> > OS - centos - 6.4
> > CPU - 4 core 2.5 GHZ
> > Memory - 4GB
> >
> >                                          Head          patched
> > Diff
> > Select -  500K                772ms        2659ms        -200%
> > Insert - 400K                   3429ms     1948ms          43% (I am
> > not sure how it improved in this case)
> > delete - 200K                 2066ms     3978ms        -92%
> > update - 200K                3915ms      5899ms        -50%
> >
> > This patch shown how the custom scan can be used very well but coming
> > to patch as It is having some performance problem which needs to be
> > investigated.
> >
> > I attached the test script file used for the performance test.
> >
> > Regards,
> > Hari Babu
> > Fujitsu Australia
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Re: pgstat wait timeout (RE: contrib/cache_scan)

From

Tom Lane

Date:

12 March 2014, 14:42:18

Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout

> Once I got above messages, write performance is dramatically
> degraded, even though I didn't take detailed investigation.

> I could reproduce it on the latest master branch without my
> enhancement, so I guess it is not a problem something special
> to me.
> One other strangeness is, right now, this problem is only
> happen on my virtual machine environment - VMware ESXi 5.5.0.
> I couldn't reproduce the problem on my physical environment
> (Fedora20, core i5-4570S).

We've seen sporadic reports of that sort of behavior for years, but no
developer has ever been able to reproduce it reliably.  Now that you've
got a reproducible case, do you want to poke into it and see what's going
on?
        regards, tom lane

Re: pgstat wait timeout (RE: contrib/cache_scan)

From

"Tomas Vondra"

Date:

12 March 2014, 15:26:26

On 12 Březen 2014, 14:54, Kouhei Kaigai wrote:
> It is another topic from the main thread,
>
> I noticed the following message under the test cases that
> takes heavy INSERT workload; provided by Haribabu.
>
> [kaigai@iwashi ~]$ createdb mytest
> [kaigai@iwashi ~]$ psql -af ~/cache_scan.sql mytest
> \timing
> Timing is on.
> --cache scan select 5 million
> create table test(f1 int, f2 char(70), f3 float, f4 char(100));
> CREATE TABLE
> Time: 22.373 ms
> truncate table test;
> TRUNCATE TABLE
> Time: 17.705 ms
> insert into test values (generate_series(1,5000000), 'fujitsu', 1.1,
> 'Australia software tech pvt ltd');
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
>    :
>
> Once I got above messages, write performance is dramatically
> degraded, even though I didn't take detailed investigation.
>
> I could reproduce it on the latest master branch without my
> enhancement, so I guess it is not a problem something special
> to me.
> One other strangeness is, right now, this problem is only
> happen on my virtual machine environment - VMware ESXi 5.5.0.
> I couldn't reproduce the problem on my physical environment
> (Fedora20, core i5-4570S).
> Any ideas?

I've seen this happening in cases when it was impossible to write
the stat file for some reason. IIRC there were two basic causes I've seen
in the past:

(1) writing the stat copy failed - for example when the temporary stat
directory was placed in tmpfs, but it was too small

(2) writing the stat copy took too long - e.g. with tmpfs and memory
pressure, forcing the system to swap to free space for the stat copy

(3) IIRC the inquiry (backend -> postmaster) to write the file is sent
using UDP, which may be dropped in some cases (e.g. when the system is
overloaded), so the postmaster does not even know it should write the file

I'm not familiar with VMware ESXi virtualization, but I suppose it might
be relevant to all three causes.

regards
Tomas

Re: pgstat wait timeout (RE: contrib/cache_scan)

From

Jeff Janes

Date:

12 March 2014, 16:42:28

On Wed, Mar 12, 2014 at 7:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
> WARNING: pgstat wait timeout
> WARNING: pgstat wait timeout
> WARNING: pgstat wait timeout
> WARNING: pgstat wait timeout

> Once I got above messages, write performance is dramatically
> degraded, even though I didn't take detailed investigation.

> I could reproduce it on the latest master branch without my
> enhancement, so I guess it is not a problem something special
> to me.
> One other strangeness is, right now, this problem is only
> happen on my virtual machine environment - VMware ESXi 5.5.0.
> I couldn't reproduce the problem on my physical environment
> (Fedora20, core i5-4570S).

We've seen sporadic reports of that sort of behavior for years, but no
developer has ever been able to reproduce it reliably. Now that you've
got a reproducible case, do you want to poke into it and see what's going
on?

I didn't know we were trying to reproduce it, nor that it was a mystery. Do anything that causes serious IO constipation, and you will probably see that message. For example, turn off synchronous_commit and run the default pgbench transaction at a large scale but that still comfortably fits in RAM, and wait for a checkpoint sync phase to kick in.

The pgstat wait timeout is a symptom, not the cause.

Cheers,

Jeff

Re: pgstat wait timeout (RE: contrib/cache_scan)

From

Tom Lane

Date:

12 March 2014, 16:51:53

Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, Mar 12, 2014 at 7:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We've seen sporadic reports of that sort of behavior for years, but no
>> developer has ever been able to reproduce it reliably.  Now that you've
>> got a reproducible case, do you want to poke into it and see what's going
>> on?

> I didn't know we were trying to reproduce it, nor that it was a mystery.
>  Do anything that causes serious IO constipation, and you will probably see
> that message.

The cases that are a mystery to me are where there's no reason to believe
that I/O is particularly overloaded.  But perhaps Kaigai-san's example is
only that ...
        regards, tom lane