Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Date
Msg-id 510F062F.1080001@fuzzy.cz
Whole thread Raw
In response to Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
List pgsql-hackers
On 3.2.2013 21:54, Jeff Janes wrote:
> On Sat, Feb 2, 2013 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> On 3.1.2013 20:33, Magnus Hagander wrote:
>>>>
>>>> Yeah, +1 for a separate directory not in global.
>>>
>>> OK, I moved the files from "global/stat" to "stat".
>>
>> This has a warning:
>>
>> pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with
>> no prototype before its definition
>>
>> I plan to do some performance testing, but that will take a while so I
>> wanted to post this before I get distracted.
>
> Running "vacuumdb -a" on a cluster with 1000 db with 200 tables (x
> serial primary key) in each, I get log messages like this:
>
> last_statwrite 23682-06-18 22:36:52.960194-07 is later than
> collector's time 2013-02-03 12:49:19.700629-08 for db 16387
>
> Note the bizarre year in the first time stamp.
>
> If it matters, I got this after shutting down the cluster, blowing
> away $DATA/stat/*, then restarting it and invoking vacuumdb.

I somehow expected that hash_search zeroes all the fields of a new
entry, but looking at pgstat_get_db_entry that obviously is not the
case. So stats_timestamp (which tracks timestamp of the last write for a
DB) was random - that's where the bizarre year values came from.

I've added a proper initialization (to 0), and now it works as expected.

Although the whole sequence of errors I was getting was this:

LOG:  last_statwrite 11133-08-28 19:22:31.711744+02 is later than
collector's time 2013-02-04 00:54:21.113439+01 for db 19093
WARNING:  pgstat wait timeout
LOG:  last_statwrite 39681-12-23 18:48:48.9093+01 is later than
collector's time 2013-02-04 00:54:31.424681+01 for db 46494
FATAL:  could not find block containing chunk 0x2af4a60
LOG:  statistics collector process (PID 10063) exited with exit code 1
WARNING:  pgstat wait timeout
WARNING:  pgstat wait timeout

I'm not entirely sure where the FATAL came from, but it seems it was
somehow related to the issue - it was quite reproducible, although I
don't see how exactly could this happen. There relevant block of code
looks like this:

    char       *writetime;
    char       *mytime;

    /* Copy because timestamptz_to_str returns a static buffer */
    writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
    mytime = pstrdup(timestamptz_to_str(cur_ts));
    elog(LOG, "last_statwrite %s is later than collector's time %s for "
        "db %d", writetime, mytime, dbentry->databaseid);
    pfree(writetime);
    pfree(mytime);

which seems quite fine to mee. I'm not sure how one of the pfree calls
could fail?

Anyway, attached is a patch that fixes all three issues, i.e.

1) the un-initialized timestamp
2) the "void" ommited from the signature
3) rename to "pg_stat" instead of just "stat"


Tomas

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Next
From: Robert Haas
Date:
Subject: Re: cannot move relocatable extension out of pg_catalog schema