Re: [PERFORM] 8.3beta1 testing on Solaris - Mailing list pgsql-hackers

From Jignesh K. Shah
Subject Re: [PERFORM] 8.3beta1 testing on Solaris
Date
Msg-id 472219D6.6030001@sun.com
In response to Re: [PERFORM] 8.3beta1 testing on Solaris  ("Jignesh K. Shah" <J.K.Shah@Sun.COM>)
Responses Re: [PERFORM] 8.3beta1 testing on Solaris
List pgsql-hackers
Also, to give some perspective on the corresponding writes to CLOG:

I used the following script, which runs for 10 seconds, to track all writes to
the clog directory, and here is what it came up with (this is with 500
users running):

# cat write.d
#!/usr/sbin/dtrace -s
/* Count write(2) calls issued by postgres against files in the pg_clog
   directory, keyed by file name and user buffer address (arg1). */
syscall::write:entry
/execname == "postgres" &&
 dirname(fds[arg0].fi_pathname) == "/export/home0/igen/pgdata/pg_clog"/
{
    @write[fds[arg0].fi_pathname, arg1] = count();
}

/* Stop after 10 seconds. */
tick-10sec
{
    exit(0);
}

# ./write.d
dtrace: script './write.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3   1026                      :tick-10sec

  /export/home0/igen/pgdata/pg_clog/001E
-2753028277088                1
#
I modified read.d to run for 5 seconds and then exit.
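The modified script is not quoted here; a minimal sketch of what it may have
looked like, assuming the only changes to the read.d shown further down were
an added BEGIN marker and a tick-5sec exit (the run below reports three
probes matched):

#!/usr/sbin/dtrace -s
/* Hypothetical 5-second variant of read.d (the actual modified script was
   not included in the message). */
BEGIN
{
    /* start marker; fires once when the script begins */
}

syscall::read:entry
/execname == "postgres"/
{
    /* key each read(2) by file name and user buffer address */
    @read[fds[arg0].fi_pathname, arg1] = count();
}

tick-5sec
{
    /* stop after 5 seconds */
    exit(0);
}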
# ./read.d
dtrace: script './read.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN
  0   1027                       :tick-5sec

  /export/home0/igen/pgdata/pg_clog/001F
-2753028268896                1
  /export/home0/igen/pgdata/pg_clog/001F
-2753028252512                1
  /export/home0/igen/pgdata/pg_clog/001F
-2753028285280                2
  /export/home0/igen/pgdata/pg_clog/001F
-2753028277088                3
  /export/home0/igen/pgdata/pg_clog/001F
-2753028236128                3
  /export/home0/igen/pgdata/pg_clog/001E
-2753028285280                5
  /export/home0/igen/pgdata/pg_clog/001E
-2753028236128                9
  /export/home0/igen/pgdata/pg_clog/001E
-2753028277088               13
  /export/home0/igen/pgdata/pg_clog/001E
-2753028268896               15
  /export/home0/igen/pgdata/pg_clog/001E
-2753028252512               27
#

So the ratio of reads to writes on the clog files is very high: in these
samples there were 79 reads in 5 seconds versus a single write in 10 seconds.
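To see which code path inside postgres issues these repeated clog reads, one
option would be to aggregate user stacks on them; a rough sketch along those
lines, reusing the structure of write.d and assuming the same data-directory
path (the SimpleLruReadPage path mentioned below would then show up directly
in the stacks):

#!/usr/sbin/dtrace -s
/* Hypothetical follow-up script: for every read(2) that postgres issues
   against a pg_clog file, record the user stack, so the calling code path
   appears in the aggregation output. */
syscall::read:entry
/execname == "postgres" &&
 dirname(fds[arg0].fi_pathname) == "/export/home0/igen/pgdata/pg_clog"/
{
    @stacks[ustack()] = count();
}

tick-10sec
{
    exit(0);
}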


-Jignesh



Jignesh K. Shah wrote:
> Tom,
>
> Here is what I did:
>
> I started aggregating all read information:
>
> First I had also added a group-by on pid (arg0, arg1, pid), and the counts
> were all coming back as 1.
>
> Then I grouped just by filename and location (arg0, arg1 of the reads), and
> the counts came back as:
>
> # cat read.d
> #!/usr/sbin/dtrace -s
> syscall::read:entry
> /execname=="postgres"/
> {
>    @read[fds[arg0].fi_pathname, arg1] = count();
> }
>
>
> # ./read.d
> dtrace: script './read.d' matched 1 probe
> ^C
>
>  /export/home0/igen/pgdata/pg_clog/0014
> -2753028293472                1
>  /export/home0/igen/pgdata/pg_clog/0014
> -2753028277088                1
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028244320                2
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028268896               14
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028260704               25
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028252512               27
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028277088               28
>  /export/home0/igen/pgdata/pg_clog/0015
> -2753028293472               37
>
>
> FYI, I pressed Ctrl-C within less than a second.
>
> So it seems to me that multiple processes (different pids) are reading the
> same page. (This was with about 600 users active.)
>
> Apparently we do have a problem in that we are reading the same buffer
> address again (i.e., it is not cached anywhere, or is not being found in
> any cache).
>
> I reran the lock-wait script on a couple of processes and did not see
> CLogControlFileLock show up as a problem:
>
> # ./83_lwlock_wait.d 14341
>
>             Lock Id            Mode           Count
>       WALInsertLock       Exclusive               1
>       ProcArrayLock       Exclusive              16
>
>             Lock Id   Combined Time (ns)
>       WALInsertLock               383109
>       ProcArrayLock            198866236
>
> # ./83_lwlock_wait.d 14607
>
>             Lock Id            Mode           Count
>       WALInsertLock       Exclusive               2
>       ProcArrayLock       Exclusive              15
>
>             Lock Id   Combined Time (ns)
>       WALInsertLock                55243
>       ProcArrayLock             69700140
>
> #
>
> What information would help you find out why it is reading the same page again?
>
>
> -Jignesh
>
>
>
> Jignesh K. Shah wrote:
>> I agree with Tom: somehow I think increasing NUM_CLOG_BUFFERS just pushes
>> the symptom out to a later point. I promise to look into it more before
>> making any recommendation to increase NUM_CLOG_BUFFERS.
>>
>>
>> Although "iGen" showed improvements in that area from increasing
>> num_clog_buffers, EAStress showed no improvement. Also, I don't think this
>> is the problem in 8.3beta1, since the lock output clearly does not show
>> CLogControlFileLock as the issue, which I had seen in the earlier case. So
>> I don't think that increasing NUM_CLOG_BUFFERS will change things here.
>>
>> Now, I don't understand the code very well yet, but I see three hotspots,
>> and I am not sure whether they are related to each other:
>> * ProcArrayLock waits - causing lock waits, as reported by the
>> 83_lockwait.d script
>> * SimpleLruReadPage - causing read I/Os, as reported by iostat/rsnoop.d
>> * GetSnapshotData - causing CPU utilization, as reported by hotuser
>>
>> But I will shut up and do more testing.
>>
>> Regards,
>> Jignesh
>>
>>
>>
>> Tom Lane wrote:
>>> Josh Berkus <josh@agliodbs.com> writes:
>>>
>>>> Actually, 32 made a significant difference as I recall ... do you
>>>> still have the figures for that, Jignesh?
>>>>
>>>
>>> I'd want to see a new set of test runs backing up any call for a change
>>> in NUM_CLOG_BUFFERS --- we've changed enough stuff around this area
>>> that
>>> benchmarks using code from a few months back shouldn't carry a lot of
>>> weight.
>>>
>>>             regards, tom lane
>>>
>>
