Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement) - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Date
Msg-id alpine.DEB.2.02.1306142227180.10940@localhost6.localdomain6
Whole thread Raw
In response to Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
>> I think that the weirdness really comes from the way transactions times
>> are measured, their interactions with throttling, and latent bugs in the
>> code.
>
> measurement times, no; interactions with throttling, no.  If it was either of 
> those I'd have finished this off days ago.  Latent bugs, possibly.  We may 
> discover there's nothing wrong with your code at the end here,

To summarize my point: I think my v10 code does not take into account all 
of the strangeness in doCustom, and I'm pretty sure that there is no point
in including thottle sleeps into latency measures, which was more or less 
the case. So it is somehow a "bug" which only shows up if you look at the 
latency measures, but the tps are fine.

> that it just makes hitting this bug more likely. Unfortunately today is 
> the day *some* bug is popping up, and I want to get it squashed before 
> I'll be happy.
>
> The lag is actually happening during a kernel call that isn't working as 
> expected.  I'm not sure whether this bug was there all along if \sleep was 
> used, or if it's specific to the throttle sleep.

The throttle sleep is inserted out of the state machine. That is why in 
the "test" patch I added a goto to ensure that it is always taken at the 
right time, that is when state==0 and before txn_begin is set, and not 
possibly between other states when doCustom happens to be recalled after a 
return.

> I added a bunch more logging as pgbench steps through its work to track down 
> where it's stuck at.  Until the end all transactions look like this:
>
> 1371238832.084783 client 10 throttle lag 2 us
> 1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
> 1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts 
> WHERE aid = 753099;
> 1371238832.084840 calling select
> 1371238832.086539 client 10 receiving
> 1371238832.086539 client 10 finished
>
> All clients who hit lag spikes at the end are going through this sequence    > instead:
>
> 1371238832.085912 client 13 throttle lag 790 us
> 1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
> 1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts 
> WHERE aid = 564894;
> 1371238832.086592 client 13 receiving
> 1371238832.086662 calling select
> 1371238832.235543 client 13 receiving
> 1371238832.235543 client 13 finished

> Note the "calling select" here that wasn't in the normal length transaction 
> before it.  The client is receiving something here, but rather than it 
> finishing the transaction it falls through and ends up at the select() system 
> call outside of doCustom.  All of the clients that are sleeping when the 
> system slips into one of these long select() calls are getting stuck behind 
> it.  I'm not 100% sure, but I think this only happens when all remaining 
> clients are sleeping.

Note: in both the slow cases there is a "receiving" between "sending" and 
"select". This suggests that the "goto top" at the very end of doCustom is 
followed in one case but not the other.

ISTM that there is a timeout passed to select which is computed based on 
the current sleeping time of each client. I'm pretty sure that not a well 
tested path...

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Next
From: Robert Haas
Date:
Subject: dynamic background workers