Thread: Hot Standby performance issue

Hot Standby performance issue

From
sparikh
Date:
Hi, this is my first post ever in the Postgres forum. I am relatively new to Postgres, coming from an Oracle background.

We have a hot standby set up mainly to serve read-only queries. For the past few days we have been facing a performance issue with one of our transaction searches. The search mainly utilizes 3 of our biggest transaction tables.

We recently had a crash on both primary and standby because of space issues. Both servers were brought up and running successfully after that incident. The standby is almost in sync with the primary, behind by less than a second. I also rebuilt all the major indexes on the primary. I have done some research to address the issue, as follows:

(1) I checked most of the database parameter settings and they are the same on both primary and standby, except some specific to the individual server.
(2) I checked the explain plan for the offending query and it is exactly the same on both servers. I checked CPU usage on the Unix box and found it was quite low.
(3) The load on the standby does not seem to be the issue, because even with absolutely no load the query takes long and most of the time returns with a conflict error.
(4) The hardware settings are exactly the same on both primary and secondary.
(5) The same query executes very fast on the primary.
(6) After we recovered the standby it was fine for a few weeks, and then it started slowing down again.

I believe autovacuum and analyze do not need to be run on the standby as it inherits them from the primary. Please correct me if I am wrong.

Any help or suggestion would be greatly appreciated. Thanks,


Re: Hot Standby performance issue

From
sparikh
Date:
Does anybody have any ideas or pointers? This is a high-priority issue I
have to resolve at work. Any help would be greatly appreciated.





Re: Hot Standby performance issue

From
Sethu Prasad
Date:


On Fri, Oct 18, 2013 at 11:49 PM, sparikh <sparikh@ecotality.com> wrote:
Does anybody have any ideas or pointers? This is a high-priority issue I
have to resolve at work. Any help would be greatly appreciated.






Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 15.10.2013 22:40, sparikh wrote:
> Hi, this is my first post ever in the Postgres forum. I am
> relatively new to Postgres, coming from an Oracle background.
>
> We have a hot standby set up mainly to serve read-only queries. For
> the past few days we have been facing a performance issue with one
> of our transaction searches. The search mainly utilizes 3 of our
> biggest transaction tables.

What do you mean by "transaction search"?

>
> We recently had a crash on both primary and standby because of space
> issues. Both servers were brought up and running successfully after
> that incident. The standby is almost in sync with the primary,
> behind by less than a second. I also rebuilt all the major indexes
> on the primary. I have done some research to address the issue, as
> follows:
>
> (1) I checked most of the database parameter settings and they are
> the same on both primary and standby, except some specific to the
> individual server.

So, what are the basic parameters?

And what PostgreSQL version / OS / hw are we dealing with?

> (2) I checked the explain plan for the offending query and it is
> exactly the same on both servers. I checked CPU usage on the Unix
> box and found it was quite low.

Well, that's hardly useful, as you haven't provided the query nor the
explain plan. Have you tried EXPLAIN ANALYZE?
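
For example (the query text here is just a placeholder; the BUFFERS
option has been available since 9.0):

   EXPLAIN (ANALYZE, BUFFERS) SELECT ... your query ...;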

> (3) The load on the standby does not seem to be the issue, because
> even with absolutely no load the query takes long and most of the
> time returns with a conflict error.

Not sure I understand this. Are you saying that the standby is mostly
idle, i.e. the query seems to be stuck, and then fails with a conflict
error most of the time?

> (4) The hardware settings are exactly the same on both primary and
> secondary.

So what are these hardware settings? BTW do you have some stats from the
OS (CPU / IO / memory) collected at the time of the performance issue?

> (5) The same query executes very fast on the primary.

Query? Explain analyze?

> (6) After we recovered the standby it was fine for a few weeks, and
> then it started slowing down again.

Was it slowing down gradually, or did it start failing suddenly?

> I believe autovacuum and analyze do not need to be run on the
> standby as it inherits them from the primary. Please correct me if I
> am wrong.

Any commands that would modify the database (including vacuum and
autovacuum) are disabled on the standby.
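
For example (table name hypothetical), running this on the standby
fails immediately:

   VACUUM some_table;
   -- ERROR:  cannot execute VACUUM during recovery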

> Any help or suggestion would be greatly appreciated. Thanks,

Please post as many details as possible. This report contains pretty
much none - query, explain or explain analyze, info about the settings
/ hardware, etc.

regards
Tomas


Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 18.10.2013 23:49, sparikh wrote:
> Does anybody have any ideas or pointers? This is a high-priority
> issue I have to resolve at work. Any help would be greatly
> appreciated.

To help you we really need a much more specific report. The one you
posted contains no details whatsoever - it doesn't even mention what
version of PostgreSQL you use, on what OS, or the query itself.

Please, provide substantially more details. We can't really help you
without it.

regards
Tomas

PS: Please, format your messages reasonably, i.e. not one huge paragraph
of text.


Re: Hot Standby performance issue

From
sparikh
Date:
Hi Tomas,

Thanks so much for your response, and sorry for not providing enough
details.

I have attached a zip file which has the query, explain plan and
database parameter settings for both primary and secondary.

Please note that the query has multiple unions; only the first query on
top is causing the performance issue.

Transaction search is one of the features in our Admin user interface
(web portal) where users can search for transactions against our OLTP
database. The attached query is generated dynamically by the
application.

> (3) The load on the standby does not seem to be the issue, because
> even with absolutely no load the query takes long and most of the
> time returns with a conflict error.

Not sure I understand this. Are you saying that the standby is mostly
idle, i.e. the query seems to be stuck, and then fails with a conflict
error most of the time?

The standby is not idle all the time. What I meant was that even with
no user activity or no active user sessions, if I issue the query
directly from the pgAdmin tool it takes forever.

Hardware settings both primary and secondary :
===================================

Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Linux 2.6.18-194.26.1.el5 x86_64
4 CPUs
16 GB RAM
Intel Xeon

Postgresql Version:
=================
 "PostgreSQL 9.1.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2
20080704 (Red Hat 4.1.2-51), 64-bit"

(6) After we recovered the standby it was fine for a few weeks, and
then it started slowing down again.

Was it slowing down gradually, or did it start failing suddenly?

Honestly speaking, I do not know exactly; when users started reporting
the issue I started looking into it. But the performance was good in
September and somewhere in October it started slowing down. I guess it
was gradual. There were no code changes in the application or major
changes in the data volume.

Hope this helps. Please let me know if you need any other details.

Thanks Again.






Re: Hot Standby performance issue

From
Tomas Vondra
Date:
Hi,

On 19.10.2013 02:53, sparikh wrote:
> Hi Tomas,
>
> Thanks so much for your response, and sorry for not providing enough
> details.
>
> I have attached a zip file which has the query, explain plan and
> database parameter settings for both primary and secondary.

I see no attachment, so it got lost somehow. Can you post the plan on
explain.depesz.com and the query inline? Assuming it's not a horror
story produced by some ORM - in that case an attachment is probably
appropriate, so as not to cause a heart attack to casual readers.

> Please note that the query has multiple unions; only the first query
> on top is causing the performance issue.

So, if you run only the problematic part, is it faster / slower? If
it's equally slow, then we can deal with a simpler query, right?

> Transaction search is one of the features in our Admin user
> interface (web portal) where users can search for transactions
> against our OLTP database. The attached query is generated
> dynamically by the application.

I.e. it's a form in the UI, and the application generates a query
matching some fields in the form.

>> (3) The load on the standby does not seem to be the issue, because
>> even with absolutely no load the query takes long and most of the
>> time returns with a conflict error.
>
> Not sure I understand this. Are you saying that the standby is
> mostly idle, i.e. the query seems to be stuck, and then fails with a
> conflict error most of the time?
>
> The standby is not idle all the time. What I meant was that even
> with no user activity or no active user sessions, if I issue the
> query directly from the pgAdmin tool it takes forever.

IIRC I was unable to parse your description of what's happening on the
standby (might be my fault, as I'm not a native speaker).

I'm still not sure whether the query is stuck or only processing the
data very slowly. Anyway, instead of describing what's happening, could
you collect some data using vmstat/iostat and post it here? Something like

   vmstat 1

and

   iostat -x -k 1

collected while executing the query. Give us ~15-30 seconds of data for
each, depending on how much it fluctuates. Another option is to use
'perf' to monitor the backend executing your query. Something like

   perf record -g -a -p PID

for a reasonable time, and then 'perf report' (there are a zillion
options available for perf, but this should give you some idea where
most of the time is spent).

If you have a strong stomach, you might even use strace ...

> Hardware settings both primary and secondary :
> ===================================
>
> Red Hat Enterprise Linux Server release 5.5 (Tikanga)
> Linux 2.6.18-194.26.1.el5 x86_64
> 4 CPUs
> 16 GB RAM
> Intel Xeon

OK. But that's only the HW+OS. What about the basic database parameters
that you mentioned checking? Say, shared_buffers, work_mem and such
things?

> Postgresql Version:
> =================
>  "PostgreSQL 9.1.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2
> 20080704 (Red Hat 4.1.2-51), 64-bit"

Any particular reason why you're running a minor version that's two
years old, when numerous bugs have been fixed since then? Including a
serious security issue in 9.1.9.

Please plan an update to 9.1.10 ASAP. It might easily be the case that
you're hitting a bug that was already fixed a long time ago.

>> (6) After we recovered the standby it was fine for a few weeks, and
>> then it started slowing down again.
>
>> Was it slowing down gradually, or did it start failing suddenly?
>
> Honestly speaking, I do not know exactly; when users started
> reporting the issue I started looking into it. But the performance
> was good in September and somewhere in October it started slowing
> down. I guess it was gradual. There were no code changes in the
> application or major changes in the data volume.

IMHO that's a pretty important fact. If your monitoring or (at least)
your logs can't give you an answer, then you should improve that. At
least start logging slow queries, and deploy some very basic monitoring
(e.g. Munin is quite simple, and its PostgreSQL plugin, among other
things, collects data about replication conflicts, which might be the
culprit here).
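
For the slow query logging, something like this in postgresql.conf
would do (the threshold is just an example):

   log_min_duration_statement = 1000   # log statements taking over 1s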

> Hope this helps. Please let me know if you need any other details.


regards
Tomas


Re: Hot Standby performance issue

From
Kevin Grittner
Date:
sparikh <sparikh@ecotality.com> wrote:

> PostgreSQL 9.1.1

You really should apply the fixes for bugs and security
vulnerabilities which are available.  Some of those may address a
relevant performance problem.

http://www.postgresql.org/support/versioning/

> the performance was good in September and somewhere in October it
> started slowing down. I guess it was gradual. There were no code
> changes in the application or major changes in the data volume.

We really need to see your configuration, along with a description
of the machine.  One common cause for such a slowdown is not having
autovacuum configured to be aggressive enough.  Another possible
cause is a transaction which has been left open for too long.  Look
at pg_stat_activity and pg_prepared_xacts for xact_start or
prepared more than an hour old.
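
A sketch of those checks (column names as of 9.1, the one-hour
threshold is just illustrative):

   SELECT procpid, xact_start, current_query
     FROM pg_stat_activity
    WHERE xact_start < now() - interval '1 hour';

   SELECT gid, prepared
     FROM pg_prepared_xacts
    WHERE prepared < now() - interval '1 hour';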

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hot Standby performance issue

From
sparikh
Date:
Thanks so much Tomas and Kevin for your valuable inputs. I am getting
very good responses from this forum and learning so much new stuff. I
will try all those options and will keep you updated.


standby_performance_issue.rar
<http://postgresql.1045698.n5.nabble.com/file/n5775181/standby_performance_issue.rar>


On further digging I found from the New Relic report that as soon as I
execute the query, IO spikes immediately (100%). But the same query on
the primary executes very fast.

I am not sure if Postgres has some utility like Oracle's tkprof or AWR
where I can pinpoint exactly where the query spends its time. I will
try Tomas' suggestions, perf and strace.

Below is the query. I also tried to attach the rar file one more time;
hopefully it gets through this time.

SELECT xfer_id, transaction_type, evse_time, transmit_time, error_code,
detail, reported_card_account_number as reported_card_account_number,
event_id,
 event_code,evse_id, batch_id, port, charge_event_id as charge_event_id FROM
(SELECT t.transaction_id::text AS xfer_id, t.transaction_type, e.event_time
AS evse_time,
 t.create_date AS transmit_time, t.error_code::text, '' AS detail,
COALESCE(e.reported_rfid,'N/A') AS reported_card_account_number,
e.event_id::text, e.event_code::text,
 t.evse_id::text, t.batch_id, e.port, COALESCE(e.evse_charge_id,'N/A') AS
charge_event_id
FROM evse_transaction t, evse_event e , evse_unit u
WHERE e.transaction_id = t.transaction_id AND t.evse_id = u.evse_id
AND e.event_code IN

('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49'))
 
AND u.evse_id = 1
  AND t.create_date BETWEEN '2013-10-01'::date AND '2013-10-15'::date +
INTERVAL '1 day'
 UNION
 SELECT t.transaction_id::text AS xfer_id, t.transaction_type, t.log_time AS
evse_time, t.create_date AS transmit_time, t.error_code::text, '' AS detail,
 COALESCE(t.reported_card_account_number,'N/A') AS
reported_card_account_number, '' AS event_id, '' AS event_code,
t.evse_id::text,t.batch_id, '' AS port, 'N/A' AS charge_event_id
FROM evse_transaction t, evse_unit u
WHERE t.evse_id = u.evse_id AND t.api_error IS NULL
AND t.transaction_type NOT IN

('DCFCTransactionService','L2TransactionService','EVSEUploadTransactionService','EVSEUploadTransactionService','UploadTransactionService')
 
AND t.transaction_type IN

('DCFCDownloadConfigService','L2DownloadConfigService','EVSEDownloadConfigService','DownloadConfigService','ConfigDownloadService','DCFCUploadConfigService','L2UploadConfigService','EVSEUploadConfigService','UploadConfigService','ConfigUploadService','L2GetAdPackageListService','AdPackageListService','L2GPSService','EVSEGPSService','GPSService','ReportErrorService','EVSEDownloadRevisionService','DCFCCommandService','L2CommandService','CommandService','DCFCErrorService','L2ErrorService','EVSEReportErrorService','ErrorService','DCFCHeartbeatService','L2HeartbeatService','HeartbeatService','DCFCAuthorizeService','L2AuthorizeService','AuthorizeService','DCFCGetAccessListService','L2GetAccessListService','GetAccessListService','DCFCSetAccessService','L2SetAccessService','SetAccessService','DCFCPackageDownloadService','L2PackageDownloadService','PackageDownloadService','DCFCReportInventoryService','L2ReportInventoryService','ReportInventoryService','DCFCTargetVersionService','L2TargetVersionService','TargetVersionService','DCFCPackageListService','L2PackageInfoService','PackageListService','DCFCPackageInfoService','L2PackageInfoService','PackageInfoService','DCFCRegisterService','L2AuthorizeCodeService',
'AuthorizeCodeService')
 AND u.evse_id = 1  AND t.create_date BETWEEN '2013-10-01'::date AND
'2013-10-15'::date + INTERVAL '1 day'
UNION
 SELECT ef.fee_id::text AS xfer_id, 'FEE' as transaction_type, ef.event_time
AS evse_time, ef.create_time AS transmit_time,
'' AS error_code, 'Fee Event' AS detail, COALESCE(ef.card_account_number,
'N/A') AS reported_card_account_number, '' AS event_id, '' AS event_code,
ef.evse_id::text, '' AS batch_id,
ef.port::text AS port, COALESCE(ef.client_charge_id, 'N/A') AS
charge_event_id
FROM evse_fee ef LEFT OUTER JOIN evse_unit eu ON eu.evse_id = ef.evse_id
WHERE ef.evse_id = 1  AND  ef.create_time BETWEEN '2013-10-01'::date AND
'2013-10-15'::date + INTERVAL '1 day'
) x
ORDER BY transmit_time DESC LIMIT 500

==========================================

Query plan:

Limit  (cost=101950.33..101950.40 rows=30 width=368) (actual
time=18.421..18.421 rows=0 loops=1)
  Output: ((t.transaction_id)::text), t.transaction_type, e.event_time,
t.create_date, t.error_code, (''::text), (COALESCE(e.reported_rfid,
'N/A'::text)), ((e.event_id)::text), ((e.event_code)::text),
((t.evse_id)::text), t.batch_id, e.port, (COALESCE(e.evse_charge_id,
'N/A'::text))
  Buffers: shared hit=5 read=7
  ->  Sort  (cost=101950.33..101950.40 rows=30 width=368) (actual
time=18.421..18.421 rows=0 loops=1)
        Output: ((t.transaction_id)::text), t.transaction_type,
e.event_time, t.create_date, t.error_code, (''::text),
(COALESCE(e.reported_rfid, 'N/A'::text)), ((e.event_id)::text),
((e.event_code)::text), ((t.evse_id)::text), t.batch_id, e.port,
(COALESCE(e.evse_charge_id, 'N/A'::text))
        Sort Key: t.create_date
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=5 read=7
        ->  HashAggregate  (cost=101948.99..101949.29 rows=30 width=95)
(actual time=4.414..4.414 rows=0 loops=1)
              Output: ((t.transaction_id)::text), t.transaction_type,
e.event_time, t.create_date, t.error_code, (''::text),
(COALESCE(e.reported_rfid, 'N/A'::text)), ((e.event_id)::text),
((e.event_code)::text), ((t.evse_id)::text), t.batch_id, e.port,
(COALESCE(e.evse_charge_id, 'N/A'::text))
              Buffers: shared hit=5 read=5
              ->  Append  (cost=0.00..101948.01 rows=30 width=95) (actual
time=4.412..4.412 rows=0 loops=1)
                    Buffers: shared hit=5 read=5
                    ->  Nested Loop  (cost=0.00..101163.24 rows=10
width=112) (actual time=4.397..4.397 rows=0 loops=1)
                          Output: (t.transaction_id)::text,
t.transaction_type, e.event_time, t.create_date, t.error_code, ''::text,
COALESCE(e.reported_rfid, 'N/A'::text), (e.event_id)::text,
(e.event_code)::text, (t.evse_id)::text, t.batch_id, e.port,
COALESCE(e.evse_charge_id, 'N/A'::text)
                          Buffers: shared read=4
                          ->  Index Scan using evse_unit_pkey on
public.evse_unit u  (cost=0.00..8.72 rows=1 width=4) (actual
time=4.395..4.395 rows=0 loops=1)
                                Output: u.evse_id
                                Index Cond: (u.evse_id = 123)
                                Buffers: shared read=4
                          ->  Nested Loop  (cost=0.00..101154.22 rows=10
width=112) (never executed)
                                Output: t.transaction_id,
t.transaction_type, t.create_date, t.error_code, t.evse_id, t.batch_id,
e.event_time, e.reported_rfid, e.event_id, e.event_code, e.port,
e.evse_charge_id
                                ->  Index Scan using
evse_transaction_evse_id_create_date_idx on public.evse_transaction t
(cost=0.00..380.04 rows=89 width=65) (never executed)
                                      Output: t.transaction_id,
t.transaction_type, t.create_date, t.error_code, t.evse_id, t.batch_id
                                      Index Cond: ((t.evse_id = 123) AND
(t.create_date >= '2013-10-07'::date) AND (t.create_date <= '2013-10-10
00:00:00'::timestamp without time zone))
                                ->  Index Scan using
evse_event_transaction_idx on public.evse_event e  (cost=0.00..1131.07
rows=98 width=51) (never executed)
                                      Output: e.event_id, e.transaction_id,
e.event_code, e.event_name, e.event_row, e.event_time, e.status,
e.status_detail, e.plug_event_id, e.charge_event_id, e.power_id, e.flow_id,
e.port, e.event_source, e.evse_id, e.reported_rfid, e.evse_charge_id
                                      Index Cond: (e.transaction_id =
t.transaction_id)
                                      Filter: (e.event_code = ANY

('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49}'::integer[]))
                    ->  Nested Loop  (cost=0.00..395.68 rows=18 width=88)
(actual time=0.011..0.011 rows=0 loops=1)
                          Output: (t.transaction_id)::text,
t.transaction_type, t.log_time, t.create_date, t.error_code, ''::text,
COALESCE(t.reported_card_account_number, 'N/A'::text), ''::text, ''::text,
(t.evse_id)::text, t.batch_id, ''::text, 'N/A'::text
                          Buffers: shared hit=2 read=1
                          ->  Index Scan using evse_unit_pkey on
public.evse_unit u  (cost=0.00..8.72 rows=1 width=4) (actual
time=0.010..0.010 rows=0 loops=1)
                                Output: u.evse_id
                                Index Cond: (u.evse_id = 1234)
                                Buffers: shared hit=2 read=1
                          ->  Index Scan using
evse_transaction_evse_id_create_date_idx on public.evse_transaction t
(cost=0.00..386.60 rows=18 width=88) (never executed)
                                Output: t.transaction_id,
t.transaction_type, t.log_time, t.create_date, t.error_code,
t.reported_card_account_number, t.evse_id, t.batch_id
                                Index Cond: ((t.evse_id = 1234) AND
(t.create_date >= '2013-10-07'::date) AND (t.create_date <= '2013-10-10
00:00:00'::timestamp without time zone))
                                Filter: ((t.api_error IS NULL) AND
(t.transaction_type <> ALL

('{DCFCTransactionService,L2TransactionService,EVSEUploadTransactionService,EVSEUploadTransactionService,UploadTransactionService}'::text[]))
AND (t.transaction_type = ANY

('{DCFCDownloadConfigService,L2DownloadConfigService,EVSEDownloadConfigService,DownloadConfigService,ConfigDownloadService,DCFCUploadConfigService,L2UploadConfigService,EVSEUploadConfigService,UploadConfigService,ConfigUploadService,L2GetAdPackageListService,AdPackageListService,L2GPSService,EVSEGPSService,GPSService,ReportErrorService,EVSEDownloadRevisionService,DCFCCommandService,L2CommandService,CommandService,DCFCErrorService,L2ErrorService,EVSEReportErrorService,ErrorService,DCFCHeartbeatService,L2HeartbeatService,HeartbeatService,DCFCAuthorizeService,L2AuthorizeService,AuthorizeService,DCFCGetAccessListService,L2GetAccessListService,GetAccessListService,DCFCSetAccessService,L2SetAccessService,SetAccessService,DCFCPackageDownloadService,L2PackageDownloadService,PackageDownloadService,DCFCReportInventoryService,L2ReportInventoryService,ReportInventoryService,DCFCTargetVersionService,L2TargetVersionService,TargetVersionService,DCFCPackageListService,L2PackageInfoService,PackageListService,DCFCPackageInfoService,L2PackageInfoService,PackageInfoService,DCFCRegisterService,L2AuthorizeCodeService,AuthorizeCodeService}'::text[])))
                    ->  Nested Loop  (cost=0.00..388.80 rows=2 width=80)
(actual time=0.002..0.002 rows=0 loops=1)
                          Output: (t.transaction_id)::text, 'ERROR'::text,
t.create_date, t.create_date, t.error_code, t.transaction_type,
COALESCE(t.reported_card_account_number, 'N/A'::text), ''::text, ''::text,
(t.evse_id)::text, t.batch_id, ''::text, 'N/A'::text
                          Buffers: shared hit=3
                          ->  Index Scan using evse_unit_pkey on
public.evse_unit u  (cost=0.00..8.72 rows=1 width=4) (actual
time=0.002..0.002 rows=0 loops=1)
                                Output: u.evse_id
                                Index Cond: (u.evse_id = 1234)
                                Buffers: shared hit=3
                          ->  Index Scan using
evse_transaction_evse_id_create_date_idx on public.evse_transaction t
(cost=0.00..380.04 rows=2 width=80) (never executed)
                                Output: t.transaction_id, t.create_date,
t.error_code, t.transaction_type, t.reported_card_account_number, t.evse_id,
t.batch_id
                                Index Cond: ((t.evse_id = 1234) AND
(t.create_date >= '2013-10-07'::date) AND (t.create_date <= '2013-10-10
00:00:00'::timestamp without time zone))
                                Filter: (t.api_error IS NOT NULL)
Total runtime: 18.611 ms








Re: Hot Standby performance issue

From
Tomas Vondra
Date:
Hi,

On 20.10.2013 19:58, sparikh wrote:
> Thanks so much Tomas and Kevin for your valuable inputs. I am
> getting very good responses from this forum and learning so much new
> stuff. I will try all those options and will keep you updated.
>
>
> standby_performance_issue.rar
> <http://postgresql.1045698.n5.nabble.com/file/n5775181/standby_performance_issue.rar>

Yup, this time it worked. Anyway, next time please consider posting the
explain plan through explain.depesz.com; it's way more readable than
plans wrapped by the mail client when posted inline.

For example this is your plan: http://explain.depesz.com/s/SBVg

However, this shows only an 18 ms runtime. Is this one of the slow
runs? I'm assuming it's not, because 18 ms seems quite fast to me. In
that case it's pretty much useless, because we need to see a plan for
one of the slow queries.

Stupid question - when you say that a query is fast on primary but slow
on standby, are you referring to exactly the same query, including
parameter values?

Or is the query running much longer than the reported 18 ms?

> On further digging I found from the New Relic report that as soon as
> I execute the query, IO spikes immediately (100%). But the same
> query on the primary executes very fast.

Well, that might just as well mean that the primary has the data in the
filesystem cache, and the standby needs to read it from disk. If you
repeat the query multiple times, do you still see the I/O spike?

How do you use the standby? Is it just sitting there most of the time,
or is it queried about as much as the primary?

BTW when checking the configuration info you've sent, I've noticed this:

  "hot_standby_feedback","off"

IIRC you've reported the query frequently crashes on the standby because
of replication conflicts. Why don't you set this to on?
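
I.e. in postgresql.conf on the standby (a reload should be enough to
pick it up):

   hot_standby_feedback = on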

> I am not sure if Postgres has some utility like Oracle's tkprof or
> AWR where I can pinpoint exactly where the query spends its time. I
> will try Tomas' suggestions, perf and strace.

No, at least not in the community version. But EXPLAIN ANALYZE is
usually a good guide.

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
Stupid question - when you say that a query is fast on primary but slow
on standby, are you referring to exactly the same query, including
parameter values?

Yes. It is exactly the same query with the same parameters. Yes, it
sounds stupid, but that is what is happening. Though the plan says it
takes 18 ms, it runs for more than 15-20 mins and finally returns with
a conflict error: "ERROR: canceling statement due to conflict with
recovery"

Even running EXPLAIN itself takes very long on the standby. Just
getting the execution plan on the standby is turning out to be a big
deal.

Regarding the IO spike, yes, I can understand that if the data is not
available in memory then it has to be read from disk. But the thing is,
the IO stays high until the query returns with the conflict error.

Thanks again.






Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 21.10.2013 17:05, sparikh wrote:
> Stupid question - when you say that a query is fast on primary but
> slow on standby, are you referring to exactly the same query,
> including parameter values?
>
> Yes. It is exactly the same query with the same parameters. Yes, it
> sounds stupid, but that is what is happening. Though the plan says
> it takes 18 ms, it runs for more than 15-20 mins and finally returns
> with a conflict error: "ERROR: canceling statement due to conflict
> with recovery"

OK.

> Even running EXPLAIN itself takes very long on the standby. Just
> getting the execution plan on the standby is turning out to be a big
> deal.

Do you mean EXPLAIN or EXPLAIN ANALYZE?

So far we've seen just EXPLAIN ANALYZE - can you try just EXPLAIN? If
it hangs, it's either because of something expensive in the planning,
or because of locking.

The locking is much more likely, because the primary is behaving just
fine and the resulting plan is exactly the same on both ends.

> Regarding the IO spike, yes, I can understand that if the data is
> not available in memory then it has to be read from disk. But the
> thing is, the IO stays high until the query returns with the
> conflict error.

I don't think the I/O is a problem at all, because the query takes just
18 milliseconds. However, that does not include planning, so a lot of
time spent waiting for a lock, or a lot of work done on the CPU, won't
be reported there.

What you can do to debug this is either look at pg_locks on the standby
for connections with "granted=f", or connect using psql and do this

  set log_lock_waits = true;
  set client_min_messages = log;

  EXPLAIN ... query ...;

and it should print what locks the connection is waiting for. Then you
may investigate further, e.g. check who's holding the lock in
pg_stat_activity etc.
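
The pg_locks check could look like this (just a sketch):

   SELECT locktype, relation::regclass, mode, pid
     FROM pg_locks
    WHERE NOT granted;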

But again, I think spending a single minute on this before upgrading to
the current version is a waste of time.

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
Yes, both EXPLAIN and EXPLAIN ANALYZE are taking time. As you suggested
I set the lock parameters, but no locks are observed. I also checked
pg_stat_activity, and none of the sessions are either waiting or
blocked.

I agree we must upgrade to the latest version (9.1.10), but
unfortunately, given the kind of resources (not only manpower) we have,
it is going to be an extremely challenging task for us. Of course, if
all other options fail then we have to take the tough route. No choice.

I am also working with the sys admin to rule out any issues at the OS
or VM level.

Thanks.





Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 21.10.2013 23:18, sparikh wrote:
> Yes, both EXPLAIN and EXPLAIN ANALYZE are taking time. As you
> suggested I set the lock parameters, but no locks are observed. I
> also checked pg_stat_activity, and none of the sessions are either
> waiting or blocked.

Not even the one running the explain? That's weird. Is the backend just
sitting there idle, or is it consuming some CPU?

If it's burning CPU, try to run perf to see where it's spending time.
We've already recommended this at least twice, but I haven't seen any
report yet. This might show if there are any issues with spinlocks
(which don't show in pg_locks etc.).

What about long-open transactions, not necessarily blocked/waiting?

  SELECT xact_start, current_query, waiting
    FROM pg_stat_activity ORDER BY xact_start ASC LIMIT 10;

> I agree we must upgrade to the latest version (9.1.10), but
> unfortunately, given the kind of resources (not only manpower) we
> have, it is going to be an extremely challenging task for us. Of
> course, if all other options fail then we have to take the tough
> route. No choice.

It's not really about this particular issue. As both Kevin and I
pointed out, there are some pretty important security fixes etc. You
need to update irrespective of this issue.

BTW minor upgrades (including 9.1.1 -> 9.1.10) are supposed to be
relatively simple, as the on-disk format remains the same. So it's a
matter of shutting down the machine, updating the binaries and starting
again. Of course, I'm not familiar with your setup and requirements.
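
A minimal sketch of such a minor upgrade (paths and package steps are
hypothetical; adjust to your install):

   pg_ctl stop -D /path/to/data     # stop the 9.1.1 server
   # ... install the 9.1.10 binaries (packages or from source) ...
   pg_ctl start -D /path/to/data    # start again on the same data dir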

> I am also working with the sys admin to rule out any issues at the
> OS or VM level.

OK, good. Post a summary of what you checked.

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
Yes, EXPLAIN without ANALYZE is taking long. It is weird. In
pg_stat_activity the EXPLAIN was the only query running, so the server
was almost idle. Using the New Relic interface I checked that the CPU
was almost idle - around 10-20%. There was some IO activity - around
40-50%.

I forgot to mention before that I could not run perf on the command
line, even with root permission. It says command not found. Maybe the
utility is not installed or not enabled.

I have attached a snapshot of vmstat taken while the explain was
running in the background. vmstat.txt
<http://postgresql.1045698.n5.nabble.com/file/n5775349/vmstat.txt>

Do you suggest that I remove all the data files from the /data/base
folder of the standby and rebuild it again using rsync from the
primary? Do you see any issues there? This is just to rule out any
fragmentation on the standby side.

Our sys admin is planning to run fsck sometime today or tomorrow.

Thanks.











Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 22.10.2013 00:59, sparikh wrote:
> Yes, EXPLAIN without ANALYZE is taking long. It is weird. In
> pg_stat_activity the EXPLAIN was the only query running, so the
> server was almost idle. Using the New Relic interface I checked that
> the CPU was almost idle - around 10-20%. There was some IO activity
> - around 40-50%.
>
> I forgot to mention before that I could not run perf on the command
> line, even with root permission. It says command not found. Maybe
> the utility is not installed or not enabled.

Obviously you need to install it ... maybe ask your sysadmin to do that.

> I have attached a snapshot of vmstat taken while the explain was
> running in the background. vmstat.txt
> <http://postgresql.1045698.n5.nabble.com/file/n5775349/vmstat.txt>

The vmstat clearly shows that ~1 CPU is waiting on I/O. Hmm, I'm really
wondering what's going on here - I can't think of a case where this
would happen with a plain EXPLAIN ...

We really need the perf results. Or try to run strace, maybe it'll give
more info about which files it accesses.
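
Something like this, attached to the PID of the backend (the syscall
list is just a starting point):

   strace -tt -p PID -e trace=open,read,lseek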

> Do you suggest that I remove all the data files from the /data/base
> folder of the standby and rebuild it again using rsync from the
> primary? Do you see any issues there? This is just to rule out any
> fragmentation on the standby side.

The EXPLAIN really should not do much I/O. I doubt it has anything to do
with fragmentation, so I doubt this is going to help.

> Our sys admin is planning to run fsck sometime today or tomorrow.

OK. Which filesystem do you use, btw?

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
> Do you suggest that I remove all the data files from the /data/base
> folder of the standby and rebuild it again using rsync from the
> primary? Do you see any issues there? This is just to rule out any
> fragmentation on the standby side.

The EXPLAIN really should not do much I/O. I doubt it has anything to do
with fragmentation, so I doubt this is going to help.

Actually I was referring to this in the context of addressing the main
underlying performance issue, not the EXPLAIN. Sorry, I may not have
communicated it correctly.

Even strance does not seem to be installed.

The filesystem type it shows is ext3.

Thanks.





Re: Hot Standby performance issue

From
Magnus Hagander
Date:


On Oct 22, 2013 1:14 AM, "Tomas Vondra" <tv@fuzzy.cz> wrote:
>
> On 22.10.2013 00:59, sparikh wrote:
> > Yes, EXPLAIN without ANALYZE is taking long. It is weird. In
> > pg_stat_activity the EXPLAIN was the only query running, so the
> > server was almost idle. Using the New Relic interface I checked
> > that the CPU was almost idle - around 10-20%. There was some IO
> > activity - around 40-50%.
> >
> > I forgot to mention before that I could not run perf on the
> > command line, even with root permission. It says command not
> > found. Maybe the utility is not installed or not enabled.
>
> Obviously you need to install it ... maybe ask your sysadmin to do that.
>
> > I have attached the snapshot of vmstat while explain was running in
> > background. vmstat.txt
> > <http://postgresql.1045698.n5.nabble.com/file/n5775349/vmstat.txt>
>
> The vmstat clearly shows that ~1 CPU is waiting on I/O. Hmm, I'm really
> wondering what's going on here - I can't think of a case where this
> would happen with a plain EXPLAIN ...

Catalog bloat could make that happen. Though that should show up on the
master as well, it could be that it's cached there, and therefore only
shows up as CPU and not I/O, and is therefore not noticed.

/Magnus

Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 22.10.2013 06:49, Magnus Hagander wrote:
>
> On Oct 22, 2013 1:14 AM, "Tomas Vondra" <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
>>
>> On 22.10.2013 00:59, sparikh wrote:
>> > Yes, EXPLAIN without ANALYZE is taking long. It is weird. In
>> > pg_stat_activity the EXPLAIN was the only query running, so the
>> > server was almost idle. Using the New Relic interface I checked
>> > that the CPU was almost idle - around 10-20%. There was some IO
>> > activity - around 40-50%.
>> >
>> > I forgot to mention before that I could not run perf on the
>> > command line, even with root permission. It says command not
>> > found. Maybe the utility is not installed or not enabled.
>>
>> Obviously you need to install it ... maybe ask your sysadmin to do that.
>>
>> > I have attached the snapshot of vmstat while explain was running in
>> > background. vmstat.txt
>> > <http://postgresql.1045698.n5.nabble.com/file/n5775349/vmstat.txt>
>>
>> The vmstat clearly shows that ~1 CPU is waiting on I/O. Hmm, I'm really
>> wondering what's going on here - I can't think of a case where this
>> would happen with a plain EXPLAIN ...
>
> Catalog bloat could make that happen. Though that should show up on
> the master as well, it could be that it's cached there, and
> therefore only shows up as CPU and not I/O, and is therefore not
> noticed.

Maybe, but sparikh reported the query to be running for ~20 minutes.
That'd be a hell of a lot of bloat.

Sparikh, can you show us the size of system catalogs? Something like

   SELECT relname, relpages FROM pg_class
    WHERE relname LIKE 'pg%'
    ORDER BY relpages DESC LIMIT 20;

Should give the same results both on primary and standby.


Tomas


Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 22.10.2013 02:00, sparikh wrote:
>> Do you suggest that I remove all the data files from the /data/base
>> folder of the standby and rebuild it again using rsync from the
>> primary? Do you see any issues there? This is just to rule out any
>> fragmentation on the standby side.
>
> The EXPLAIN really should not do much I/O. I doubt it has anything to do
> with fragmentation, so I doubt this is going to help.
>
> Actually I was referring to this in the context of addressing the
> main underlying performance issue, not the EXPLAIN. Sorry, I may not
> have communicated it correctly.
>
> Even strance does not seem to be installed.

It's 'strace' (aka syscall trace), not 'strance'. Please install both
perf and strace and try to collect some information about the backend
executing the slow query. We're mostly speculating and we need the data.

Try perf first - it's basically a profiler and the results are usually
understandable. Even a simple "perf top" can give us a hint.

Strace is much more low-level and much more difficult to analyze.

> The filesystem type it shows is ext3.

OK. Not the best filesystem IMHO, but I doubt it's related to the issue
we're discussing here.

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
From Primary:

relname    relpages
pg_toast_17673    1812819
pg_toast_17594    161660
pg_toast_17972    121902
pg_toast_17587    77190
pg_toast_18537    29108
pg_toast_17578    26638
pg_toast_17673_index    19984
pg_toast_17868    14911
pg_toast_17594_index    2208
pg_toast_1072246    1922
pg_toast_17587_index    1510
pg_toast_17972_index    1399
pg_statistic    911
pg_toast_18694    883
pg_toast_17578_index    375
pg_attribute    336
pg_toast_16475    332
pg_toast_18537_index    321
pg_proc    233
pg_depend_depender_index    176

From Secondary :
============
relname    relpages
pg_toast_17673    1812819
pg_toast_17594    161660
pg_toast_17972    121902
pg_toast_17587    77190
pg_toast_18537    29108
pg_toast_17578    26638
pg_toast_17673_index    19984
pg_toast_17868    14911
pg_toast_17594_index    2208
pg_toast_1072246    1922
pg_toast_17587_index    1510
pg_toast_17972_index    1399
pg_statistic    911
pg_toast_18694    883
pg_toast_17578_index    375
pg_attribute    336
pg_toast_16475    332
pg_toast_18537_index    321
pg_proc    233
pg_depend_depender_index    176

Yes, the results look the same on both primary and standby.






Re: Hot Standby performance issue

From
sparikh
Date:
Sorry, it was a typo on my side. I meant strace.

I will try to request both perf and strace to be installed. But I am
not quite sure, as the VMs are managed by a third party. Will keep you
posted...

The main thing puzzling me is that EXPLAIN ANALYZE says it takes a
couple of seconds to execute the operation, but in reality it runs for
more than 20 minutes.

Thanks.





Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 22.10.2013 23:41, sparikh wrote:
> From Primary:
>
> relname    relpages
> pg_toast_17673    1812819
> pg_toast_17594    161660
> pg_toast_17972    121902
> pg_toast_17587    77190
> pg_toast_18537    29108
> pg_toast_17578    26638
> pg_toast_17673_index    19984
> pg_toast_17868    14911
> pg_toast_17594_index    2208
> pg_toast_1072246    1922
> pg_toast_17587_index    1510
> pg_toast_17972_index    1399
> pg_statistic    911
> pg_toast_18694    883
> pg_toast_17578_index    375
> pg_attribute    336
> pg_toast_16475    332
> pg_toast_18537_index    321
> pg_proc    233
> pg_depend_depender_index    176
>
> From Secondary :
> ============
> relname    relpages
> pg_toast_17673    1812819
> pg_toast_17594    161660
> pg_toast_17972    121902
> pg_toast_17587    77190
> pg_toast_18537    29108
> pg_toast_17578    26638
> pg_toast_17673_index    19984
> pg_toast_17868    14911
> pg_toast_17594_index    2208
> pg_toast_1072246    1922
> pg_toast_17587_index    1510
> pg_toast_17972_index    1399
> pg_statistic    911
> pg_toast_18694    883
> pg_toast_17578_index    375
> pg_attribute    336
> pg_toast_16475    332
> pg_toast_18537_index    321
> pg_proc    233
> pg_depend_depender_index    176
>
> Yes, the results look the same on both primary and standby.

Yes. And it also shows that the really interesting tables (e.g.
pg_class) are not bloated.

Tomas


Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 22.10.2013 23:50, sparikh wrote:
> Sorry, it was a typo on my side. I meant strace.

OK.

>
> I will try to request both perf and strace to be installed. But I am
> not quite sure, as the VMs are managed by a third party. Will keep
> you posted...

What do you mean by VM? Is this a virtualized environment or bare hardware?

>
> The main thing puzzling me is that EXPLAIN ANALYZE says it takes a
> couple of seconds to execute the operation, but in reality it runs
> for more than 20 minutes.

So, now I'm getting confused. Can you please provide timings for each
case, i.e. how long it takes to execute:

1) plain query
2) explain query
3) explain analyze query

Execute each of those a couple of times, and let us know about
significant variations.

It should always be

       EXPLAIN ANALYZE >= query >= EXPLAIN

If you're reporting you see "EXPLAIN ANALYZE < query" then I find that
really strange.
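
In psql you can use \timing to get the client-side duration for each
case (query text as before):

   \timing
   EXPLAIN SELECT ...;
   EXPLAIN ANALYZE SELECT ...;
   SELECT ...;   -- the plain query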

Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
> I will try to request both perf and strace to be installed. But I am
> not quite sure, as the VMs are managed by a third party. Will keep
> you posted...

What do you mean by VM? Is this a virtualized environment or bare hardware?

Yes, they are virtualized environments.

Sorry about the confusion. I was just going by the explain plan report,
e.g. at the bottom of the explain plan report it says "Total runtime:
1698.453 ms" (this is with the analyze option).

But from the client perspective (whether run from pgAdmin or directly
from the server command line) it takes more than 20 min to display the
output.





Re: Hot Standby performance issue

From
ramistuni
Date:
This morning I found that the performance issue on the standby database
had fixed itself. On further investigation I found that autovacuum had
kicked in yesterday on the primary for one of the biggest tables used
in this query. The last time autovacuum ran on it was on Sep 30th.

I suspect that this is what fixed the issue. The table has update and
delete operations. The only thing I did not understand is why Postgres
did not pick this table for autovacuum all these days, in spite of it
being one of the busiest tables from a DML perspective. I was
monitoring the primary database activity all these days and could
always see autovacuum running on other tables.

Thanks.





Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 25.10.2013 23:22, ramistuni wrote:
> This morning I found that the performance issue on the standby
> database had fixed itself. On further investigation I found that
> autovacuum had kicked in yesterday on the primary for one of the
> biggest tables used in this query. The last time autovacuum ran on
> it was on Sep 30th.
>
> I suspect that this is what fixed the issue. The table has update
> and delete operations. The only thing I did not understand is why
> Postgres did not pick this table for autovacuum all these days, in
> spite of it being one of the busiest tables from a DML perspective.
> I was monitoring the primary database activity all these days and
> could always see autovacuum running on other tables.

Hi,

I don't see how this would explain the issues you've reported, i.e. the
query running fast on the primary and very slow on the standby. That
suggests the problem is somehow connected to replication conflict
resolution. However, I don't see a reason why a query should take so
much longer before failing due to a conflict.

To find out why the autovacuum didn't trigger on the largest/busiest
table, you should probably check your logs for autovacuum failures
and/or cancels.

Then we'll need to know the basic features of the table (most
importantly how many rows there are) and the autovacuum thresholds.
It's possible that the table is simply way bigger than the other
tables, and thus it takes more time to accumulate enough "dead rows" to
trigger autovacuum. Or maybe most of the cleanup was handled by HOT,
i.e. not requiring a vacuum at all.
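
Roughly, autovacuum considers a table once (a sketch of the formula,
with the default settings):

   n_dead_tup > autovacuum_vacuum_threshold
              + autovacuum_vacuum_scale_factor * reltuples
   -- with the defaults (50 and 0.2), a table with ~100M rows needs
   -- ~20 million dead rows before autovacuum even looks at it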

I think you need to check these fields in pg_stat_all_tables

  SELECT n_live_tup, n_dead_tup,
         n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd
    FROM pg_stat_all_tables
   WHERE relname = '... tablename ...'


regards
Tomas


Re: Hot Standby performance issue

From
sparikh
Date:
Hi,

Yes, you are right. The table is the biggest one. Please find below the
information you requested. I agree that the idea that autovacuum
running on this table would fix the performance issue on the standby
does not sound very convincing. But that is the only thing I could
correlate with the query on the standby starting to work again.
Otherwise there were absolutely no changes at the code level, database
level or OS level.
As of now the query is still working fine on the standby.

I may be wrong, but could it be the case that the standby disk was too
fragmented compared to the primary, and autovacuum on the primary fixed
that? (Assuming autovacuum on the primary internally triggers the same
on the standby.)


Sequential Scans    18
Sequential Tuples Read    1355777067
Index Scans    102566124
Index Tuples Fetched    67155748
Tuples Inserted    16579520
Tuples Updated    17144291
Tuples Deleted    24383607
Tuples HOT Updated    1214531
Live Tuples    101712125
Dead Tuples    3333207
Heap Blocks Read    420703920
Heap Blocks Hit    496135814
Index Blocks Read    66807468
Index Blocks Hit    916783267
Toast Blocks Read    310677
Toast Blocks Hit    557735
Toast Index Blocks Read    6959
Toast Index Blocks Hit    936473
Last Vacuum
Last Autovacuum    2013-10-25 02:47:09.914775-04
Last Analyze
Last Autoanalyze    2013-10-25 18:39:25.386091-04
Vacuum counter    0
Autovacuum counter    2
Analyze counter    0
Autoanalyze counter    4
Table Size    46 GB
Toast Table Size    615 MB
Indexes Size    20 GB





Re: Hot Standby performance issue

From
sparikh
Date:
The table statistics I sent before were from the primary. The following
are from the standby.

Index Tuples Fetched    25910277
Tuples Inserted    0
Tuples Updated    0
Tuples Deleted    0
Tuples HOT Updated    0
Live Tuples    0
Dead Tuples    0
Heap Blocks Read    138482386
Heap Blocks Hit    1059169445
Index Blocks Read    4730561
Index Blocks Hit    9702556
Toast Blocks Read    1165
Toast Blocks Hit    82
Toast Index Blocks Read    85
Toast Index Blocks Hit    3055
Last Vacuum
Last Autovacuum
Last Analyze
Last Autoanalyze
Vacuum counter    0
Autovacuum counter    0
Analyze counter    0
Autoanalyze counter    0
Table Size    46 GB
Toast Table Size    615 MB
Indexes Size    20 GB





Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 28.10.2013 21:57, sparikh wrote:
> The table statistics I sent before were from the primary. The
> following are from the standby.
>
> Index Tuples Fetched    25910277
> Tuples Inserted    0
> Tuples Updated    0
> Tuples Deleted    0
> Tuples HOT Updated    0
> Live Tuples    0
> Dead Tuples    0
> Heap Blocks Read    138482386
> Heap Blocks Hit    1059169445
> Index Blocks Read    4730561
> Index Blocks Hit    9702556
> Toast Blocks Read    1165
> Toast Blocks Hit    82
> Toast Index Blocks Read    85
> Toast Index Blocks Hit    3055
> Last Vacuum
> Last Autovacuum
> Last Analyze
> Last Autoanalyze
> Vacuum counter    0
> Autovacuum counter    0
> Analyze counter    0
> Autoanalyze counter    0
> Table Size    46 GB
> Toast Table Size    615 MB
> Indexes Size    20 GB

Why have you skipped some of the rows posted for primary? E.g. the
sequential scans info?

Anyway, I think the new data are not going to help us, as the issue
somehow resolved itself, so the current data are unlikely to show the
original cause.

You can either wait to see whether it happens again, or dig through the
logs to see if / why autovacuum was not running on this table.

Tomas


Re: Hot Standby performance issue

From
Tomas Vondra
Date:
On 28.10.2013 21:23, sparikh wrote:
> Hi,
>
> Yes, you are right. The table is the biggest one. Please find below
> the information you requested. I agree that the idea that autovacuum
> running on this table would fix the performance issue on the standby
> does not sound very convincing. But that is the only thing I could
> correlate with the query on the standby starting to work again.
> Otherwise there were absolutely no changes at the code level,
> database level or OS level.
> As of now the query is still working fine on the standby.
>
> I may be wrong, but could it be the case that the standby disk was
> too fragmented compared to the primary, and autovacuum on the
> primary fixed that? (Assuming autovacuum on the primary internally
> triggers the same on the standby.)

I find it very unlikely, but you didn't give us the necessary data
(say, how much free space was on the disks, etc.). The best way to
pinpoint the issue would have been to run a profiler (which we have
repeatedly asked you to do), but now that the issue has disappeared we
can only guess.

Please monitor the system, and if it happens again run perf or another
profiler so that we know where the time is spent.

Tomas