Thread: Delay in completion of aggregation inserts when run in a single commit - PG 9.1.2

Hi All,

 

We have a reporting application that consumes the transaction log created by a billing system. The flow of data in the ETL framework is TRANSACTION_LOG -> STAGING_TABLES -> FACT_TABLES -> AGGREGATE_TABLES.

 

There are around 15 aggregate tables to be populated from these staging tables. Each run involves about 1.2 million records on average, and we expect close to 10 such runs per day.

 

Problem:

- Staging table population happens in the expected time: anywhere from 10-15 minutes for each process.

- The aggregation process, which is a set of 15 insert queries, almost never completes. The entire process happens in a single commit (population of staging tables and aggregation).

- When I run these 15 queries individually, they complete in no time.

 

Could you be kind enough to explain how queries run differently in a “single commit” as opposed to running each of them individually?

 

----

 

The system is CentOS 5.5, with PostgreSQL 9.1.2 installed.

 

There are 3 Postgres servers installed, on ports 5432, 5433 and 5434. The machine has 16GB of RAM, with 3 NetApp storage boxes of 500GB each mounted.

 

postgresql.conf remains untouched, except for logging, which has been set to “all”.

 

----

 

Any info greatly appreciated.

 

Thanks in advance,

Akash.

On 07/05/2012 02:30 AM, Akash Kodibail wrote:
> -Staging table population happens in expected time. Anywhere from
> 10-15 minutes for each process.
>
> -Aggregation process almost never completes which is a set of 15
> insert queries. Entire process happens in a single commit (population
> of staging tables and aggregation).
>
When you say "never completes", can you be more specific?

Is the query not progressing - say, waiting on a lock?

Is it doing work, but performing very poorly?

Does it terminate with a failure? If so, error message?

If the query just keeps running but you're not sure what it is or isn't
doing, you should examine pg_stat_activity and pg_locks. From
pg_stat_activity you can get the backend process ID; examine that
backend using system tools (ps, top, etc) to see if it's using CPU and
whether it's doing disk I/O. From pg_locks you can get an idea of
whether a query is stuck waiting on a lock.
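
For example, something along these lines should show what each backend is doing and whether any lock request is stuck ungranted (column names as in 9.1, where the backend pid column is called procpid):

    -- what each backend is running, and whether it is waiting on a lock
    SELECT procpid, waiting, current_query
    FROM pg_stat_activity;

    -- lock requests that have not been granted
    SELECT pid, locktype, relation::regclass, mode, granted
    FROM pg_locks
    WHERE NOT granted;
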
> Could you be kind enough to explain how differently queries run when
> in a “single commit” as opposed to running each of them individually?
There shouldn't be tons of difference. The first queries continue to
hold locks while the others run, but since you're in the same
transaction it won't affect subsequent queries, only concurrent ones.

If you have other concurrent work going on it's possible you're running
into locking issues that you don't encounter when you release locks with
individual commits.

Also, some resources can only be released when a transaction commits. If
you're doing something like using huge numbers of savepoints or PL/PgSQL
BEGIN ... EXCEPTION blocks it's possible you're running into resource
issues. The SAVEPOINT code has improved massively since 8.3 when I had
similar issues, but savepoints still aren't free.
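
A minimal, purely illustrative sketch of the kind of pattern I mean (not your code): each EXCEPTION block is implemented with a savepoint, so a row-by-row loop like this costs one subtransaction per iteration:

    CREATE TEMP TABLE demo_rows (id integer PRIMARY KEY);

    DO $$
    BEGIN
        FOR i IN 1..1000 LOOP
            BEGIN
                INSERT INTO demo_rows VALUES (i % 100);  -- forces duplicates
            EXCEPTION WHEN unique_violation THEN
                NULL;  -- error swallowed, but a savepoint was still used
            END;
        END LOOP;
    END;
    $$;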

Start by looking at the locking situation, and by determining whether
your query is making slow forward progress or is simply waiting.
>
> Postgresql.conf remains untouched, except for log writing which has
> been made to “all”.
>
That's going to be VERY sub-optimal if you're bulk-loading data. You'll
need to push checkpoint_segments right up. Check your logs; you're
probably seeing warnings about checkpoints being too frequent. Hell, my
checkpoint_segments is at 50 on my *laptop* - albeit a very fast
SSD-equipped laptop that does lots of bulk data unit testing.

You'll want to push effective_cache_size up to 4 or 5 GB given you have
three instances. Hard to say re shared_buffers, I have little experience
with running multiple instances on one machine.
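
As a rough, illustrative starting point only (values you would need to tune per instance, not a precise recommendation):

    # illustrative per-instance settings for a 16GB machine running three clusters
    checkpoint_segments = 32
    checkpoint_completion_target = 0.7
    effective_cache_size = 4GB
    shared_buffers = 1GB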

--
Craig Ringer

Akash Kodibail wrote:
> Problem:
[...]
>
> - Aggregation process almost never completes which is a set of 15
> insert queries. Entire process happens in a single commit (population
> of staging tables and aggregation).
>
> - When I run these 15 queries individually, it happens in no time.
>
> Could you be kind enough to explain how differently queries run when
> in a "single commit" as opposed to running each of them individually?

Apart from minor differences (current_timestamp, ...) the
main difference is that there is no COMMIT after each query.
COMMIT is expensive since it requires a write to hard disk.

Try to get the execution plans of the queries in both cases
and see if you can spot a difference.
The auto_explain contrib module might come in handy for that.
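
For example, loaded in the session that runs the batch (parameter names as documented for the 9.1 contrib module):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '60s';  -- log plans of statements slower than this
    SET auto_explain.log_analyze = on;          -- include actual timings/row counts (adds overhead)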

Yours,
Laurenz Albe

On 07/05/2012 04:00 PM, Albe Laurenz wrote:
> Akash Kodibail wrote:
>> Problem:
> [...]
>> - Aggregation process almost never completes which is a set of 15
>> insert queries. Entire process happens in a single commit (population
>> of staging tables and aggregation).
>>
>> - When I run these 15 queries individually, it happens in no time.
>>
>> Could you be kind enough to explain how differently queries run when
>> in a "single commit" as opposed to running each of them individually?
> Apart from minor differences (current_timestamp, ...) the
> main difference is that there is no COMMIT after each query.
> COMMIT is expensive since it requires a write to hard disk.
>
> Try to get the execution plans of the queries in both cases
> and see if you can spot a difference.

Actually, that's a thought. Autovacuum can't see uncommitted work, and
thus can't analyze it. Maybe they need an explicit ANALYZE or two after
a bulk insert or update during their data load.
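
E.g., inside the same transaction, after each staging table is filled (table name purely illustrative):

    ANALYZE staging_cdr;  -- repeat for each freshly loaded table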

--
Craig Ringer

@@Craig

I could only wait for 6 hours. I cancelled the process after that.

No locks on any of the tables when running.

Yes, it was progressing. Of the 15 tables, the 3rd was being populated
(as seen in pg_stat_activity) when I cancelled after 6 hours.

Below are some changes I have in postgresql.conf, some based on your
suggestions:

1) constraint_exclusion = partition, which I had already set; partitioning is
supported by a trigger on every insert, with around 100 checks for each table
(date constraints).
2) log_destination = 'stderr'
3) logging_collector = on
4) autovacuum = off, we are doing daily vacuum analyze on all tables.
5) shared_buffers = 32MB
6) max_prepared_transactions = 100
7) work_mem = 16MB
8) maintenance_work_mem = 64MB
9) wal_writer_delay = 1000ms
10) checkpoint_segments = 16

After these changes the process is moving forward, but I feel that it is only
a matter of time before I reach the next choking point, since process
completion is taking a little longer each time I run it.

Also, something I noticed when I run the query:

select relname, age(relfrozenxid) from pg_class order by age(relfrozenxid)
desc;

                     relname                      |    age
--------------------------------------------------+------------
 pg_toast_28344_index                             | 2147483647
 pg_toast_28351_index                             | 2147483647
 pg_toast_33106_index                             | 2147483647
 pg_toast_33099_index                             | 2147483647
 pg_toast_32128_index                             | 2147483647
 pg_toast_28232_index                             | 2147483647
 pg_toast_33092_index                             | 2147483647
 promo_seq                                        | 2147483647
 pg_toast_33085_index                             | 2147483647
 pg_toast_32135_index                             | 2147483647
 pg_toast_33120_index                             | 2147483647
...... There are many more.
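
Restricting the same query to ordinary tables may give a more meaningful picture, since indexes and the like seem to always report the maximum age:

    select relname, age(relfrozenxid) from pg_class
    where relkind = 'r'
    order by age(relfrozenxid) desc;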

Regards,
Akash.


On 07/06/2012 01:06 PM, Akash wrote:
> @@Craig
>
> I could only wait for 6 hours. I cancelled the process after that.
>
> No locks on any of the tables when running.
That's a very ambiguous statement. I'm assuming you mean "our code does not take any explicit locks on those tables using LOCK TABLE or SELECT ... FOR SHARE / FOR UPDATE", because there most certainly are locks on those tables whenever you're running a query against them. Even a simple SELECT takes out a lock to prevent the table from being dropped while the query runs.

It could be that no locks were taken by transactions other than the one doing the run, though.

> Yes, it was progressing. Of the 15 tables, the 3rd was being populated
> (as seen in pg_stat_activity) when I cancelled after 6 hours.
OK. If you turn auto_explain on and use it to get a plan for the slow queries, do those plans differ from the plans produced when running the same queries standalone?

Did you examine iostat? Did you look at what work the postgres process was doing, to see whether it was busy with CPU or I/O (iowait) or idle waiting for something else to happen?

> 4) autovacuum = off, we are doing daily vacuum analyze on all tables.
No!

Unless your database has no write activity other than these batch loads, you should have autovacuum on. The more frequently autovacuum runs, the better a job it will do of preventing bloat, maintaining table stats, etc.

If you have tables that are only ever modified by a bulk-load script that then immediately vacuums them, set the per-table autovacuum parameters so those tables are excluded; don't turn autovacuum off entirely.
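
Something along these lines, for instance (table name purely illustrative):

    ALTER TABLE staging_cdr SET (autovacuum_enabled = false);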


> 5) shared_buffers = 32MB
That's tiny and will severely constrain Pg's resources.

I notice you didn't set effective_cache_size either.

> After these changes the process is moving forward, but I feel that it is
> only a matter of time before I reach the next choking point, since process
> completion is taking a little longer each time I run it.
Well, your tables and indexes are probably bloating horribly because autovacuum is turned off, so that's not surprising.

If you drop and recreate, or TRUNCATE, the tables between load runs you might be OK with autovac off for those tables, but what you're describing makes me think otherwise.

--
Craig Ringer

Error: could not send data to client: Connection reset by peer

From: Eduardo Sá dos Reis
Hi,
 
In my Postgres log the following messages are occurring:
 
1 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG:  connection received: host=10.20.0.1 port=2368
2 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG:  connection authorized: user=usu_cad database=prd_cad
3 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG:  could not send data to client: Connection reset by peer
4 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG:  unexpected EOF on client connection
 
Users cannot run the system.
 
Does anyone know what might be happening?
 
Thanks.
 
Eduardo Reis

Re: Error: could not send data to client: Connection reset by peer

From: Craig Ringer
On 07/09/2012 09:26 PM, Eduardo Sá dos Reis wrote:
> Hi,
>
> In my Postgres log the following messages are occurring:
>
> 1 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG:  connection received: host=10.20.0.1 port=2368
> 2 4ffad9f4.67db 2012-07-09 10:17:40 BRT LOG:  connection authorized: user=usu_cad database=prd_cad
> 3 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG:  could not send data to client: Connection reset by peer
> 4 4ffad9f4.67db 2012-07-09 10:17:42 BRT LOG:  unexpected EOF on client connection
>
> Users cannot run the system.

If this is a Windows system with a 3rd party software firewall (Windows Firewall is OK, anything else isn't) then remove or disable the firewall and see if the problem goes away.

Otherwise, more info please. PostgreSQL version, OS and version, client program, whether the client program is using psqlODBC, pgJDBC, libpq directly, etc.

--
Craig Ringer