Large object loading stalls - Mailing list pgsql-general
From | Michael Akinde |
---|---|
Subject | Large object loading stalls |
Date | |
Msg-id | 499C0CCD.7010304@met.no Whole thread Raw |
Responses |
Re: Large object loading stalls
|
List | pgsql-general |
Hi, We have been running into a strange problem on our production servers of a system we are about to roll out with the loading of large objects. At various times (usually separated by a few days), the loading process against the server will suddenly either take an extremely long time (going from the usual 15-20 minutes up to 45 or more minutes) or simply hang up altogether (i.e., when 8+ hours have gone by, we haven't had the patience to wait and see if it eventually finishes). This is on an 8.3.3 server running on debian etch - most of the previous testing occurred on 8.1.10, where we did not encounter this problem (though we had other issues on that platform). I have run some tests with the loading of large objects into a PostgreSQL database (8.3.3) using a more rapid turnaround (loading, deleting, vacuuming directly and then starting the loading over again) than in production and have been able to provoke what seems to be the same issue there - usually within 12 hours. Essentially, we have a C++ program (pqxx) loading a large number of large objects (somewhere around a 80,000 or so) exhibiting strange behavior. It varies quite a lot in loading time (from 1600 to 2500 seconds), and at random times it will simply just hang up (at least it seems so, given that it has not progressed at all after 7-8 hours). It displays this behaviour irrespective of whether we load all 80000 fields in one go, or the loading is broken up into smaller continuos bits of 1000 blobs at a time (though it does seem easier to provoke the problem with the former loading approach). It can run fine for hours at a time, and then suddenly lock up (in the latest test I did, it loaded the data correctly 22 times in a row, before locking up on the 23rd run through). The pattern seems to be identical each time; ps aux shows the process (wdb) idle in transaction, with no activity on the system being apparent: postgres 19965 0.0 0.0 72628 5584 ? S Feb17 0:01 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main -c config_file=/etc/postgresql/8.3/main/postgresql.conf postgres 19967 0.4 0.0 72748 27636 ? Ss Feb17 6:38 postgres: writer process postgres 19968 0.0 0.0 72628 1784 ? Ss Feb17 0:09 postgres: wal writer process postgres 19969 0.0 0.0 72940 1924 ? Ss Feb17 0:00 postgres: autovacuum launcher process postgres 19970 0.0 0.0 42352 1736 ? Ss Feb17 0:02 postgres: stats collector process postgres 16617 0.0 0.0 95156 10948 ? Ss 08:42 0:00 postgres: autovacuum worker process wdb postgres 18834 0.0 0.1 118632 37320 ? Ss 09:06 0:01 postgres: wdb wdb [local] idle in transaction A core dump on the program we did earlier seemed to indicate that the program was waiting on the database to return on the insert statement. top indicates that both postgres and the application are idling (no load on the server whatsoever). pg_locks and pg_stat_activity seem to indicate that there may perhaps be some issue with autovacuum? locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted ------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+--------- relation | 16385 | 2683 | | | | | | | | 2/1122117 | 16617 | RowExclusiveLock | t relation | 16385 | 10969 | | | | | | | | 3/1105 | 27895 | AccessShareLock | t relation | 16385 | 2613 | | | | | | | | 2/1122117 | 16617 | ShareUpdateExclusiveLock | t virtualxid | | | | | 3/1105 | | | | | 3/1105 | 27895 | ExclusiveLock | t virtualxid | | | | | 2/1122117 | | | | | 2/1122117 | 16617 | ExclusiveLock | t (5 rows) datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port -------+---------+---------+----------+----------+------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+------------- 16385 | wdb | 16617 | 10 | postgres | autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject | f | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.065857+00 | | 16385 | wdb | 27895 | 16384 | wdb | select * from pg_stat_activity ; | f | 2009-02-18 12:21:47.925813+00 | 2009-02-18 12:21:47.925813+00 | 2009-02-18 11:09:31.201507+00 | | -1 16385 | wdb | 18834 | 16384 | wdb | <IDLE> in transaction | f | | 2009-02-18 09:06:26.479633+00 | 2009-02-18 09:06:23.443698+00 | | -1 (3 rows) Any leads as to how to figure out the problem or alternatively how to work around it would be much appreciated. I noticed that the disk on the test machine seemed to be running low on disk caused by an error in the fsm settings, but this was not the case in the production systems, so I assume that this is not the problem here either (will fix it for future test runs, though). Regards, Michael Akinde
Attachment
pgsql-general by date: