Thread: How do I track down a possible locking problem?

How do I track down a possible locking problem?

From: Herouth Maoz
I have a production system using PostgreSQL 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

The queue never has more than 27,000 messages at the same time, and all together, the parallel modules process about 5000 or 6000 messages per minute. So basically, the delay for a single message should never be more than a few minutes. Even if one module gets stuck, another will grab the next message from the queue. I believe the only way for a message to be stuck for so long would be for it to be grabbed by a module, and then for the database write to be somehow delayed for an hour, although it's merely a simple insert performed with a prepared statement.

The database in production is very busy with millions of writes per hour. Could there be a situation in which a particular connection gets "starved" while other connections are able to run queries without noticeable delay?

How can I track such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each module and the average latency. None of the four modules running has a long average latency or a low number of messages processed, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

TIA,
Herouth



Re: How do I track down a possible locking problem?

From: Herouth Maoz
Is there a more appropriate place to ask this question? Or was my question unclear?

I dug up some data, and it seems that whenever messages come in at a rate of 75,000 per hour, they start accumulating delays of up to 10 minutes. If the rate goes up to 100,000, delays grow to about 20 minutes. And at 300,000 in one hour, I get delays of up to 3 hours or so.

Typically, during an hour in which 250,000 messages were processed, around 10% of them were delayed by more than 20 minutes, and some by more than an hour.

Can anybody offer any insight? Do any of you see delays in inserts performed at this rate (250,000 per hour), meaning that the insert transaction takes more than 10 minutes? Is it a matter of fine-tuning the server? Note that at the same time there are other processes that also perform updates on the same tables, at about the same rate. So each of the tables gets written to about 500,000 times per hour. Each table normally contains around 2-3 million records and has 3 indexes.

Thank you,
Herouth

On 17/02/2014, at 18:45, Herouth Maoz wrote:

I have a production system using PostgreSQL 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

The queue never has more than 27,000 messages at the same time, and all together, the parallel modules process about 5000 or 6000 messages per minute. So basically, the delay for a single message should never be more than a few minutes. Even if one module gets stuck, another will grab the next message from the queue. I believe the only way for a message to be stuck for so long would be for it to be grabbed by a module, and then for the database write to be somehow delayed for an hour, although it's merely a simple insert performed with a prepared statement.

The database in production is very busy with millions of writes per hour. Could there be a situation in which a particular connection gets "starved" while other connections are able to run queries without noticeable delay?

How can I track such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each module and the average latency. None of the four modules running has a long average latency or a low number of messages processed, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

TIA,
Herouth



Re: How do I track down a possible locking problem?

From: Jeff Janes
On Mon, Feb 17, 2014 at 8:45 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
I have a production system using PostgreSQL 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

How do you know that?
 

The queue never has more than 27,000 messages at the same time, and all together, the parallel modules process about 5000 or 6000 messages per minute. So basically, the delay for a single message should never be more than a few minutes. Even if one module gets stuck, another will grab the next message from the queue. I believe the only way for a message to be stuck for so long would be for it to be grabbed by a module, and then for the database write to be somehow delayed for an hour, although it's merely a simple insert performed with a prepared statement.

The database in production is very busy with millions of writes per hour. Could there be a situation in which a particular connection gets "starved" while other connections are able to run queries without noticeable delay? 

If there is a delay like that, it would almost certainly be due to database locks that show up in pg_locks.
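
A sketch of such a check on 9.1 (where pg_stat_activity still exposes procpid and current_query rather than pid and query):

    -- sessions currently waiting on a lock, with what they are running
    SELECT l.pid,
           l.locktype,
           l.mode,
           now() - a.query_start AS waiting_for,
           a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
    WHERE NOT l.granted;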


But, I doubt that that is your problem.
 

How can I track such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each module and the average latency. None of the four modules running has a long average latency or a low number of messages processed, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

If the insert into PostgreSQL was freezing, wouldn't that time get reflected in your latency monitoring? 

It sounds to me like your application has a bug in its queue, where it forgets about items on the queue for a while. 

Cheers,

Jeff 

Re: How do I track down a possible locking problem?

From: Herouth Maoz

On 18/02/2014, at 19:02, Jeff Janes wrote:

On Mon, Feb 17, 2014 at 8:45 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
I have a production system using PostgreSQL 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

How do you know that?

The message gets a time stamp (from Java) as it goes into the system. This time stamp is written into the first table, in a field named "time_arrived". The second table's insert (that follows immediately after the first) has a date_inserted field, which also gets a Java time stamp. So basically date_inserted - time_arrived is the interval that passed between the time the message came into the system and the time of the second insertion. That is - the time that it spent in the queue, plus the time that it spent in the first insert.
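
As a sketch, assuming the two tables can be joined on a message id (the table and column names here are placeholders), the badly delayed messages can be listed with something like:

    -- list messages whose second insert came long after arrival;
    -- first_table, second_table and message_id are placeholder names
    SELECT s.message_id,
           f.time_arrived,
           s.date_inserted,
           s.date_inserted - f.time_arrived AS delay
    FROM first_table f
    JOIN second_table s USING (message_id)
    WHERE s.date_inserted - f.time_arrived > interval '20 minutes'
    ORDER BY delay DESC;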
 
 

The queue never has more than 27,000 messages at the same time, and all together, the parallel modules process about 5000 or 6000 messages per minute. So basically, the delay for a single message should never be more than a few minutes. Even if one module gets stuck, another will grab the next message from the queue. I believe the only way for a message to be stuck for so long would be for it to be grabbed by a module, and then for the database write to be somehow delayed for an hour, although it's merely a simple insert performed with a prepared statement.

The database in production is very busy with millions of writes per hour. Could there be a situation in which a particular connection gets "starved" while other connections are able to run queries without noticeable delay? 

If there is a delay like that, it would almost certainly be due to database locks that show up in pg_locks.


But, I doubt that that is your problem.
 

How can I track such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each module and the average latency. None of the four modules running has a long average latency or a low number of messages processed, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

If the insert into PostgreSQL was freezing, wouldn't that time get reflected in your latency monitoring? 

Not sure what you mean. I think the insert is freezing, and indeed it gets reflected in the time monitored. 


It sounds to me like your application has a bug in its queue, where it forgets about items on the queue for a while. 


Not likely. This application has been running with the same queue implementation since 2001, even before we started using PostgreSQL... All bugs in that particular code would have been eliminated long ago. The system consists of dozens of queues, and we have monitoring that alerts us to any queue where messages are left unattended. If there was such a bug, we'd see queues that never become empty.

But anyway, I'll set up a cron job that monitors the locks or the query lengths and see if I can come up with anything. Theoretically, I have a way of logging when each message is enqueued or dequeued from each queue in the system, but this would slow down the production system, so I'll only do that as a last resort.
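
Something along these lines is what I have in mind for the cron job (9.1 column names; the one-minute threshold is arbitrary):

    -- statements that have been running for over a minute
    SELECT a.procpid,
           now() - a.query_start AS runtime,
           a.waiting,
           a.current_query
    FROM pg_stat_activity a
    WHERE a.current_query <> '<IDLE>'
      AND now() - a.query_start > interval '1 minute'
    ORDER BY runtime DESC;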

Thank you for pointing me to the lock monitoring documentation.

Re: How do I track down a possible locking problem?

From: Jeff Janes

On Wed, Feb 19, 2014 at 2:40 AM, Herouth Maoz <herouth@unicell.co.il> wrote:

On 18/02/2014, at 19:02, Jeff Janes wrote:

On Mon, Feb 17, 2014 at 8:45 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
I have a production system using PostgreSQL 9.1.2.

The system basically receives messages, puts them in a queue, and then several parallel modules, each in its own thread, read from that queue, and perform two inserts, then release the message to the next queue for non-database-related processing.

Today, after we received complaints from a customer about delays, I noticed odd behavior in the system. About 2 percent of the messages were inserted into the tables more than an hour after they got into the system.

How do you know that?

The message gets a time stamp (from Java) as it goes into the system. This time stamp is written into the first table, in a field named "time_arrived". The second table's insert (that follows immediately after the first) has a date_inserted field, which also gets a Java time stamp. So basically date_inserted - time_arrived is the interval that passed between the time the message came into the system and the time of the second insertion. That is - the time that it spent in the queue, plus the time that it spent in the first insert.

So the steps are these?:

1. Message is received by your Java system.
2. Java calls gettimeofday and that time is saved locally, eventually going into the first inserted tuple's time_arrived.
3. Java does some substantial, undescribed stuff.
4. Java inserts the first tuple and commits.
5. Java calls gettimeofday and uses it for the second tuple's date_inserted, doing no other meaningful work in between.
6. Java inserts the second tuple and commits.



How can I track such locks down? Does anybody have any ideas other than starvation? The system lets me view statistics of how many messages were processed in each module and the average latency. None of the four modules running has a long average latency or a low number of messages processed, so I don't think the issue is related to any particular thread in my (Java) system being slow or locked away by the others.

If the insert into PostgreSQL was freezing, wouldn't that time get reflected in your latency monitoring? 

Not sure what you mean. I think the insert is freezing, and indeed it gets reflected in the time monitored. 

Wouldn't the average latency then be high, if the average was incorporating even just a few latencies of 3600 seconds?
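
Back-of-the-envelope: if even 2% of messages each spent ~3,600 seconds stuck, that alone would contribute about 0.02 * 3600 = 72 seconds to the average latency, which should be hard to miss.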
 



It sounds to me like your application has a bug in its queue, where it forgets about items on the queue for a while. 


Not likely. This application has been running with the same queue implementation since 2001, even before we started using PostgreSQL... All bugs in that particular code would have been eliminated long ago. The system consists of dozens of queues, and we have monitoring that alerts us to any queue where messages are left unattended. If there was such a bug, we'd see queues that never become empty.

But anyway, I'll set up a cron job that monitors the locks or the query lengths and see if I can come up with anything. Theoretically, I have a way of logging when each message is enqueued or dequeued from each queue in the system, but this would slow down the production system, so I'll only do that as a last resort.

You can also turn on "log_lock_waits" and set something like "log_min_duration_statement = 10000".  If there is a PostgreSQL problem, these settings should capture it in the PostgreSQL log file.  If there is no problem, they should not generate much log traffic on most systems.  (If you have some reporting processes that knowingly run slow queries, you can prevent them from polluting the log by using a ROLE for which log_min_duration_statement is disabled; it is a little convoluted because only a superuser can change that setting.)
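
Concretely, that would look something like this (reload the server after editing postgresql.conf; the role name is just an example):

    # postgresql.conf
    log_lock_waits = on                 # log waits longer than deadlock_timeout (1s by default)
    log_min_duration_statement = 10000  # log statements running 10 seconds or more

    -- run as superuser: exempt a known-slow reporting role
    ALTER ROLE reporting SET log_min_duration_statement = -1;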

Cheers,

Jeff

Re: How do I track down a possible locking problem?

From: Kevin Grittner
Herouth Maoz <herouth@unicell.co.il> wrote:

> I have a production system using PostgreSQL 9.1.2.

That's asking for trouble.  There have been many bugs fixed in 9.1
since 2011-12-05, including security vulnerabilities and (more to
the point) bugs which caused vacuum processes to interact poorly
with tables used as queues.  You should really drop in the latest
9.1 minor release.  It does not require any conversion of the data.

http://www.postgresql.org/support/versioning/
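
A quick way to confirm the exact minor version the server is running:

    SELECT version();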

> The database in production is very busy with millions of writes
> per hour. Could there be a situation in which a particular
> connection gets "starved" while other connections are able to run
> queries without noticeable delay?

If the issue has not been solved, you might want to read this page
and post to the pgsql-performance list, providing the suggested
information:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

In particular, the number of cores and the setting for
max_connections might suggest a possible cause.  But rather than
guessing in advance of the facts, I suggest monitoring data in the
queue to spot a lingering entry, and capturing the contents of
pg_stat_activity and pg_locks while that condition exists, along
with a couple minutes of output from `vmstat 1`.  That along with
the general information suggested on the above page may allow a
proper diagnosis.
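
For example, a capture script (run with psql while a message is known to be stuck; the output path is just an example) could be as simple as:

    -- capture.sql: \o redirects query output to a file
    \o /tmp/stuck_diagnostics.txt
    SELECT now();
    SELECT * FROM pg_stat_activity;
    SELECT * FROM pg_locks;
    \o

with the `vmstat 1` output collected alongside for a couple of minutes.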

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company