Thread: errors with high connections rate

errors with high connections rate

From
Pawel Veselov
Date:
Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro).
The application writes data at a high rate (at this point it's 500 transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" which are then written out to the DB. There is no connection pool; instead, each worker thread maintains its own connection that it uses to write data to the database. The connections are kept in pthread thread-specific data blocks.

Each thread connects to the DB when it receives its first work message, or when the connection's "error" flag is set. The error flag is set any time there is any error running a database statement.
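
In rough outline, each worker's connection handling looks like this (a simplified illustration rather than the real code; the worker_info/get_worker_db names are made up for this sketch, and worker_key is assumed to have been created once with pthread_key_create()):

    #include <libpq-fe.h>
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        PGconn *pg_conn;
        int     error;                 /* set whenever a statement fails */
    } worker_info;

    static pthread_key_t worker_key;   /* created once with pthread_key_create() */

    static worker_info *get_worker_db(const char *conn_str)
    {
        worker_info *wi = pthread_getspecific(worker_key);

        if (wi == NULL) {
            wi = calloc(1, sizeof(*wi));
            pthread_setspecific(worker_key, wi);
        }
        /* connect on the first work message, or reconnect after an error */
        if (wi->pg_conn == NULL || wi->error) {
            if (wi->pg_conn != NULL)
                PQfinish(wi->pg_conn);
            wi->pg_conn = PQconnectdb(conn_str);
            wi->error = (PQstatus(wi->pg_conn) != CONNECTION_OK);
        }
        return wi;
    }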

When the work is "slow" (~250 messages per second), I don't see any problem. After I increased the load, whenever I restart the process the threads start grabbing work at a high enough rate that each one immediately opens its first connection to the database, and these errors start popping up:

Can't connect to DB: could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

    /* conn_str is "" here, so libpq uses its defaults and connects
     * through the local Unix-domain socket */
    wi->pg_conn = PQconnectdb(conn_str);
    ConnStatusType cst = PQstatus(wi->pg_conn);

    if (cst != CONNECTION_OK) {
        ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
    }

Eventually, the errors go away (when worker threads fail to connect, they just pass the message to another thread, wait for their turn, and try reconnecting again), so the remedy does seem to be simply spreading the connection attempts out in time.

The connection string is '' (empty); the connection is made through /tmp/.s.PGSQL.5432.

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce it with 200 or fewer, but have seen the errors with 300 and more)
2) the load is reduced

-- problem 2 --

When I try to debug this with strace, I can never reproduce it, at least not enough to see what's going on, but sometimes I get another error: "too many users connected". Even restarting the postmaster doesn't help. The postmaster is running with -N810, and the role has a connection limit of 1000. Yet the "too many" error starts creeping up only after 275 connections are opened (counted by successful connect() calls in strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c, and I'm wondering if there is a potential race condition between poll() and the socket actually finishing the connection. When running under strace, I never see EINPROGRESS returned from connect(), and the only reason sendto() would result in ENOTCONN is if the connect didn't actually finish while the socket was deemed "connected" by poll()/getsockopt()...
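
To illustrate what I mean, here is my own sketch of the usual non-blocking connect pattern (not the actual fe-connect.c code): the socket is put in non-blocking mode, connect() is started, and poll() plus getsockopt(SO_ERROR) decide when it is "connected". If that check races with the connection actually completing, a later send of the startup packet could plausibly fail with ENOTCONN.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <poll.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    static int connect_unix_nonblocking(const char *path)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        fcntl(fd, F_SETFL, O_NONBLOCK);

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
            if (errno != EINPROGRESS && errno != EAGAIN) {
                close(fd);
                return -1;
            }
            /* connect is in progress: wait for writability, then confirm
             * the result with SO_ERROR before trying to send anything */
            struct pollfd pfd = { fd, POLLOUT, 0 };
            int err = 0;
            socklen_t errlen = sizeof(err);

            if (poll(&pfd, 1, 5000) <= 0 ||
                getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 ||
                err != 0) {
                close(fd);
                return -1;
            }
        }
        return fd;   /* only now is it safe to send the startup packet */
    }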

Thanks,
  Pawel.

Re: errors with high connections rate

From
Craig Ringer
Date:
On 07/03/2012 03:19 PM, Pawel Veselov wrote:
Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro).
The application writes data at a high rate (at this point it's 500 transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" which are then written out to the DB. There is no connection pool; instead, each worker thread maintains its own connection that it uses to write data to the database. The connections are kept in pthread thread-specific data blocks.

Hmm. To get that kind of TPS with that design, are you running with fsync=off, or on storage that claims to flush I/O without actually doing so? Have you checked your crash safety? Or is it just fairly big hardware?

Why are you using so many connections? Unless you have truly monstrous hardware, your system should achieve considerably greater throughput by reducing the connection count and queueing bursts of writes. You wouldn't even need an external pool in your case; just switch to a producer/consumer model where your accepting threads hand work off to a separate, much smaller set of writer threads that send it to the DB. Writer threads could then do useful optimisations like multi-value inserts or COPYing data, doing small batches in transactions, etc.
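
Something along these lines, for instance. This is only an untested sketch of the shape I mean; the work_item type, the queue sizes and the per-item SQL are placeholders rather than anything from your actual application:

    /* Untested sketch: many accepting threads push work onto one bounded
     * queue; a small fixed pool of writer threads, each owning a single
     * libpq connection, drains it in batched transactions. */
    #include <libpq-fe.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QUEUE_MAX  10000
    #define BATCH_SIZE 100

    typedef struct { char *sql; } work_item;      /* placeholder work unit */

    static work_item       queue[QUEUE_MAX];
    static int             q_head, q_count;
    static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_nonfull  = PTHREAD_COND_INITIALIZER;

    /* called by the (many) accepting/processing threads */
    void enqueue(work_item item)
    {
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_MAX)
            pthread_cond_wait(&q_nonfull, &q_lock);
        queue[(q_head + q_count) % QUEUE_MAX] = item;
        q_count++;
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
    }

    /* each writer thread owns exactly one connection and writes batches */
    void *writer_thread(void *arg)
    {
        PGconn *conn = PQconnectdb("");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return NULL;
        }

        for (;;) {
            work_item batch[BATCH_SIZE];
            int n = 0, i;

            pthread_mutex_lock(&q_lock);
            while (q_count == 0)
                pthread_cond_wait(&q_nonempty, &q_lock);
            while (q_count > 0 && n < BATCH_SIZE) {
                batch[n++] = queue[q_head];
                q_head = (q_head + 1) % QUEUE_MAX;
                q_count--;
            }
            pthread_cond_broadcast(&q_nonfull);
            pthread_mutex_unlock(&q_lock);

            /* one transaction per batch instead of one per message */
            PQclear(PQexec(conn, "BEGIN"));
            for (i = 0; i < n; i++) {
                PGresult *res = PQexec(conn, batch[i].sql);
                if (PQresultStatus(res) != PGRES_COMMAND_OK)
                    fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
                PQclear(res);
                free(batch[i].sql);
            }
            PQclear(PQexec(conn, "COMMIT"));
        }
        return NULL;
    }

Each writer keeps one long-lived connection, and committing a batch at a time amortises the per-transaction cost.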

I'm seriously impressed that your system is working under load at all with 800 concurrent connections fighting to write all at once.


Can't connect to DB: could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

Is the postmaster forking and failing because of operating system resource limits like max proc count, anti-fork-bomb measures, max file handles, etc.?

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce it, at least to see what's going on, but sometimes I get another error : "too many users connected". Even restarting postmaster doesn't help. The postmaster is running with -N810, and the role has connection limit of 1000. Yet, the "too many" error starts creeping up only after 275 connections are opened (counted by successful connect() from strace).

Any idea where should I dig?
See how many connections the *server* thinks exist by examining pg_stat_activity.

Check dmesg and the PostgreSQL server logs to see if you're hitting operating system limits. Look for fork() failures, unexplained segfaults, etc.

--
Craig Ringer

Re: errors with high connections rate

From
John R Pierce
Date:
On 07/03/12 12:34 AM, Craig Ringer wrote:
> I'm seriously impressed that your system is working under load at all
> with 800 concurrent connections fighting to write all at once.

indeed, in my transactional benchmarks on a 12-core, 24-thread dual xeon
x5600-class system, with 16 or 20 spindle raid10, I find somewhere
around 50 to 80 database connection threads gives the highest overall
throughput (several thousand OLTP transactions/second).    this hardware
has vastly better IO and CPU performance than any AWS virtual machine.


as craig suggested, your network threads could put the incoming requests
into queue(s), and run a tunable number of database connection threads
that take requests out of the queue and send them to the database, and
if necessary, return results to the network thread.   doing this will
give better CPU utilization; you can try different database worker
thread counts until you hit the optimal number for your hardware.



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: errors with high connections rate

From
"Pawel S. Veselov"
Date:
On 07/03/2012 12:54 AM, John R Pierce wrote:
> On 07/03/12 12:34 AM, Craig Ringer wrote:
>> I'm seriously impressed that your system is working under load at all
>> with 800 concurrent connections fighting to write all at once.
>
> indeed, in my transactional benchmarks on a 12-core, 24-thread dual
> xeon x5600-class system, with 16 or 20 spindle raid10, I find
> somewhere around 50 to 80 database connection threads gives the highest
> overall throughput (several thousand OLTP transactions/second).
> this hardware has vastly better IO and CPU performance than any AWS
> virtual machine.
>
>
> as craig suggested, your network threads could put the incoming
> requests into queue(s), and run a tunable number of database
> connection threads that take requests out of the queue and send them
> to the database, and if necessary, return results to the network
> thread.   doing this will give better CPU utilization; you can try
> different database worker thread counts until you hit the optimal number
> for your hardware.
>
Just to clear the air on this, this is almost exactly what I'm doing.
The number of 800 came out of experimenting with numbers (I'm sure it
took you some time to find the optimum of 50-80 for your configuration).
The number of "worker" threads are configurable, and they do receive
their work from a shared queue. By the way, on the operations that I'm
doing, postgres is performing very well, with average of less than 10ms
per transaction, with throughput of times over 600 tps.

However, writing data to postgres is not the only thing I need to do to
process the data. If the processing time rises for other reasons, a low
number of threads may not be able to keep up with the constant stream of
incoming data, and I have to raise the worker thread count to
compensate. As I was doing this, I ran into the problem described in the
original email, and it puzzled me. However, just because I opened 800
connections doesn't mean that all of the connections are being
actively used concurrently (so not that much fighting). I should indeed
switch to a connection pool model in such a case, just to not over-fork
postgres; however, I don't see that postgres is consuming any
significant amount of system resources for the forked server processes.

Thank you,
   Pawel.


Re: errors with high connections rate

From
"Pawel S. Veselov"
Date:
On 07/03/2012 12:34 AM, Craig Ringer wrote:
On 07/03/2012 03:19 PM, Pawel Veselov wrote:
Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro).
The application writes data at a high rate (at this point it's 500 transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" which are then written out to the DB. There is no connection pool; instead, each worker thread maintains its own connection that it uses to write data to the database. The connections are kept in pthread thread-specific data blocks.

[skipped, replied to separately]


Can't connect to DB: could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

Is the postmaster forking and failing because of operating system resource limits like max proc count, anti-fork-bomb measures, max file handles, etc.?

If accept() succeeded and fork() failed, the socket would be closed by the process (the parent would close it, and the child's copy of the socket wouldn't even exist), so wouldn't that result in ECONNRESET rather than ENOTCONN?


-- problem 2 --

When I try to debug this with strace, I can never reproduce it, at least not enough to see what's going on, but sometimes I get another error: "too many users connected". Even restarting the postmaster doesn't help. The postmaster is running with -N810, and the role has a connection limit of 1000. Yet the "too many" error starts creeping up only after 275 connections are opened (counted by successful connect() calls in strace).

Any idea where I should dig?
See how many connections the *server* thinks exist by examining pg_stat_activity.

Check dmesg and the PostgreSQL server logs to see if you're hitting operating system limits. Look for fork() failures, unexplained segfaults, etc.

That's the thing, no segfaults (dmesg), nothing in the server logs.

It may well be some sort of anti-fork-bomb measure, judging only by the fact that with enough attempts things do clear out, though I wish there were some indication of that, and I'm still confused about the error code being ENOTCONN.

Re: errors with high connections rate

From
"Kevin Grittner"
Date:
John R Pierce
> On 07/03/12 12:34 AM, Craig Ringer wrote:
>> I'm seriously impressed that your system is working under load at
>> all with 800 concurrent connections fighting to write all at once.
>
> indeed, in my transactional benchmarks on a 12-core, 24-thread dual
> xeon x5600-class system, with 16 or 20 spindle raid10, I find
> somewhere around 50 to 80 database connection threads gives the
> highest overall throughput (several thousand OLTP
> transactions/second). this hardware has vastly better IO and CPU
> performance than any AWS virtual machine.
>
>
> as craig suggested, your network threads could put the incoming
> requests into queue(s), and run a tunable number of database
> connection threads that take requests out of the queue and send
> them to the database, and if necessary, return results to the
> network thread. doing this will give better CPU utilization; you
> can try different database worker thread counts until you hit the
> optimal number for your hardware.

+1

We (at the Wisconsin courts) have definitely found that the best
model for us is to have a separate layer for running database
transactions, with one thread per database connection and each of
those threads pulling from a prioritized FIFO queue into which
*other* layers place requests.

This comes up so often that I threw together a Wiki page for it:

http://wiki.postgresql.org/wiki/Number_Of_Database_Connections

Of course, everyone should feel free to improve the page.

-Kevin

Re: errors with high connections rate

From
Craig Ringer
Date:
On 07/03/2012 04:26 PM, Pawel S. Veselov wrote:

> That's the thing, no segfaults (dmesg), nothing in the server logs.
>
> It may well be some sort of anti-fork-bomb measure, judging only by
> the fact that with enough attempts things do clear out, though I
> wish there were some indication of that, and I'm still confused
> about the error code being ENOTCONN.
>

I've managed to produce the endpoint not connected errors with a little
test I wrote here. Only once so far and only during an abnormal test run
where I signalled the test workers as they were starting up, so that's
not really very helpful.

I have no problem using a little Python test program to create 800
connections in about a second. It forks some workers (100 by default)
which grab enough connections each to reach the target connection count.
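
For what it's worth, the shape of the test is roughly this (a C sketch of the same idea, not the actual script, which is Python):

    /* Fork N workers; each opens its share of the target connection count
     * as fast as it can and then holds the connections open for a while. */
    #include <libpq-fe.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define WORKERS 100
    #define TARGET  800

    int main(void)
    {
        int per_worker = TARGET / WORKERS;

        for (int w = 0; w < WORKERS; w++) {
            if (fork() == 0) {
                PGconn *conns[per_worker];
                for (int i = 0; i < per_worker; i++) {
                    conns[i] = PQconnectdb("");
                    if (PQstatus(conns[i]) != CONNECTION_OK)
                        fprintf(stderr, "worker %d: %s", w,
                                PQerrorMessage(conns[i]));
                }
                sleep(10);                      /* hold the connections open */
                for (int i = 0; i < per_worker; i++)
                    PQfinish(conns[i]);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                                   /* reap the workers */
        return 0;
    }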

Ooh, handy. I just triggered it again now. The "Transport endpoint is
not connected" messages were intermixed with some "FATAL:  sorry, too
many clients already" messages. The PostgreSQL log is full of "FATAL:
sorry, too many clients already" messages intermixed with "LOG:
unexpected EOF on client connection" messages. Again it was an abnormal
run where I signalled my workers midway through startup.

Interesting, that. I've never seen it on a run where I don't send a
signal. You know what that makes me think? You're using a multithreaded
approach, and there's something going wrong in your app's innards. Yes,
that's a lot of hot air and handwaving, but it fits: you're getting an
error saying that libpq is trying to operate on a socket that isn't there.

The fact that there's nothing in the system logs or Pg logs just adds
weight to that. I'm guessing you have a threading bug, possibly signal
related.

--
Craig Ringer

Re: errors with high connections rate

From
Craig Ringer
Date:
Here's the test program, btw:

https://github.com/ringerc/scrapcode/tree/master/scripts/pg_forktest

pgfork.py is a home-rolled fork() horror.

pg_mp.py is the same thing done with Python's multiprocessing module.

--
Craig Ringer