Re: general PG network slowness (possible cure) (repost) - Mailing list pgsql-performance

From Peter T. Breuer
Subject Re: general PG network slowness (possible cure) (repost)
Date
Msg-id 200705251323.l4PDNIY30138@inv.it.uc3m.es
Whole thread Raw
In response to general PG network slowness (possible cure) (repost)  ("Peter T. Breuer" <ptb@inv.it.uc3m.es>)
List pgsql-performance
"Also sprach Kenneth Marshall:"
> > Surprise, ...  I got a speed up of hundreds of times.  The same application
> > that crawled under my original rgdbm implementation and under PG now
> > maxed out the network bandwidth at close to a full 10Mb/s and 1200
> > pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> > server.
> >
> > So
> >
> >   * Is that what is holding up postgres over the net too?  Lots of tiny
> >     packets?
>
>
> This effect is very common, but you are in effect altering the query/

I imagined so, but no, I am not changing the behaviour - I believe you
are imagining something different here.  Let me explain.

It is usually the case that drivers and the network layer conspire to
emit packets when they are otherwise idle, since they have nothing
better to do.  That is, if the transmission unit is the normal 1500B and
there is 200B in the transmission buffer and nothing else is frisking
them about the chops, something along the line will shrug and say, OK,
I'll just send out a 200B fragment now, apologize, and send out another
fragment later if anything else comes along for me to chunter out.

It is also the case that drivers do the opposite ... that is, they do
NOT necessarily send a packet out the moment they have one ready, even
if they have a full 1500B worth. Why? Well, on GigE for sure, and on
100BT most of the time, it doesn't pay to send out individual packets,
because the inter-packet gap is relatively too great to permit the
network to work at that speed, given the speed of light as it is and
the spacing it implies between packets (I remember when I advised the
networking protocol people, about six years ago, that GigE was a coming
thing, they all protested and said it was _physically_ impossible. It
is, if you send packets one by one!).  An ethernet line is fundamentally
only electrical and only signals up or down (relative) and needs time to
quiesce. And then there's the busmastering .. a PCI bus is only about
33MHz, and 32 bits wide (well, or 16 on portables, or even 64, but
you're getting into heavy server equipment then).  That's 128MB/s in
one direction, and any time one releases the bus there's a re-setup time
that costs the earth and will easily lower bandwidth by 75%. So drivers
like to take the bus for a good few packets at a time. Even a single
packet (1500B) will take 400 multi-step bus cycles to get to the
card, and then it's a question of how much onboard memory it has or
whether one has to drive it synchronously. Most cards have something
like a 32-unit ring buffer, and I think each unit is considerable.

Now, if a driver KNOWS what's coming then it can alter its behavior in
order to mesh properly with the higher level layers. What I did was
_tell_ the driver and the protocol not to send any data until I well
and truly tell it to, and then told it to, when I was ready.  The result
is that a full communication unit (start, header, following data, and
stop codon) was sent in one blast.

That meant that there were NO tiny fragments clogging up the net, being
sent willy-nilly. And it also meant that the driver was NOT waiting for
more info to come in before getting bored and sending out what it had.
It did as I told it to.

The evidence from monitoring the PG network throughput is that 75% of its
packets are in the 64-128B range, including tcp header. That's hitting
the 100Kb/s (10KB/s) bandwidth regime on my network at the lower end.
It will be even _worse_ on a faster net, I think (feel free to send me a
faster net to compare with :).

I also graphed latency, but I haven't analysed those results yet, as
the bandwidth measurements were so striking.

> response behavior of the database. Most applications expect an answer
> from the database after every query.

Well of course. Nothing else would work! (I imagine you have some kind
of async scheme, but I haven't investigated). I ask, the db replies. I
ask, the db replies. What I did was

  1) made the ASK go out as one lump.
  2) made the REPLY go out as one lump
  3) STOPPED the card waiting for several replies or asks to accumulate
     before sending out anything at all.

> If it could manage retrying failed
> queries later, you could use the typical sliding window/delayed ack
> that is so useful in improving the bandwidth utilization of many network

That is not what is going on (though that's not a bad idea). See
above for the explanation. One has to take into account the physical
hardware involved and its limitations, and arrange the communications
accordingly. All I did was send EACH query and EACH response as a
single unit, at the hardware level.

One could do better still by managing _several_ threads communications
at once.

> programs. Maybe an option in libpq to tell it to use delayed "acks". I
> do not know what would be involved.

Nothing spectacular is required to see a considerable improvement, I
think, apart from a little direction from the high-level protocol down
to the driver about where the communication boundaries are.  1000%
speedup in my case.

Now, where is the actual socket send done in the pg code? I'd like to
check what's happening in there.




Peter
