Re: Replication on the backend - Mailing list pgsql-hackers

From J. Andrew Rogers
Subject Re: Replication on the backend
Msg-id 3FA96BF2-6A09-48FD-9695-381AC1513A10@neopolitan.com
In response to Re: Replication on the backend  (Markus Schiltknecht <markus@bluegap.ch>)
List pgsql-hackers
On Dec 6, 2005, at 11:42 PM, Markus Schiltknecht wrote:
> Does anybody have latency / roundtrip measurements for current  
> hardware?
> I'm interested in:
>     1Gb Ethernet,
>     10 Gb Ethernet,
>     InfiniBand,
>     probably even p2p usb2 or firewire links?


In another secret life, I know a bit about supercomputing fabrics.   
The latency metrics have to be thoroughly qualified.

First, most of the RTT latency numbers quoted for network fabrics are for zero-byte packets, which do not really apply to anyone shuffling real data around.  For small packets, high-performance fabrics (HTX
Infiniband, Quadrics, etc) have approximately an order of magnitude  
less latency than vanilla Ethernet, though the performance specifics  
depend greatly on the actual usage.  For large packet sizes, the  
differences in latency become far less obvious.  However, for "real"  
packets a performant fabric will still look very good compared to  
disk systems.  Switched fiber fabrics have enough relatively  
inexpensive throughput now to saturate most disk systems and CPU I/O buses; only platforms like HyperTransport can really keep up.  It is
worth pointing out that the latency of high-end network fabrics is  
similar to large NUMA fabrics, which exposes some of the limits of  
SMP scalability.  As a point of reference, an organization that knows what it is doing should have no problem getting 500-microsecond RTT on a vanilla metropolitan-area GigE fiber network -- a few
network operators actually do deliver this on a regional scale.  For  
a local cluster, a competent design can best this by orders of  
magnitude.
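As a rough illustration of the kind of roundtrip measurement Markus asked about, here is a minimal TCP ping-pong sketch (not from the thread; loopback only, so it shows the technique and the payload-size effect rather than real fabric numbers -- for GigE or Infiniband you would run the two ends on separate hosts):

```python
# Hypothetical sketch: measure mean TCP round-trip time for several payload
# sizes using a ping-pong echo loop. On loopback this mostly measures the
# kernel stack; run client and server on separate hosts to measure a fabric.
import socket
import threading
import time

def _recv_exact(conn, n):
    # Read exactly n bytes or raise if the peer closes early.
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed")
        data += chunk
    return data

def _echo_server(listener):
    # Echo each length-prefixed message back until the client disconnects.
    conn, _ = listener.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    with conn:
        try:
            while True:
                header = _recv_exact(conn, 4)
                size = int.from_bytes(header, "big")
                conn.sendall(header + _recv_exact(conn, size))
        except ConnectionError:
            pass

def measure_rtt(payload_size, rounds=100):
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=_echo_server, args=(listener,), daemon=True).start()

    client = socket.socket()
    client.connect(listener.getsockname())
    # Disable Nagle so small packets are not coalesced/delayed.
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    header = payload_size.to_bytes(4, "big")
    payload = b"\0" * payload_size
    start = time.perf_counter()
    for _ in range(rounds):
        client.sendall(header + payload)
        _recv_exact(client, 4 + payload_size)
    elapsed = time.perf_counter() - start

    client.close()
    listener.close()
    return elapsed / rounds  # mean RTT in seconds

if __name__ == "__main__":
    for size in (0, 64, 1024, 65536):
        print(f"{size:6d} bytes: {measure_rtt(size) * 1e6:8.1f} us RTT")
```

Averaging over many rounds and using a monotonic clock matters here; a single send/recv pair is well below timer resolution on a fast fabric.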

There are a number of silicon limitations, but a system that connects  
the fabric directly to HyperTransport can drive several GB/s with  
very respectable microsecond latencies if the rest of the system is  
up to it.  There are Opteron system boards now that will drive  
Infiniband directly from HyperTransport.  I know Arima/Rioworks makes  
some (great server boards generally), and several other companies are either making them or have them in the pipeline.  These
Opteron boards get pretty damn close to Big Iron SMP fabric  
performance in a cheap package.  Given how many companies have  
announced plans to produce Opteron server boards with Infiniband  
fabrics directly integrated into HyperTransport, I would say that  
this is the future of server boards.

And if postgres could actually use an infiniband fabric for  
clustering a single database instance across Opteron servers, that  
would be very impressive...

J. Andrew Rogers



