Re: [psycopg] Turbo ODBC - Mailing list psycopg

From: Jim Nasby
Subject: Re: [psycopg] Turbo ODBC
In response to: Re: [psycopg] Turbo ODBC (Adrian Klaver <adrian.klaver@aklaver.com>)

On 1/15/17 3:25 PM, Adrian Klaver wrote:
> Have you looked at asyncpg:

I'm pulling in Matt Rocklin, who I've been working with on some plpython
improvements. Here are his off-list replies to me:

> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>> You might also contact Wes McKinney and check out
>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>
>> This project was designed for inter-system data interchange.  It would
>> help with conversion to Spark and other popular tools as well.
>>
>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io
>> <mailto:mrocklin@continuum.io>> wrote:
>>
>>     I haven't taken a look at asyncpg.
>>
>>     Quick note: you might want to be careful about mixing up HDFS and
>>     HDF5.  Postgres's competition for data storage is HDF5, not HDFS
>>     (which stands for the Hadoop Distributed File System).
>>
>>     I still think that the best thing to do here would be to dump out
>>     python arrays for each of the columns in a result set.  I suspect
>>     that you could beat /any/ system that doesn't do this handily.  This
>>     would avoid any performance pain of building up Python objects, and
>>     you could continue to use just the standard library.
>>
>>     You can stick to Pure Python and still send arrays encoded with
>>     bytes.  We can convert these to NumPy or Pandas trivially with zero
>>     overhead.
>>
>>     In [1]: from array import array   # this is in the standard library
>>     In [2]: b = b'\x00' * 16  # 16 zero bytes as a bytestring
>>     In [3]: a = array('L', b) # 'L' is 8 bytes on most 64-bit platforms
>>     In [4]: a
>>     Out[4]: array('L', [0, 0])
>>
>>     We would only be bound by
>>
>>      1. The cost in postgres to transpose the data from row-major to
>>         column-major
>>      2. The cost to move bytes across a socket
>>
>>     This should run at gigabytes per second.  Everything else is likely
>>     to be competing in the 50-100 MB/s range at top speed, is my guess.
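For illustration, here's a minimal sketch of the zero-overhead conversion
Matthew describes (the variable names and data are made up): once a column
arrives as raw bytes, NumPy can view it without copying, and a Pandas
Series can wrap the same buffer.

    from array import array

    import numpy as np
    import pandas as pd

    raw = b'\x00' * 16            # 16 bytes, as they might arrive off the wire
    col = array('Q', raw)         # stdlib only: two 8-byte unsigned ints
    np_col = np.frombuffer(raw, dtype=np.uint64)  # zero-copy view of the bytes
    pd_col = pd.Series(np_col)    # Series wraps the same ndarray, no copy
    print(list(col), np_col, pd_col.tolist())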

My guess is that the big gains for external communication would come from:

1) reducing latency effects
2) reducing data marshaling

1 could be accomplished in various ways, and some changes to the FE-BE
protocol might be helpful. If we can demonstrate a clear win we might be
able to get protocol changes. I do suspect this needs to be tunable,
though. My 4x improvement to plpy.execute came from changing the
equivalent of one large batch into single-row operations. Moving data to
Python over a local filesystem socket would be hurt by a large batch
size, while something on a high-latency connection would presumably
benefit from moderately large batches, since that reduces overall TCP
overhead.
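
To make that batch-size knob concrete, here's a sketch using psycopg2's
server-side (named) cursors, where itersize sets how many rows are
fetched per round trip; the DSN and table are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder DSN

    # A named cursor is server-side: rows stream to the client in
    # batches of itersize rather than all at once.
    cur = conn.cursor(name='stream')
    cur.itersize = 2000   # larger on high-latency links, smaller locally
    cur.execute("SELECT id, val FROM big_table")   # hypothetical table

    for row in cur:       # psycopg2 fetches itersize rows per round trip
        pass              # process each row here

    conn.close()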

I think 2 would necessitate FE-BE changes, although some of that could
be tested without them. Right now, everything is going to get marshaled
into a simple text format (i.e., ints become a string of digits), but
for test purposes there are ways you could run that through the binary
output functions. In the case of int4, you'd get 4 bytes in network
order, which should be faster to handle and could be passed directly to
something like ndarray.
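
As a rough sketch of what consuming that binary output could look like
(assuming you already had the concatenated network-order bytes in hand;
the buffer here is fabricated with struct):

    import struct

    import numpy as np

    # Four int4 values as binary output would produce them:
    # 4 bytes each, network (big-endian) byte order.
    buf = struct.pack('!4i', 1, 2, 3, 4)

    # '>i4' is a big-endian 4-byte signed int; frombuffer is zero-copy.
    ints = np.frombuffer(buf, dtype='>i4')
    print(ints)                    # [1 2 3 4]
    print(ints.astype(np.int32))   # native-endian copy, if needed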

The two problems you run into with the array type are handling NULLs and
building columns from rows. You could allow the option of specifying
that NULLs won't be allowed in specific fields, or you could use
something like a Pandas Series that provides other ways of handling the
equivalent of NULL.
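
For example (a sketch with made-up data), Pandas can represent NULLs
either as NaN in a float column or, in newer versions, as a distinct
missing value via the nullable 'Int64' extension dtype:

    import pandas as pd

    values = [1, 2, None, 4]   # a column containing a NULL

    # Classic representation: NULL becomes NaN, forcing a float dtype.
    s_float = pd.Series(values, dtype='float64')

    # Nullable integer dtype (pandas >= 0.24): the NULL stays a distinct
    # missing value and the column keeps an integer type.
    s_int = pd.Series(values, dtype='Int64')

    print(s_float.tolist())   # [1.0, 2.0, nan, 4.0]
    print(s_int.tolist())     # [1, 2, <NA>, 4]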

For the row-to-column conversion, there's simply no way to get around
the pattern of appending items one at a time to an array-like container
(list, ndarray, Series, etc.) and dealing with the memory allocation
problem. I've looked at Python's list code, and it's certainly no worse
at that than what Postgres would do. The big reason I would do that work
in Python, though, is that you then get the ability to use any number of
types for the container; they just have to support the equivalent of
append().
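
A sketch of that pattern with stand-in rows: any container exposing
append() works, so the column type can be chosen per field.

    from array import array

    rows = [(1, 'a'), (2, 'b'), (3, 'c')]   # stand-in result set

    # One append()-capable container per column: an array for the int
    # column, a plain list for the text column.
    cols = (array('q', ()), [])
    for row in rows:
        for col, value in zip(cols, row):
            col.append(value)

    print(cols[0])   # array('q', [1, 2, 3])
    print(cols[1])   # ['a', 'b', 'c']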
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

