Re: [psycopg] Turbo ODBC - Mailing list psycopg

From: Wes McKinney
Subject: Re: [psycopg] Turbo ODBC
Msg-id: CAJPUwMA2gnH+ebzGpWfKJL1xa5r2VFcdHUibK_aiLLccDQPc6g@mail.gmail.com
In response to: [psycopg] Turbo ODBC  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List: psycopg
Adding Uwe Korn -- he's been working on adding C++-level Arrow support
to TurbODBC:

https://github.com/blue-yonder/turbodbc/pull/26

We're using Arrow data structures as an efficient (column-oriented)
intermediary en route to pandas -- you could also implement normal
Python tuple iteration on top of Arrow record batches.
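
(For illustration, a rough sketch of both paths using the pyarrow API -- the
exact function names here are from current pyarrow and may not match what the
turbodbc branch uses:)

    import pyarrow as pa

    # Build a small column-oriented record batch
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])],
        names=['id', 'name'])

    # Hand the columns to pandas
    df = batch.to_pandas()

    # Plain Python tuple iteration layered on top of the same batch
    rows = list(zip(*(col.to_pylist() for col in batch.columns)))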

The main Python-compatible Arrow implementation is in C++11, so if you
wanted to use it from C you'd need to do a little work to create C
bindings.

On Mon, Jan 16, 2017 at 6:24 PM, Matthew Rocklin <mrocklin@continuum.io> wrote:
> CC'ing Wes McKinney in case he's interested in this problem (how do we
> efficiently move data from Postgres into Python).
>
> On Mon, Jan 16, 2017 at 6:14 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 1/15/17 3:25 PM, Adrian Klaver wrote:
>>>
>>> Have you looked at asyncpg:
>>
>>
>> I'm pulling Matt Rocklin in, who I've been working with on some plpython
>> improvements. Here are his off-list replies to me:
>>
>>> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>>>>
>>>> You might also contact Wes McKinney and check out
>>>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>>>
>>>> This project was designed for inter-system data interchange.  It would
>>>> help with conversion to Spark and other popular tools as well.
>>>>
>>>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io
>>>> <mailto:mrocklin@continuum.io>> wrote:
>>>>
>>>>     I haven't taken a look at asyncpg.
>>>>
>>>>     Quick note: you might want to be careful about mixing up HDFS and
>>>>     HDF5.  Postgres's competition for data storage is HDF5, not HDFS
>>>>     (which stands for the Hadoop Distributed File System).
>>>>
>>>>     I still think that the best thing to do here would be to dump out
>>>>     Python arrays for each of the columns in a result set.  I suspect
>>>>     that you could handily beat /any/ system that doesn't do this.  This
>>>>     would avoid the performance pain of building up Python objects, while
>>>>     still using just the standard library.
>>>>
>>>>     You can stick to Pure Python and still send arrays encoded with
>>>>     bytes.  We can convert these to NumPy or Pandas trivially with zero
>>>>     overhead.
>>>>
>>>>     In [1]: from array import array   # this is in the standard library
>>>>     In [2]: b = b'\x00' * 16  # 16 zeroes as a bytestring
>>>>     In [3]: a = array('L', b) # eight byte unsigned integer
>>>>     In [4]: a
>>>>     Out[4]: array('L', [0, 0])
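>>>>
>>>>     (For illustration -- the same buffer viewed through NumPy, assuming
>>>>     NumPy is available; np.frombuffer gives a zero-copy view of the
>>>>     array's memory:)
>>>>
>>>>     In [5]: import numpy as np
>>>>     In [6]: np.frombuffer(a, dtype=np.uint64)  # reuses the bytes, no copy
>>>>     Out[6]: array([0, 0], dtype=uint64)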
>>>>
>>>>     We would only be bound by
>>>>
>>>>      1. The cost in Postgres to transpose the data from row-major to
>>>>         column-major
>>>>      2. The cost to move bytes across a socket
>>>>
>>>>     This should run at gigabytes per second.  Everything else is likely
>>>>     to be competing in the 50-100 MB/s range at top speed, is my guess.
>>
>>
>> My guess is that the big gains for external communication would come from:
>>
>> 1) reducing latency effects
>> 2) reducing data marshaling
>>
>> 1 could be accomplished in various ways, and some changes to the FE-BE
>> protocol might be helpful. If we can demonstrate a clear win we might be
>> able to get protocol changes. I do suspect this needs to be tunable,
>> though. My 4x improvement to plpy.execute comes from changing from the
>> equivalent of a large batch to single-row operation. Moving data to Python
>> over a local filesystem socket would be hurt by a large batch size, while
>> something on a high-latency connection would presumably benefit from
>> moderately large batches, since that reduces overall TCP overhead.
>>
>> I think 2 would necessitate FE-BE changes, although some of that could be
>> tested without the changes. Right now, everything is going to get marshaled
>> into a simple text format (i.e. ints become a string of digits), but for
>> test purposes there are ways you could run that through the binary output
>> functions. In the case of int4, you'd get 4 bytes in network order, which
>> should be faster to handle and could be passed directly to something like
>> an ndarray.
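>>
>> (A rough sketch of that last step, with made-up column bytes and assuming
>> NumPy is available -- network order is big-endian, so the dtype is '>i4':)
>>
>>     import numpy as np
>>     raw = b'\x00\x00\x00\x01\x00\x00\x00\x02'  # two int4 values in network byte order
>>     col = np.frombuffer(raw, dtype='>i4')      # zero-copy view -> values 1, 2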
>>
>> The two problems you run into with the array type are handling NULLs and
>> building columns from rows. You could allow the option of specifying that
>> NULLs won't be allowed in specific fields, or you could use something like a
>> Pandas Series that provides other ways of handling the equivalent of NULL.
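>>
>> (For instance, assuming pandas: a Series represents NULLs as NaN, at the
>> cost of promoting an integer column to float64:)
>>
>>     import pandas as pd
>>     s = pd.Series([1, None, 3])   # -> 1.0, NaN, 3.0 with dtype float64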
>>
>> For the row-to-column conversion, there's simply no way to get around the
>> pattern of appending items one at a time to an array-like container (list,
>> ndarray, Series, etc.) and dealing with the memory allocation problem. I've
>> looked at Python's list code and it's certainly no worse at that than what
>> Postgres would do. The big reason I would do that work in Python, though,
>> is that you can then use any number of types for it; they just have to
>> support the equivalent of append().
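>>
>> (Roughly, with a hypothetical 'rows' iterable standing in for whatever the
>> driver hands back; any container with an append() would do:)
>>
>>     from array import array
>>     rows = [(1,), (2,), (3,)]   # placeholder result set
>>     col = array('l')            # could just as well be a plain Python list
>>     for row in rows:
>>         col.append(row[0])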
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>> Experts in Analytics, Data Architecture and PostgreSQL
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>> 855-TREBLE2 (855-873-2532)
>
>

