From: Matthew Rocklin
Subject: Re: [psycopg] Turbo ODBC
Msg-id: CAM4sjbmeRHLD9RqmqYP6zZoojy8z7Y+zsRnSQCGUaUXJvd-EkQ@mail.gmail.com
In response to: Re: [psycopg] Turbo ODBC (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List: psycopg
CC'ing Wes McKinney in case he's interested in this problem (how do we efficiently move data from Postgres into Python).

On Mon, Jan 16, 2017 at 6:14 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 1/15/17 3:25 PM, Adrian Klaver wrote:
Have you looked at asyncpg:

I'm pulling in Matt Rocklin, who I've been working with on some plpython improvements. Here are his off-list replies to me:

On 1/15/17 5:15 PM, Matthew Rocklin wrote:
You might also contact Wes McKinney and check out
Arrow: http://pyarrow.readthedocs.io/en/latest/

This project was designed for inter-system data interchange.  It would
help with conversion to Spark and other popular tools as well.
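
For a concrete picture, here is a minimal pyarrow sketch (the column name and values are made up, and it assumes a reasonably recent pyarrow):

    import pyarrow as pa

    # build one Arrow column; Arrow's columnar memory layout is
    # language-agnostic, so other systems can read it without copying
    col = pa.array([1, 2, 3], type=pa.int32())
    table = pa.Table.from_arrays([col], names=['x'])

    df = table.to_pandas()   # hand the same columns to pandas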

On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io> wrote:

    I haven't taken a look at asyncpg.

    Quick note: you might want to be careful about mixing up HDFS and
    HDF5.  Postgres's competition for data storage is HDF5, not HDFS
    (which stands for the Hadoop Distributed File System).

    I still think that the best thing to do here would be to dump out
    Python arrays for each of the columns in a result set.  I suspect
    that you could handily beat /any/ system that doesn't do this.  This
    would avoid the performance pain of building up Python objects, and
    you could continue to use just the standard library.

    You can stick to pure Python and still send arrays encoded as
    bytes.  We can convert these to NumPy or Pandas trivially, with zero
    overhead.

    In [1]: from array import array   # this is in the standard library
    In [2]: b = b'\x00' * 16  # 16 zero bytes as a bytestring
    In [3]: a = array('L', b) # unsigned long: 8 bytes on most 64-bit Unix ('Q' guarantees 8)
    In [4]: a
    Out[4]: array('L', [0, 0])
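
    Continuing that session, the NumPy conversion really is zero-copy (a
    sketch, assuming NumPy is installed):

    In [5]: import numpy as np
    In [6]: np.frombuffer(a, dtype=np.uint64)  # a view over the same buffer, no copy
    Out[6]: array([0, 0], dtype=uint64)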

    We would only be bound by

     1. The cost in Postgres to transpose the data from row-major to
        column-major
     2. The cost to move bytes across a socket

    This should run at gigabytes per second.  Everything else is likely
    to be competing in the 50-100 MB/s range at top speed, is my guess.
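
    To make those two costs concrete, here's a toy pure-Python sketch
    (the rows are made up):

    from array import array

    rows = [(1, 2.5), (2, 3.5)]  # pretend these arrived row-major off the wire

    # cost 1: transpose into one typed array per column
    ids  = array('q', (r[0] for r in rows))   # 8-byte signed ints
    vals = array('d', (r[1] for r in rows))   # 8-byte floats

    # cost 2: the bytes that actually cross the socket
    payload = ids.tobytes() + vals.tobytes()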

My guess is that the big gains for external communication would come from:

1) reducing latency effects
2) reducing data marshaling

1 could be accomplished in various ways, and some changes to the FE-BE protocol might be helpful. If we can demonstrate a clear win, we might be able to get protocol changes. I do suspect this needs to be tunable, though. My 4x improvement to plpy.execute comes from changing the equivalent of one large batch into single-row operations. Moving data to Python over a local filesystem socket would be hurt by a large batch size, while something on a high-latency connection would presumably benefit from moderately large batches, since that reduces overall TCP overhead.
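
You can already tune batching from the client side with psycopg2's server-side cursors; a sketch (the DSN, table, and process() consumer are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=test")          # hypothetical DSN
    cur = conn.cursor(name="stream")                # named cursor => server-side
    cur.itersize = 2000                             # rows per round trip; tune for link latency
    cur.execute("SELECT id, value FROM big_table")  # hypothetical table
    for row in cur:                                 # psycopg2 fetches itersize rows at a time
        process(row)                                # hypothetical consumer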

I think 2 would necessitate FE-BE changes, although some of that could be tested without them. Right now everything gets marshaled into a simple text format (i.e., ints become a string of digits), but for test purposes there are ways you could run that through the binary output functions. In the case of int4 you'd get 4 bytes in network order, which should be faster to handle and could be passed directly to something like an ndarray.
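
That binary path is easy to simulate today; a sketch with fabricated values:

    import struct
    import numpy as np

    buf = struct.pack('!3i', 1, 2, 3)      # three int4 values in network byte order

    # reinterpret the bytes directly; no per-value Python objects are built
    arr = np.frombuffer(buf, dtype='>i4')  # big-endian int4 -> [1, 2, 3]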

The two problems you run into with the array type are handling NULLs and building columns from rows. You could allow the option of specifying that NULLs won't be allowed in specific fields, or you could use something like a Pandas Series, which provides other ways of handling the equivalent of NULL.
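
For the Series route, one possibility is pandas' nullable integer extension type (a sketch assuming pandas >= 0.24; values made up):

    import pandas as pd

    # the nullable Int32 dtype represents NULL (<NA>) without coercing
    # the whole column to float the way a plain ndarray would
    s = pd.array([1, None, 3], dtype='Int32')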

For the row-to-column conversion, there's simply no way to get around the pattern of appending items one at a time to an array-like container (list, ndarray, Series, etc.) and dealing with the memory-allocation problem. I've looked at Python's list code, and it's certainly no worse at that than what Postgres would do. The big reason I would do that work in Python, though, is that you then have the ability to use any number of types; they just have to support the equivalent of append().
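
In its simplest form that pattern looks like this (the rows are made up; anything with an append() slots in):

    from array import array

    rows = [(1, 'a'), (2, 'b'), (3, 'c')]  # hypothetical row stream

    col = array('q')       # could just as well be a list or a Series builder
    for r in rows:
        col.append(r[0])   # one value at a time; the container handles
                           # its own over-allocation as it grows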

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
