Thread: fast read of binary data

fast read of binary data

From
Eildert Groeneveld
Date:
Dear All

I am currently implementing a compressed binary storage scheme for
genotyping data. These are basically vectors of binary data which may be
megabytes in size.

Our current implementation uses the data type bit varying.

What we want to do is very simple: we want to retrieve such records from
the database and transfer them unaltered to the client, which will do
something (uncompressing) with them. As massive amounts of data are to be
moved, speed is of great importance, precluding any to-and-fro
conversions.

Our current implementation uses Perl DBI; we can retrieve the data OK,
but apparently some conversion is going on.

Further, we would like to use ODBC from Fortran90 (wrapping the
C library) for such transfers. However, all sorts of funny things happen
here which look like conversion issues.

In old-fashioned network databases some decades ago (in pre-SQL times)
this was no problem. Maybe there is someone here who knows the PG
internals sufficiently well to give advice on how big blocks of memory
(i.e. bit varying records) can be transferred UNALTERED between
backend and clients.

Looking forward to your response.

greetings

Eildert



--
Eildert Groeneveld
===================================================
Institute of Farm Animal Genetics (FLI)
Mariensee 31535 Neustadt Germany
Tel : (+49)(0)5034 871155 Fax : (+49)(0)5034 871143
e-mail: eildert.groeneveld@fli.bund.de
web:    http://vce.tzv.fal.de
==================================================



Re: fast read of binary data

From
"Albe Laurenz"
Date:
Eildert Groeneveld wrote:
> I am currently implementing a compressed binary storage scheme for
> genotyping data. These are basically vectors of binary data which may be
> megabytes in size.
>
> Our current implementation uses the data type bit varying.
>
> What we want to do is very simple: we want to retrieve such records from
> the database and transfer them unaltered to the client, which will do
> something (uncompressing) with them. As massive amounts of data are to be
> moved, speed is of great importance, precluding any to-and-fro
> conversions.
>
> Our current implementation uses Perl DBI; we can retrieve the data OK,
> but apparently some conversion is going on.
>
> Further, we would like to use ODBC from Fortran90 (wrapping the
> C library) for such transfers. However, all sorts of funny things happen
> here which look like conversion issues.
>
> In old-fashioned network databases some decades ago (in pre-SQL times)
> this was no problem. Maybe there is someone here who knows the PG
> internals sufficiently well to give advice on how big blocks of memory
> (i.e. bit varying records) can be transferred UNALTERED between
> backend and clients.

Using the C API you can specify binary mode for your data, which
means that they won't be converted.

I don't think you will be able to use this with DBI or ODBC,
but maybe binary cursors can help
(http://www.postgresql.org/docs/current/static/sql-declare.html);
I don't know whether DBI or ODBC handles them well.

If you can avoid DBI or ODBC, that would be best.
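
For illustration, such a binary cursor (per the DECLARE page above) could
look like the following sketch; the table and column names are taken from
the original post:

```sql
-- Results fetched from a BINARY cursor arrive in binary format,
-- so the client receives the raw varbit data instead of '0'/'1' text.
BEGIN;
DECLARE myportal BINARY CURSOR FOR
    SELECT genotype_bits FROM v_genotype_data;
FETCH 10 FROM myportal;
CLOSE myportal;
COMMIT;
```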

Yours,
Laurenz Albe

Re: fast read of binary data

From
Arjen van der Meijden
Date:
On 12-11-2012 11:45, Eildert Groeneveld wrote:
> Dear All
>
> I am currently implementing a compressed binary storage scheme for
> genotyping data. These are basically vectors of binary data which may be
> megabytes in size.
>
> Our current implementation uses the data type bit varying.

Wouldn't 'bytea' be a more logical choice for binary data?
http://www.postgresql.org/docs/9.2/interactive/datatype-binary.html

> What we want to do is very simple: we want to retrieve such records from
> the database and transfer them unaltered to the client, which will do
> something (uncompressing) with them. As massive amounts of data are to be
> moved, speed is of great importance, precluding any to-and-fro
> conversions.
>
> Our current implementation uses Perl DBI; we can retrieve the data OK,
> but apparently some conversion is going on.
>
> Further, we would like to use ODBC from Fortran90 (wrapping the
> C library) for such transfers. However, all sorts of funny things happen
> here which look like conversion issues.
>
> In old-fashioned network databases some decades ago (in pre-SQL times)
> this was no problem. Maybe there is someone here who knows the PG
> internals sufficiently well to give advice on how big blocks of memory
> (i.e. bit varying records) can be transferred UNALTERED between
> backend and clients.

Although I have no idea whether bytea is treated differently in this
context, bit varying should be about as simple as possible (given that
it only contains 0's and 1's).

Best regards,

Arjen


Re: fast read of binary data

From
Merlin Moncure
Date:
On Mon, Nov 12, 2012 at 4:45 AM, Eildert Groeneveld
<eildert.groeneveld@fli.bund.de> wrote:
> Dear All
>
> I am currently implementing a compressed binary storage scheme for
> genotyping data. These are basically vectors of binary data which may be
> megabytes in size.
>
> Our current implementation uses the data type bit varying.
>
> What we want to do is very simple: we want to retrieve such records from
> the database and transfer them unaltered to the client, which will do
> something (uncompressing) with them. As massive amounts of data are to be
> moved, speed is of great importance, precluding any to-and-fro
> conversions.
>
> Our current implementation uses Perl DBI; we can retrieve the data OK,
> but apparently some conversion is going on.
>
> Further, we would like to use ODBC from Fortran90 (wrapping the
> C library) for such transfers. However, all sorts of funny things happen
> here which look like conversion issues.
>
> In old-fashioned network databases some decades ago (in pre-SQL times)
> this was no problem. Maybe there is someone here who knows the PG
> internals sufficiently well to give advice on how big blocks of memory
> (i.e. bit varying records) can be transferred UNALTERED between
> backend and clients.
>
> Looking forward to your response.

The fastest/best way to transfer binary data to/from Postgres is to
code directly against libpq, since most drivers wall you off from
the binary protocol (this may or may not be the case with ODBC). If I
were you, I'd write C code to manage the database and link the
compiled C object into the Fortran application. Assuming the conversion
doesn't go the way you want (briefly looking, there is a 'bytea as LO'
option you may want to explore), ODBC brings nothing but complication
in this regard unless your application has to support multiple
database vendors or you have zero C chops in-house.
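
The "write C, link it into the Fortran application" route can be sketched
as below. Everything here is hypothetical (the function name, the
signature, the stub body): the point is only the C-side shape that
Fortran90 can bind to via ISO_C_BINDING; the actual libpq calls are
omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: a C function that a Fortran90 program can call
 * through ISO_C_BINDING.  A real version would connect with libpq and
 * fetch the column in binary mode (resultFormat = 1); here a fixed byte
 * pattern stands in for the database fetch so the interop shape can be
 * shown self-contained. */
int fetch_genotype(const char *conninfo, unsigned char *buf,
                   size_t buflen, size_t *outlen)
{
    static const unsigned char fake_row[] = { 0xDE, 0xAD, 0xBE, 0xEF };

    (void)conninfo;                    /* unused in this stub */
    if (buflen < sizeof fake_row)
        return -1;                     /* caller's buffer too small */
    memcpy(buf, fake_row, sizeof fake_row);
    *outlen = sizeof fake_row;
    return 0;                          /* success */
}

/* A matching Fortran interface block might look like (untested sketch):
 *
 *   interface
 *     integer(c_int) function fetch_genotype(conninfo, buf, buflen, &
 *                                            outlen) bind(c)
 *       use iso_c_binding
 *       character(kind=c_char), intent(in)  :: conninfo(*)
 *       character(kind=c_char), intent(out) :: buf(*)
 *       integer(c_size_t), value            :: buflen
 *       integer(c_size_t), intent(out)      :: outlen
 *     end function
 *   end interface
 */
```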

merlin


Re: fast read of binary data

From
Eildert Groeneveld
Date:
On Mo, 2012-11-12 at 12:18 +0100, Albe Laurenz wrote:
> Eildert Groeneveld wrote:
> > I am currently implementing a compressed binary storage scheme for
> > genotyping data. These are basically vectors of binary data which may be
> > megabytes in size.
> >
> > Our current implementation uses the data type bit varying.
> >
> > What we want to do is very simple: we want to retrieve such records from
> > the database and transfer them unaltered to the client, which will do
> > something (uncompressing) with them. As massive amounts of data are to be
> > moved, speed is of great importance, precluding any to-and-fro
> > conversions.
> >
> > Our current implementation uses Perl DBI; we can retrieve the data OK,
> > but apparently some conversion is going on.
> >
> > Further, we would like to use ODBC from Fortran90 (wrapping the
> > C library) for such transfers. However, all sorts of funny things happen
> > here which look like conversion issues.
> >
> > In old-fashioned network databases some decades ago (in pre-SQL times)
> > this was no problem. Maybe there is someone here who knows the PG
> > internals sufficiently well to give advice on how big blocks of memory
> > (i.e. bit varying records) can be transferred UNALTERED between
> > backend and clients.
>
> Using the C API you can specify binary mode for your data, which
> meand that they won't be converted.
>
> I don't think you will be able to use this with DBI or ODBC,
> but maybe binary corsors can help
> (http://www.postgresql.org/docs/current/static/sql-declare.html),
> but I don't know if DBI or ODBC handles them well.
>
> If you can avoid DBI or ODBC, that would be best.
OK, I did have a look at the libpq library, and you are right: there is
a way to obtain binary data from the backend through PQexecParams:

    res = PQexecParams(conn,
                       "DECLARE myportal CURSOR FOR "
                       "SELECT genotype_bits FROM v_genotype_data",
                       0,       /* zero params */
                       NULL,    /* let the backend deduce param types */
                       NULL,    /* no param values */
                       NULL,    /* don't need param lengths since text */
                       NULL,    /* default to all text params */
                       1);      /* ask for binary results */

genotype_bits is defined as bit varying in the backend. When writing the
results:

    for (i = 0; i < PQntuples(res); i++)
    {
        for (j = 0; j < nFields; j++)
            fwrite(PQgetvalue(res, i, j), PQgetlength(res, i, j), 1, f);
    }

it is clear that the results are NOT in binary format:
eg(eno,snp): od -b junk |head
0000000 061 060 061 060 061 060 061 060 061 060 061 060 061 060 061 060

Clearly, these are nice 0's and 1's in ASCII, and not, as I need it, a
bit stream.

Also (and in line with this), PQgetvalue(res, i, j) seems to be of type
text.

What am I missing?



Re: fast read of binary data

From
Heikki Linnakangas
Date:
On 22.11.2012 09:54, Eildert Groeneveld wrote:
> OK, I did have a look at the libpq library, and you are right: there is
> a way to obtain binary data from the backend through PQexecParams:
>
>      res = PQexecParams(conn,
>                         "DECLARE myportal CURSOR FOR "
>                         "SELECT genotype_bits FROM v_genotype_data",
>                         0,       /* zero params */
>                         NULL,    /* let the backend deduce param types */
>                         NULL,    /* no param values */
>                         NULL,    /* don't need param lengths since text */
>                         NULL,    /* default to all text params */
>                         1);      /* ask for binary results */
>
> genotype_bits is defined as bit varying in the backend. When writing the
> results:
>
>      for (i = 0; i < PQntuples(res); i++)
>      {
>          for (j = 0; j < nFields; j++)
>              fwrite(PQgetvalue(res, i, j), PQgetlength(res, i, j), 1, f);
>      }
>
> it is clear that the results are NOT in binary format:
> eg(eno,snp): od -b junk |head
> 0000000 061 060 061 060 061 060 061 060 061 060 061 060 061 060 061 060

You need to ask for binary results when you execute the FETCH
statements. Asking for binary results on the DECLARE CURSOR statement
has no effect, as DECLARE CURSOR doesn't return any results; it's the
FETCH that follows that returns the result set.
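
Concretely, that means running the DECLARE with plain PQexec() and passing
resultFormat = 1 on the FETCH instead, e.g.
PQexecParams(conn, "FETCH ALL FROM myportal", 0, NULL, NULL, NULL, NULL, 1).
PQgetvalue() then returns raw bytes. As far as I can tell from the server's
varbit_send(), a binary bit varying value is a 4-byte bit count in network
byte order followed by the packed bits (most significant bit first,
zero-padded); treat that layout as an assumption to verify. A minimal
unpacking helper along those lines:

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack a bit varying value received in binary mode.  `val` and `vallen`
 * would come from PQgetvalue() and PQgetlength() on the binary result.
 * On success, stores the bit count in *nbits and returns a pointer to the
 * packed bytes; returns NULL on a malformed value. */
const unsigned char *varbit_payload(const unsigned char *val, size_t vallen,
                                    int32_t *nbits)
{
    if (vallen < 4)
        return NULL;                      /* too short for the header */
    /* 4-byte bit count, network (big-endian) byte order */
    *nbits = ((int32_t)val[0] << 24) | ((int32_t)val[1] << 16) |
             ((int32_t)val[2] << 8)  |  (int32_t)val[3];
    if ((size_t)((*nbits + 7) / 8) != vallen - 4)
        return NULL;                      /* payload length mismatch */
    return val + 4;                       /* packed bits, MSB first */
}
```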

- Heikki