Re: Creating large database of MD5 hash values - Mailing list pgsql-performance

From Florian Weimer
Subject Re: Creating large database of MD5 hash values
Date
Msg-id 82skxsqq67.fsf@mid.bfk.de
Whole thread Raw
In response to Re: Creating large database of MD5 hash values  ("Jon Stewart" <jonathan.l.stewart@gmail.com>)
List pgsql-performance
* Jon Stewart:

>>  BYTEA is slower to load and a bit inconvenient to use from DBI, but
>>  occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>>  bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average, instead 2, and a complex
form of double-decoding is necessary.

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version
of $sth->execute (or things like $sth->selectarray, $sth->do), you
must use explicit binding by parameter index in order to specify the
type information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think.  So it's better to
denormalize in this case (if that's what you were alluding to in your
original post).

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).

In this case, PostgreSQL's in-memory bitmap indices should give you
most of the effect of your hash <-> ID mapping anyway.

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it.  This type of query might benefit from your
hash <-> ID mapping, however, because the working set is smaller.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

pgsql-performance by date:

Previous
From: Matthew
Date:
Subject: Re: Performance increase with elevator=deadline
Next
From: Craig Ringer
Date:
Subject: Re: Performance increase with elevator=deadline