Re: Real-life range datasets - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: Real-life range datasets
Date
Msg-id Pine.LNX.4.64.1112221945580.14072@sn.sai.msu.ru
Whole thread Raw
In response to Re: Real-life range datasets  (Benedikt Grundmann <bgrundmann@janestreet.com>)
Responses Re: Real-life range datasets
List pgsql-hackers
Bene,

we have pgfoundry project http://pgfoundry.org/projects/dbsamples/.
Since your sample database is very important (for me also), I suggest to use
this site.

Oleg
On Thu, 22 Dec 2011, Benedikt Grundmann wrote:

> Hello,
>
> We have a table in a postgres 8.4 database that would make use of date
> ranges and exclusion constraints if they were available.  Sadly I cannot
> give you the data as it is based on data we are paying for and as part
> of the relevant licenses we are obliqued to not give the data to third
> parties.
>
> Basically the tables keep meta data about financial instruments on
> a given day.  Thanks to corporate actions (companies merging, splitting
> up, etc...) that meta data can be different from one day to the next.
>
> One way to model such a table is:
>
> identifier, date, payload columns...
>
>
> unique index on (date, identifier)
>
> (and in fact some of the payload columns are unique per day indices
> as well).
>
> But because there are a large number of rows per day and most don't
> change this is a very wasteful representation.
>
> Instead we use this
>
> identifier, effective_from, effective_until, payload columns...
>
> And we have some clever plpgsql functions that merge a days snapshot
> into that representation.  That happens only a few times per day and
> is currently quite slow mostly because a lot of time is spend in
> validation triggers to check that there are no overlapping entries
> for effective_from,effective_until for jane_symbol and a few other
> identifiers.
>
> The most common operations are:
>
>  Get all or most rows of a given day
>
> (select ... from instruments where :date between effective_from and effective_until)
>
> left join of the instruments (again in the normal case constrained to
> one day but in same cases periods of a week or a few month)
>
> select ... from t left join instruments on
>  t.jane_symbol = instruments.jane_symbol
>  t.date between instruments.effective_from and t.effective_until
>  where t.date = X
>    and additional constraint on the number of rows from t
>
> With t a huge table clustered on date with roughly 500,000 to 2,000,000
> entries per day.  The left join would work most of the time (my guess is
> more than 90%).  But there are entries in t where the jane_symbol would
> not be in instruments (sadly).
>
> Current size (immediately after a cluster):
>
>  table     toast        (all indices)   total
> | 1268 MB | 900 MB     | 693 MB        | 2861 MB
>
> => select min(effective_from), max(effective_from) from instruments;
>    min     |    max
> ------------+------------
> 2011-05-30 | 2011-12-21
> (1 row)
>
> b=> select count(*) from instruments where current_date - 1 between effective_from and effective_until ;
> count
> --------
> 358741
> (1 row)
>
> I should be able to give you a table with the same characteristics as
> the instruments table but bogus data by replacing all entries in the
> table with random strings of the same length or something like that.
> I can probably take a little bit of time during this or the next week
> to generate such "fake" real world data ;-)   Is there an ftp site to
> upload the gzipped pg_dump file to?
>
> Cheers,
>
> Bene
>
> On 20/12/11 16:48, Alexander Korotkov wrote:
>> Hackers,
>>
>> For better GiST indexing of range types it's important to have real-life
>> datasets for testing on. Real-life range datasets would help to proof (or
>> reject) some concepts and get more realistic benchmarks. Also, it would be
>> nice to know what queries you expect to run fast on that datasets. Ideally
>> it should be real-life set of queries, but it also could be your
>> presentation of what are typical queries  for such datasets.
>> Thanks!
>>
>> -----
>> With best regards,
>> Alexander Korotkov.
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Typed hstore proposal
Next
From: Benedikt Grundmann
Date:
Subject: Re: Typed hstore proposal