2D partitioning of VLDB - sane or not? - Mailing list pgsql-hackers

From Jason Nerothin
Subject 2D partitioning of VLDB - sane or not?
Date
Msg-id f42b58b90708091434w396fe7e0s418256b29d32ce7d@mail.gmail.com
Whole thread Raw
Responses Re: 2D partitioning of VLDB - sane or not?  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
I am building up a schema for storing a bunch of data about proteins, which on a certain level can be modelled with
quitesimple tables. The problem is that the database I am building needs to house lots of it >10TB and growing, with
onetable in particular threatening to top 1TB. In the case of the table and in the case of the overall database, the
sizecan be expected to grow quickly (and most of it can never be deleted). <br /><br />In the past, with smaller
tables,I have had success partitioning on a 64-bit crc hash that takes a more or less uniform distribution of input
dataand pumps out a more-or-less uniform distribution of partitioned data with a very small probability of collision.
Thehash itself is implemented as a c add-on library, returns a BIGINT and serves as a candidate key for what for our
purposeswe can call a protein record. <br /><br />Now back to the big table, which relates two of these records (in a
theoreticallysymmetric way). Assuming I set the the table up as something like:<br /><br />CREATE TABLE
big_protein_relation_partition_dimA_dimB{<br/> protein_id_a BIGINTEGER NOT NULL CHECK( bin_num(protein_id_a) = dimA ),
---key (hash) from some table<br />protein_id_a BIGINTEGER NOT NULL CHECK( bin_num(protein_id_b) = dimB ), --- key
(hash)from some table<br />...<br />} <br /><br />and do a little c bit-twiddling and define some binning mechanism on
theBIGINTEGERs.<br /><br />As near I can tell, binning out along the A and B dimensions into 256 bins, I shouldn't be
inany danger of running out of OIDs or anything like that (despite having to deal with 2^16 tables). Theoretically, at
least,I should be able to do UNIONS along each axis (to avoid causing the analyzer too much overhead) and use range
exclusionto make my queries zip along with proper indexing. <br /><br />Aside from running into a known bug with "too
manytriggers" when creating gratuitous indices on these tables, I feel as it may be possible to do what I want without
breakingeverything. But then again, am I taking too many liberties with technology that maybe didn't have use cases
likethis one in mind? <br /><br />Jason<br /><br />-- <br />========================================================<br
/>JasonNerothin<br />Programmer/Analyst IV - Database Administration<br />UCLA-DOE Institute for Genomics &
Proteomics<br/>Howard Hughes Medical Institute <br />========================================================<br />611
C.E.Young Drive East   | Tel: (310) 206-3907<br />105 Boyer Hall, Box 951570  | Fax: (310) 206-3914<br />Los Angeles,
CA90095. USA | Mail: <a href="mailto:jason@mbi.ucla.edu"> jason@mbi.ucla.edu</a><br
/>========================================================<br/><a
href="http://www.mbi.ucla.edu/~jason">http://www.mbi.ucla.edu/~jason</a><br
/>======================================================== 

pgsql-hackers by date:

Previous
From: Chris Mair
Date:
Subject: Re: RIP: Buildfarm member Baiji ??
Next
From: Robert Treat
Date:
Subject: Re: crypting prosrc in pg_proc