Thread: Matching unique primary keys

Matching unique primary keys

From
Kevin Old
Date:
Hello all,

I have a table of data that is very large (502 columns) and whenever I
query it, I only query on up to 4 columns and on average only 20 fields
are returned.  I've come up with a way to increase the speed of the
queries.  I could could put the 20 fields that are searched and returned
into a separate table and then link them by a unique id.  Sounds
easy......but I'm stuck on something.

I am dealing with about 1 million records per day.

One option is to put a sequence on the tables, but dealing with so many
records I'd have to use a BIGINT and with a ceiling of
9223372036854775807 it seems to me that numbers this large might slow
down the query process.  I realize it would take quite a while to get to
this point, but would like other's opinions.

Another option is that I have 3 fields that when combine, make each
record unique.  Is there some way I can combine these dynamically and
then use "views" of them to reference my records and display them from
the larger table.

I realize this is very confusing, please get back with me if I should
elaborate on something.

Any help is appreciated.

Thanks,
Kevin



Re: Matching unique primary keys

From
Richard Huxton
Date:
On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote:
> Hello all,
>
> I have a table of data that is very large (502 columns) and whenever I
> query it, I only query on up to 4 columns and on average only 20 fields
> are returned.  I've come up with a way to increase the speed of the
> queries.  I could could put the 20 fields that are searched and returned
> into a separate table and then link them by a unique id.  Sounds
> easy......but I'm stuck on something.
>
> I am dealing with about 1 million records per day.

Just to clarify - that's a table that's expanding by 1 million rows a day?

> One option is to put a sequence on the tables, but dealing with so many
> records I'd have to use a BIGINT and with a ceiling of
> 9223372036854775807 it seems to me that numbers this large might slow
> down the query process.  I realize it would take quite a while to get to
> this point, but would like other's opinions.

Well an int4 would give you enough unique id's for about 4000 days, which may
or may not be enough for you. One thing I'd say is that querying a table with
4 billion rows with 500 columns is going to require a hefty machine.

Using a bigint (int8) means practically unlimited expansion, but handling int8
is slower than int4 on a 32-bit system. Details will depend on your hardware,
and on 64-bit systems I'd expect no difference (but test - I haven't).

Having said that, I can't believe the impact is relevant when compared with
the overhead of 500 columns.

> Another option is that I have 3 fields that when combine, make each
> record unique.  Is there some way I can combine these dynamically and
> then use "views" of them to reference my records and display them from
> the larger table.

So - those 3 fields are a candidate key, and in the absence of any other info
are your primary key. A first design would be to repeat these 3 fields in
each table (if you split the 500 column one) and then if performance sucks
add a sequence to the "core" table and use that as a key instead.

I'd think it's unlikely that your table should (in design terms) be over 500
columns wide. If lots of these values are defaults or nulls then that's a
sign that things need to be split up.

If the 20 columns you mention being queried against are logically similar then
that should form your foo_core table. If splitting the columns into logical
groups can be done, that's where I'd start. If nothing else, a logical split
is less likely to need changes in the future - an important consideration
once you have a few hundred million records.

From a performance point of view, I'd think it would be helpful to have your 4
query columns in one table and your 20 results columns in one table (possibly
the same table). However, if they aren't logically related I'd be tempted to
still split things logically and set up a separate foo_query table and
maintain it using triggers.

> I realize this is very confusing, please get back with me if I should
> elaborate on something.

Difficult for anyone on the list to give specific advice without knowing your
hardware, performance requirements and schema details. If my post hasn't
helped, try another message with more details.

Bruce Momjian has written some notes on the internals of PG - how it accesses
disk blocks and interacts with the OS disk cache. They're either in the
developers section of the website or on techdocs.postgresql.org - can't
remember which sorry. Might give you something to think about while you're
looking at this problem.

--
  Richard Huxton

Re: Matching unique primary keys

From
Kevin Old
Date:
Richard,

Thanks for your reply.  I realized how confusing my post sounds, so I
will attempt to clear things up.

On Wed, 2002-10-23 at 07:26, Richard Huxton wrote:
> On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote:
> > Hello all,
> >
> > I have a table of data that is very large (502 columns) and whenever I
> > query it, I only query on up to 4 columns and on average only 20 fields
> > are returned.  I've come up with a way to increase the speed of the
> > queries.  I could could put the 20 fields that are searched and returned
> > into a separate table and then link them by a unique id.  Sounds
> > easy......but I'm stuck on something.
> >
> > I am dealing with about 1 million records per day.
>
> Just to clarify - that's a table that's expanding by 1 million rows a day?

Well, I am battling two solutions.

The first option is to do just that, store everything in one huge
table.  The downfall is much slower queries as the data grows.

The other option (and one I'm currently using), store each hours data in
a table of it's own and then when querying the data union the queries
together (if the query spans across mutiple hours).
The only downfall I can see would be the increasing number of queries
unioned if the hourly search window [the user selects] grows.

On average I store 50,000 - 150,000 records per hour.

The type of data I am storing is cellular telephone records.  I receive
a datafeed of records in pipe delimited format, which I am to make
searchable to users via a web interface.  While I have this working, the
search time for a few records over 2 hours is around 35 seconds.

We are running a Sun 450 with 4 800Mhz processors and 2GB of RAM.


>
> > One option is to put a sequence on the tables, but dealing with so many
> > records I'd have to use a BIGINT and with a ceiling of
> > 9223372036854775807 it seems to me that numbers this large might slow
> > down the query process.  I realize it would take quite a while to get to
> > this point, but would like other's opinions.
>
> Well an int4 would give you enough unique id's for about 4000 days, which may
> or may not be enough for you. One thing I'd say is that querying a table with
> 4 billion rows with 500 columns is going to require a hefty machine.
>
> Using a bigint (int8) means practically unlimited expansion, but handling int8
> is slower than int4 on a 32-bit system. Details will depend on your hardware,
> and on 64-bit systems I'd expect no difference (but test - I haven't).
>
> Having said that, I can't believe the impact is relevant when compared with
> the overhead of 500 columns.

Yes, I agree.  I guess I just missed the obvious.

>
> > Another option is that I have 3 fields that when combine, make each
> > record unique.  Is there some way I can combine these dynamically and
> > then use "views" of them to reference my records and display them from
> > the larger table.
>
> So - those 3 fields are a candidate key, and in the absence of any other info
> are your primary key. A first design would be to repeat these 3 fields in
> each table (if you split the 500 column one) and then if performance sucks
> add a sequence to the "core" table and use that as a key instead.
>
> I'd think it's unlikely that your table should (in design terms) be over 500
> columns wide. If lots of these values are defaults or nulls then that's a
> sign that things need to be split up.
>
> If the 20 columns you mention being queried against are logically similar then
> that should form your foo_core table. If splitting the columns into logical
> groups can be done, that's where I'd start. If nothing else, a logical split
> is less likely to need changes in the future - an important consideration
> once you have a few hundred million records.
>
> >From a performance point of view, I'd think it would be helpful to have your 4
> query columns in one table and your 20 results columns in one table (possibly
> the same table). However, if they aren't logically related I'd be tempted to
> still split things logically and set up a separate foo_query table and
> maintain it using triggers.

I agree with the design terms that a table should reach 500+ columns.
But, they track a ton of data about every cell phone call.  Yes, a lot
of the fields are zeros, but they track things like electronics, etc and
will only show data if something is wrong.  I have done testing with
placing the 20 coulmns in a separate table and searching on that table
is lightening fast on the same search as above that took 35 seconds.  It
was so fast I had to have my code run the query 100 times in order to
get a vaild Benchmark time from the Perl module!!

My main problem is the best way to track the data once it is broken up
into separate tables.  Possible solutions I have are:

1) Copy the 3 "unique fields" into every table (as you mention above).
My only question with this is how am I to do a join on a multicolumn
primary key?  The only thing I can think of would be something like:

SELECT *
FROM data_index, data_tbl
WHERE data_index.id1 = data_tbl.id1
AND data_index.id2 = data_tbl.id2
AND data_index.id3 = data_tbl.id3

Seems inefficient to me to do it this way because of the 3 joins.  Maybe
I'm wrong?

2) Put a sequence on the data_index (20 fields) and the data_tbl (all
other fields), storing one hour of data per table.  The sequence would
be a int4 on each table.  The appropriate data would be inserted in
their respective tables and the sequences would be kept on the same
number.

I have a few questions on the sequence issue.  My data is only inserted
once.  No updates.  I want to be able to store and search as much data
as I can.

There are 3 ways I see to store this data:

1) I can have a master index table where the 20 fields will be stored
for all hours.  This will become a large table quickly, but will only be
20 fields long.

This will allow me to query one table for everything, but when data from
the the other 480 fields is needed I'd have to somehow link the data
from the master index table to the appropriate data table....could get
messy.

2) I can have a table to keep up with the date, time, hourly index
table, and hourly data table.  I can use this as an aid in building my
unioned query.  The tables are named with date and hour in them.
Assuming the table looks like:

date date,
time time,
hr_idx_tbl char(16),
hr_data_tbl char(16)

Building a [union] query could be like:

SELECT hr_idx_tbl
FROM master_tbl_idx
WHERE CONCAT(DATE," ",TIME) BETWEEN ("2002/10/23 04:00:00" AND
"2002/10/23 09:00:00")

The results would be index tables that I'd need to query.  I'd then
generate a select for each table, and then union them together.

3) I can do the above, but store the data in tables by day instead of
hours.


Ok, I think I've explained it as much as I can.

On other note, how would I keep the sequences the same in both the index
table and the data table?  In a transaction, can I insert the 480 fields
using the nextval() for the data table index field and then have another
insert for the 20 fields and use currval() for the index table index
field?  Any other suggestions on how I'd do this?

With all that said, I'd appreciate any help you can offer.

Any help is appreciated!

Kevin
kold@carolina.rr.com


Re: Matching unique primary keys

From
Richard Huxton
Date:
On Wednesday 23 Oct 2002 4:39 pm, Kevin Old wrote:
> Richard,
>
> Thanks for your reply.  I realized how confusing my post sounds, so I
> will attempt to clear things up.

Not confusing, but perhaps a little short. It's a tricky balancing act though
- 8K of problem and people will skip past.

> > > I have a table of data that is very large (502 columns) and whenever I
> > > query it, I only query on up to 4 columns and on average only 20 fields
> > > are returned.  I've come up with a way to increase the speed of the
> > > queries.  I could could put the 20 fields that are searched and
> > > returned into a separate table and then link them by a unique id.
> > > Sounds easy......but I'm stuck on something.
> > >
> > > I am dealing with about 1 million records per day.

> On average I store 50,000 - 150,000 records per hour.
>
> The type of data I am storing is cellular telephone records.  I receive
> a datafeed of records in pipe delimited format, which I am to make
> searchable to users via a web interface.  While I have this working, the
> search time for a few records over 2 hours is around 35 seconds.

Yuck - are these aggregate queries ("show me the totals per hour") or detail
queries ("give me a list of all calls between 0800 and 1130")?

> We are running a Sun 450 with 4 800Mhz processors and 2GB of RAM.

Good - someone will know enough about that setup to comment usefully. Me,
sorry no.

> > I'd think it's unlikely that your table should (in design terms) be over
> > 500 columns wide. If lots of these values are defaults or nulls then
> > that's a sign that things need to be split up.

> I agree with the design terms that a table should reach 500+ columns.
> But, they track a ton of data about every cell phone call.  Yes, a lot
> of the fields are zeros, but they track things like electronics, etc and
> will only show data if something is wrong.

If 99% of your data is saying nothing, don't record it. Split it out to a
separate table and only record a row where there's something useful. If you
had to do it by hand, watching a big bank of lights and writing things down
in a book you'd only bother if they were red.

> I have done testing with
> placing the 20 coulmns in a separate table and searching on that table
> is lightening fast on the same search as above that took 35 seconds.  It
> was so fast I had to have my code run the query 100 times in order to
> get a vaild Benchmark time from the Perl module!!

Right - so you've got your solution then.

> My main problem is the best way to track the data once it is broken up
> into separate tables.  Possible solutions I have are:
>
> 1) Copy the 3 "unique fields" into every table (as you mention above).
> My only question with this is how am I to do a join on a multicolumn
> primary key?  The only thing I can think of would be something like:
>
> SELECT *
> FROM data_index, data_tbl
> WHERE data_index.id1 = data_tbl.id1
> AND data_index.id2 = data_tbl.id2
> AND data_index.id3 = data_tbl.id3
>
> Seems inefficient to me to do it this way because of the 3 joins.  Maybe
> I'm wrong?

No - that's what you do. It's slower than joining on a single int4, but the
big advantage is that each table then means something by itself. The
"secondary" tables can be split out/exported/whatever and you can see at a
glance what they refer too.

Having said that, 3 columns means more typing and plenty of people replace
multi-column keys with an auto-incrementing ID. Keep the unique index on the
3 columns though.

> 2) Put a sequence on the data_index (20 fields) and the data_tbl (all
> other fields), storing one hour of data per table.  The sequence would
> be a int4 on each table.  The appropriate data would be inserted in
> their respective tables and the sequences would be kept on the same
> number.
>
> I have a few questions on the sequence issue.  My data is only inserted
> once.  No updates.  I want to be able to store and search as much data
> as I can.
>
> There are 3 ways I see to store this data:
>
> 1) I can have a master index table where the 20 fields will be stored
> for all hours.  This will become a large table quickly, but will only be
> 20 fields long.
>
> This will allow me to query one table for everything, but when data from
> the the other 480 fields is needed I'd have to somehow link the data
> from the master index table to the appropriate data table....could get
> messy.

Not necessarily - if there are patterns to queries, you can create a view so
it looks like a simple query to the application. SQL is forgiving on
whitespace too, so laying out a query over several lines with indents helps a
lot too.

BTW - 480 fields still seems very wide - why not split it into half a dozen
tables based on what the fields mean?

> 2) I can have a table to keep up with the date, time, hourly index
> table, and hourly data table.  I can use this as an aid in building my
> unioned query.  The tables are named with date and hour in them.
> Assuming the table looks like:
>
> date date,
> time time,
> hr_idx_tbl char(16),
> hr_data_tbl char(16)
>
> Building a [union] query could be like:
>
> SELECT hr_idx_tbl
> FROM master_tbl_idx
> WHERE CONCAT(DATE," ",TIME) BETWEEN ("2002/10/23 04:00:00" AND
> "2002/10/23 09:00:00")
>
> The results would be index tables that I'd need to query.  I'd then
> generate a select for each table, and then union them together.

If this is historical data, I'd go for two tables - "recent" and "old" - most
queries will be on the recent table (because that's how you're defining
recent) and if you need to go further back there's only two tables to union.
Of course, your situation may vary.

> 3) I can do the above, but store the data in tables by day instead of
> hours.

You'll have to test. It's going do depend on if people query for specific
hours or for totals for January.

> Ok, I think I've explained it as much as I can.
>
> On other note, how would I keep the sequences the same in both the index
> table and the data table?  In a transaction, can I insert the 480 fields
> using the nextval() for the data table index field and then have another
> insert for the 20 fields and use currval() for the index table index
> field?  Any other suggestions on how I'd do this?

That's the way. Note that you could do several of these in one transaction
which would help with performance on inserts.

BTW - when I mentioned splitting the tables, I didn't mean into 20+480 instead
of 500, I meant 20+40+23+56... etc. according to the *meaning* of the data.
If much of the data really isn't worth recording much of the time, I'd
probably have:

CREATE TABLE (
    id1 int4,
    id2 whatever,
    id3 int4,
    item_name varchar(64),
    item_value varchar(16)
);

That means you don't have any data stored you don't need.

> With all that said, I'd appreciate any help you can offer.
>
> Any help is appreciated!

Hope it has been.

PS - another idea you may want to look into is pre-calculating totals per-hour
(or whatever) if that's the sort of thing your users are interested in.

PPS - it might be you don't want a database - if all you're doing is looking
up a few rows based on simple criteria, a flat-file system might be better.
--
  Richard Huxton

Re: Matching unique primary keys

From
"Shridhar Daithankar"
Date:
On 23 Oct 2002 at 12:39, Kevin Old wrote:

> 1) Copy the 3 "unique fields" into every table (as you mention above).
> My only question with this is how am I to do a join on a multicolumn
> primary key?  The only thing I can think of would be something like:
>
> SELECT *
> FROM data_index, data_tbl
> WHERE data_index.id1 = data_tbl.id1
> AND data_index.id2 = data_tbl.id2
> AND data_index.id3 = data_tbl.id3

Well you can store the 3 fields in a table, declare the combination of them as
primary key and add a fourth field that is sequence generated and use this
sequence number as primary key in rest of them tables. Will reuce the IO
requirement by oodles..

HTH

Bye
 Shridhar

--
ASCII:    The control code for all beginning programmers and those who would
become computer literate.  Etymologically, the term has come down as    a
contraction of the often-repeated phrase "ascii and you shall    receive."        --
Robb Russon