Re: Matching unique primary keys - Mailing list pgsql-general

From Kevin Old
Subject Re: Matching unique primary keys
Date
Msg-id 1035391180.3358.146.camel@oc
Whole thread Raw
In response to Re: Matching unique primary keys  (Richard Huxton <dev@archonet.com>)
Responses Re: Matching unique primary keys  (Richard Huxton <dev@archonet.com>)
Re: Matching unique primary keys  ("Shridhar Daithankar" <shridhar_daithankar@persistent.co.in>)
List pgsql-general
Richard,

Thanks for your reply.  I realized how confusing my post sounds, so I
will attempt to clear things up.

On Wed, 2002-10-23 at 07:26, Richard Huxton wrote:
> On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote:
> > Hello all,
> >
> > I have a table of data that is very large (502 columns) and whenever I
> > query it, I only query on up to 4 columns and on average only 20 fields
> > are returned.  I've come up with a way to increase the speed of the
> > queries.  I could could put the 20 fields that are searched and returned
> > into a separate table and then link them by a unique id.  Sounds
> > easy......but I'm stuck on something.
> >
> > I am dealing with about 1 million records per day.
>
> Just to clarify - that's a table that's expanding by 1 million rows a day?

Well, I am battling two solutions.

The first option is to do just that, store everything in one huge
table.  The downfall is much slower queries as the data grows.

The other option (and one I'm currently using), store each hours data in
a table of it's own and then when querying the data union the queries
together (if the query spans across mutiple hours).
The only downfall I can see would be the increasing number of queries
unioned if the hourly search window [the user selects] grows.

On average I store 50,000 - 150,000 records per hour.

The type of data I am storing is cellular telephone records.  I receive
a datafeed of records in pipe delimited format, which I am to make
searchable to users via a web interface.  While I have this working, the
search time for a few records over 2 hours is around 35 seconds.

We are running a Sun 450 with 4 800Mhz processors and 2GB of RAM.


>
> > One option is to put a sequence on the tables, but dealing with so many
> > records I'd have to use a BIGINT and with a ceiling of
> > 9223372036854775807 it seems to me that numbers this large might slow
> > down the query process.  I realize it would take quite a while to get to
> > this point, but would like other's opinions.
>
> Well an int4 would give you enough unique id's for about 4000 days, which may
> or may not be enough for you. One thing I'd say is that querying a table with
> 4 billion rows with 500 columns is going to require a hefty machine.
>
> Using a bigint (int8) means practically unlimited expansion, but handling int8
> is slower than int4 on a 32-bit system. Details will depend on your hardware,
> and on 64-bit systems I'd expect no difference (but test - I haven't).
>
> Having said that, I can't believe the impact is relevant when compared with
> the overhead of 500 columns.

Yes, I agree.  I guess I just missed the obvious.

>
> > Another option is that I have 3 fields that when combine, make each
> > record unique.  Is there some way I can combine these dynamically and
> > then use "views" of them to reference my records and display them from
> > the larger table.
>
> So - those 3 fields are a candidate key, and in the absence of any other info
> are your primary key. A first design would be to repeat these 3 fields in
> each table (if you split the 500 column one) and then if performance sucks
> add a sequence to the "core" table and use that as a key instead.
>
> I'd think it's unlikely that your table should (in design terms) be over 500
> columns wide. If lots of these values are defaults or nulls then that's a
> sign that things need to be split up.
>
> If the 20 columns you mention being queried against are logically similar then
> that should form your foo_core table. If splitting the columns into logical
> groups can be done, that's where I'd start. If nothing else, a logical split
> is less likely to need changes in the future - an important consideration
> once you have a few hundred million records.
>
> >From a performance point of view, I'd think it would be helpful to have your 4
> query columns in one table and your 20 results columns in one table (possibly
> the same table). However, if they aren't logically related I'd be tempted to
> still split things logically and set up a separate foo_query table and
> maintain it using triggers.

I agree with the design terms that a table should reach 500+ columns.
But, they track a ton of data about every cell phone call.  Yes, a lot
of the fields are zeros, but they track things like electronics, etc and
will only show data if something is wrong.  I have done testing with
placing the 20 coulmns in a separate table and searching on that table
is lightening fast on the same search as above that took 35 seconds.  It
was so fast I had to have my code run the query 100 times in order to
get a vaild Benchmark time from the Perl module!!

My main problem is the best way to track the data once it is broken up
into separate tables.  Possible solutions I have are:

1) Copy the 3 "unique fields" into every table (as you mention above).
My only question with this is how am I to do a join on a multicolumn
primary key?  The only thing I can think of would be something like:

SELECT *
FROM data_index, data_tbl
WHERE data_index.id1 = data_tbl.id1
AND data_index.id2 = data_tbl.id2
AND data_index.id3 = data_tbl.id3

Seems inefficient to me to do it this way because of the 3 joins.  Maybe
I'm wrong?

2) Put a sequence on the data_index (20 fields) and the data_tbl (all
other fields), storing one hour of data per table.  The sequence would
be a int4 on each table.  The appropriate data would be inserted in
their respective tables and the sequences would be kept on the same
number.

I have a few questions on the sequence issue.  My data is only inserted
once.  No updates.  I want to be able to store and search as much data
as I can.

There are 3 ways I see to store this data:

1) I can have a master index table where the 20 fields will be stored
for all hours.  This will become a large table quickly, but will only be
20 fields long.

This will allow me to query one table for everything, but when data from
the the other 480 fields is needed I'd have to somehow link the data
from the master index table to the appropriate data table....could get
messy.

2) I can have a table to keep up with the date, time, hourly index
table, and hourly data table.  I can use this as an aid in building my
unioned query.  The tables are named with date and hour in them.
Assuming the table looks like:

date date,
time time,
hr_idx_tbl char(16),
hr_data_tbl char(16)

Building a [union] query could be like:

SELECT hr_idx_tbl
FROM master_tbl_idx
WHERE CONCAT(DATE," ",TIME) BETWEEN ("2002/10/23 04:00:00" AND
"2002/10/23 09:00:00")

The results would be index tables that I'd need to query.  I'd then
generate a select for each table, and then union them together.

3) I can do the above, but store the data in tables by day instead of
hours.


Ok, I think I've explained it as much as I can.

On other note, how would I keep the sequences the same in both the index
table and the data table?  In a transaction, can I insert the 480 fields
using the nextval() for the data table index field and then have another
insert for the 20 fields and use currval() for the index table index
field?  Any other suggestions on how I'd do this?

With all that said, I'd appreciate any help you can offer.

Any help is appreciated!

Kevin
kold@carolina.rr.com


pgsql-general by date:

Previous
From: "Erwan DUROSELLE"
Date:
Subject: Rép. : Re: Alter table syntax
Next
From: Alexandr Kovalenko
Date:
Subject: --enable-multibyte=UNICODE, LC_COLLATE and friends