Re: Matching unique primary keys - Mailing list pgsql-general
From | Kevin Old |
---|---|
Subject | Re: Matching unique primary keys |
Date | |
Msg-id | 1035391180.3358.146.camel@oc Whole thread Raw |
In response to | Re: Matching unique primary keys (Richard Huxton <dev@archonet.com>) |
Responses |
Re: Matching unique primary keys
Re: Matching unique primary keys |
List | pgsql-general |
Richard, Thanks for your reply. I realized how confusing my post sounds, so I will attempt to clear things up. On Wed, 2002-10-23 at 07:26, Richard Huxton wrote: > On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote: > > Hello all, > > > > I have a table of data that is very large (502 columns) and whenever I > > query it, I only query on up to 4 columns and on average only 20 fields > > are returned. I've come up with a way to increase the speed of the > > queries. I could could put the 20 fields that are searched and returned > > into a separate table and then link them by a unique id. Sounds > > easy......but I'm stuck on something. > > > > I am dealing with about 1 million records per day. > > Just to clarify - that's a table that's expanding by 1 million rows a day? Well, I am battling two solutions. The first option is to do just that, store everything in one huge table. The downfall is much slower queries as the data grows. The other option (and one I'm currently using), store each hours data in a table of it's own and then when querying the data union the queries together (if the query spans across mutiple hours). The only downfall I can see would be the increasing number of queries unioned if the hourly search window [the user selects] grows. On average I store 50,000 - 150,000 records per hour. The type of data I am storing is cellular telephone records. I receive a datafeed of records in pipe delimited format, which I am to make searchable to users via a web interface. While I have this working, the search time for a few records over 2 hours is around 35 seconds. We are running a Sun 450 with 4 800Mhz processors and 2GB of RAM. > > > One option is to put a sequence on the tables, but dealing with so many > > records I'd have to use a BIGINT and with a ceiling of > > 9223372036854775807 it seems to me that numbers this large might slow > > down the query process. I realize it would take quite a while to get to > > this point, but would like other's opinions. > > Well an int4 would give you enough unique id's for about 4000 days, which may > or may not be enough for you. One thing I'd say is that querying a table with > 4 billion rows with 500 columns is going to require a hefty machine. > > Using a bigint (int8) means practically unlimited expansion, but handling int8 > is slower than int4 on a 32-bit system. Details will depend on your hardware, > and on 64-bit systems I'd expect no difference (but test - I haven't). > > Having said that, I can't believe the impact is relevant when compared with > the overhead of 500 columns. Yes, I agree. I guess I just missed the obvious. > > > Another option is that I have 3 fields that when combine, make each > > record unique. Is there some way I can combine these dynamically and > > then use "views" of them to reference my records and display them from > > the larger table. > > So - those 3 fields are a candidate key, and in the absence of any other info > are your primary key. A first design would be to repeat these 3 fields in > each table (if you split the 500 column one) and then if performance sucks > add a sequence to the "core" table and use that as a key instead. > > I'd think it's unlikely that your table should (in design terms) be over 500 > columns wide. If lots of these values are defaults or nulls then that's a > sign that things need to be split up. > > If the 20 columns you mention being queried against are logically similar then > that should form your foo_core table. If splitting the columns into logical > groups can be done, that's where I'd start. If nothing else, a logical split > is less likely to need changes in the future - an important consideration > once you have a few hundred million records. > > >From a performance point of view, I'd think it would be helpful to have your 4 > query columns in one table and your 20 results columns in one table (possibly > the same table). However, if they aren't logically related I'd be tempted to > still split things logically and set up a separate foo_query table and > maintain it using triggers. I agree with the design terms that a table should reach 500+ columns. But, they track a ton of data about every cell phone call. Yes, a lot of the fields are zeros, but they track things like electronics, etc and will only show data if something is wrong. I have done testing with placing the 20 coulmns in a separate table and searching on that table is lightening fast on the same search as above that took 35 seconds. It was so fast I had to have my code run the query 100 times in order to get a vaild Benchmark time from the Perl module!! My main problem is the best way to track the data once it is broken up into separate tables. Possible solutions I have are: 1) Copy the 3 "unique fields" into every table (as you mention above). My only question with this is how am I to do a join on a multicolumn primary key? The only thing I can think of would be something like: SELECT * FROM data_index, data_tbl WHERE data_index.id1 = data_tbl.id1 AND data_index.id2 = data_tbl.id2 AND data_index.id3 = data_tbl.id3 Seems inefficient to me to do it this way because of the 3 joins. Maybe I'm wrong? 2) Put a sequence on the data_index (20 fields) and the data_tbl (all other fields), storing one hour of data per table. The sequence would be a int4 on each table. The appropriate data would be inserted in their respective tables and the sequences would be kept on the same number. I have a few questions on the sequence issue. My data is only inserted once. No updates. I want to be able to store and search as much data as I can. There are 3 ways I see to store this data: 1) I can have a master index table where the 20 fields will be stored for all hours. This will become a large table quickly, but will only be 20 fields long. This will allow me to query one table for everything, but when data from the the other 480 fields is needed I'd have to somehow link the data from the master index table to the appropriate data table....could get messy. 2) I can have a table to keep up with the date, time, hourly index table, and hourly data table. I can use this as an aid in building my unioned query. The tables are named with date and hour in them. Assuming the table looks like: date date, time time, hr_idx_tbl char(16), hr_data_tbl char(16) Building a [union] query could be like: SELECT hr_idx_tbl FROM master_tbl_idx WHERE CONCAT(DATE," ",TIME) BETWEEN ("2002/10/23 04:00:00" AND "2002/10/23 09:00:00") The results would be index tables that I'd need to query. I'd then generate a select for each table, and then union them together. 3) I can do the above, but store the data in tables by day instead of hours. Ok, I think I've explained it as much as I can. On other note, how would I keep the sequences the same in both the index table and the data table? In a transaction, can I insert the 480 fields using the nextval() for the data table index field and then have another insert for the 20 fields and use currval() for the index table index field? Any other suggestions on how I'd do this? With all that said, I'd appreciate any help you can offer. Any help is appreciated! Kevin kold@carolina.rr.com
pgsql-general by date: