Thread: Matching unique primary keys
Hello all,

I have a table of data that is very wide (502 columns), and whenever I query it I only search on up to 4 columns, with on average only 20 fields returned. I've come up with a way to increase the speed of the queries: I could put the 20 fields that are searched and returned into a separate table and link them back to the full records by a unique id (there's a sketch below). Sounds easy... but I'm stuck on something.

I am dealing with about 1 million records per day. One option is to put a sequence on the tables, but with so many records I'd have to use a BIGINT, and with a ceiling of 9223372036854775807 it seems to me that numbers this large might slow down the query process. I realize it would take quite a while to get to that point, but I'd like others' opinions.

Another option: I have 3 fields that, when combined, make each record unique. Is there some way I can combine these dynamically and then use "views" of them to reference my records and display them from the larger table?

I realize this is very confusing - please get back to me if I should elaborate on anything. Any help is appreciated.

Thanks,
Kevin
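The split Kevin describes might look something like this - a minimal sketch only, with all names invented and a single stand-in column in place of the real fields:

    CREATE SEQUENCE call_seq;

    -- narrow table holding the ~20 searched/returned fields
    CREATE TABLE call_index (
        id      int8 DEFAULT nextval('call_seq') PRIMARY KEY,
        cell_id int4    -- stand-in for the searched/returned fields
    );

    -- wide table holding the remaining ~480 fields, linked by the same id
    CREATE TABLE call_data (
        id     int8 REFERENCES call_index (id),
        extra1 int4     -- stand-in for the remaining columns
    );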
On Tuesday 22 Oct 2002 9:12 pm, Kevin Old wrote:
> I have a table of data that is very wide (502 columns), and whenever I
> query it I only search on up to 4 columns, with on average only 20
> fields returned. ...
>
> I am dealing with about 1 million records per day.

Just to clarify - that's a table that's expanding by 1 million rows a day?

> One option is to put a sequence on the tables, but with so many records
> I'd have to use a BIGINT ...

Well, an int4 would give you enough unique ids for about 2,000 days (roughly double that if you let the sequence use the negative half of the range too), which may or may not be enough for you. One thing I'd say is that querying a table with a couple of billion rows and 500 columns is going to require a hefty machine.

Using a bigint (int8) means practically unlimited expansion, but handling int8 is slower than int4 on a 32-bit system. Details will depend on your hardware, and on a 64-bit system I'd expect no difference (but test - I haven't).

Having said that, I can't believe the impact is relevant when compared with the overhead of 500 columns.

> Another option: I have 3 fields that, when combined, make each record
> unique. Is there some way I can combine these dynamically and then use
> "views" of them to reference my records ...

So - those 3 fields are a candidate key, and in the absence of any other info they are your primary key. A first design would be to repeat these 3 fields in each table (if you split the 500-column one) and then, if performance sucks, add a sequence to the "core" table and use that as the key instead (there's a sketch below).

I'd think it's unlikely that your table should (in design terms) be over 500 columns wide. If lots of these values are defaults or nulls, that's a sign that things need to be split up.

If the 20 columns you mention being queried against are logically similar, they should form your foo_core table. If splitting the columns into logical groups can be done, that's where I'd start. If nothing else, a logical split is less likely to need changes in the future - an important consideration once you have a few hundred million records.

From a performance point of view, I'd think it would be helpful to have your 4 query columns in one table and your 20 result columns in one table (possibly the same table). However, if they aren't logically related, I'd still be tempted to split things logically and set up a separate foo_query table maintained by triggers.

> I realize this is very confusing - please get back to me if I should
> elaborate on anything.

It's difficult for anyone on the list to give specific advice without knowing your hardware, performance requirements and schema details. If my post hasn't helped, try another message with more details.

Bruce Momjian has written some notes on the internals of PG - how it accesses disk blocks and interacts with the OS disk cache. They're either in the developers' section of the website or on techdocs.postgresql.org - can't remember which, sorry. They might give you something to think about while you're looking at this problem.

-- Richard Huxton
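A minimal sketch of the core/detail split Richard describes, assuming invented names for the three key fields and stand-in columns for the real ones:

    -- core table: the searched/returned columns, keyed by the
    -- natural 3-column candidate key
    CREATE TABLE foo_core (
        switch_id int4,
        call_time timestamp,
        call_ref  int4,
        cell_id   int4,    -- stand-in for the ~20 queried columns
        PRIMARY KEY (switch_id, call_time, call_ref)
    );

    -- one of several logical groupings of the remaining columns,
    -- repeating the same 3-column key
    CREATE TABLE foo_electronics (
        switch_id  int4,
        call_time  timestamp,
        call_ref   int4,
        fault_code int4,   -- stand-in for one logical group of columns
        FOREIGN KEY (switch_id, call_time, call_ref)
            REFERENCES foo_core (switch_id, call_time, call_ref)
    );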
Richard,

Thanks for your reply. I realize how confusing my post sounds, so I will attempt to clear things up.

On Wed, 2002-10-23 at 07:26, Richard Huxton wrote:
> Just to clarify - that's a table that's expanding by 1 million rows a
> day?

Well, I am battling two solutions. The first option is to do just that: store everything in one huge table. The downfall is much slower queries as the data grows. The other option (the one I'm currently using) is to store each hour's data in a table of its own and then, when querying, union the queries together (if the query spans multiple hours). The only downfall I can see is the increasing number of queries unioned as the hourly search window [the user selects] grows. On average I store 50,000 - 150,000 records per hour.

The type of data I am storing is cellular telephone records. I receive a data feed of records in pipe-delimited format, which I am to make searchable to users via a web interface. While I have this working, the search time for a few records over 2 hours is around 35 seconds. We are running a Sun 450 with four 800MHz processors and 2GB of RAM.

> Having said that, I can't believe the impact is relevant when compared
> with the overhead of 500 columns.

Yes, I agree. I guess I just missed the obvious.
> I'd think it's unlikely that your table should (in design terms) be
> over 500 columns wide. If lots of these values are defaults or nulls,
> that's a sign that things need to be split up.

I agree that, in design terms, a table shouldn't reach 500+ columns. But they track a ton of data about every cell phone call. Yes, a lot of the fields are zeros, but they track things like electronics, etc., and will only show data if something is wrong.

I have done testing with placing the 20 columns in a separate table, and searching on that table is lightning fast on the same search as above that took 35 seconds. It was so fast I had to have my code run the query 100 times in order to get a valid timing out of the Perl Benchmark module!!

My main problem is the best way to track the data once it is broken up into separate tables. Possible solutions I have are:

1) Copy the 3 "unique fields" into every table (as you mention above). My only question with this is how I am to do a join on a multicolumn primary key. The only thing I can think of would be something like:

    SELECT *
    FROM data_index, data_tbl
    WHERE data_index.id1 = data_tbl.id1
      AND data_index.id2 = data_tbl.id2
      AND data_index.id3 = data_tbl.id3

It seems inefficient to me to do it this way because of the three join conditions. Maybe I'm wrong?

2) Put a sequence on the data_index (20 fields) and the data_tbl (all other fields), storing one hour of data per table. The sequence would be an int4 on each table. The appropriate data would be inserted into their respective tables and the sequences would be kept on the same number.

I have a few questions on the sequence issue. My data is only inserted once - no updates. I want to be able to store and search as much data as I can. There are 3 ways I see to store this data:

1) I can have a master index table where the 20 fields will be stored for all hours. This will become a large table quickly, but will only be 20 fields wide. It allows me to query one table for everything, but when data from the other 480 fields is needed I'd have to somehow link the data from the master index table to the appropriate data table... could get messy.

2) I can have a table to keep up with the date, time, hourly index table, and hourly data table, and use it as an aid in building my unioned query. The tables are named with date and hour in them. Assuming the table looks like:

    date        date,
    time        time,
    hr_idx_tbl  char(16),
    hr_data_tbl char(16)

building a [union] query could be like:

    SELECT hr_idx_tbl
    FROM master_tbl_idx
    WHERE (date + time) BETWEEN '2002/10/23 04:00:00'
                            AND '2002/10/23 09:00:00'

The results would be the index tables I need to query. I'd then generate a select for each table and union them together.

3) I can do the above, but store the data in tables by day instead of by hour.

Ok, I think I've explained it as much as I can.

On another note, how would I keep the sequences the same in both the index table and the data table? In a transaction, can I insert the 480 fields using nextval() for the data table index field and then have another insert for the 20 fields using currval() for the index table index field?
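Something like this is what I have in mind - a sketch only, with the sequence and column names made up:

    BEGIN;
    -- the 480-field table draws the new id from the sequence
    INSERT INTO data_tbl (idx, extra1)
        VALUES (nextval('rec_seq'), 0);
    -- the 20-field index table reuses the same id
    INSERT INTO data_index (idx, cell_id)
        VALUES (currval('rec_seq'), 1234);
    COMMIT;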
Any other suggestions on how I'd do this? With all that said, I'd appreciate any help you can offer.

Kevin
kold@carolina.rr.com
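For reference, the per-hour union Kevin describes generates queries of this shape (hourly table names invented):

    SELECT cell_id, call_start FROM idx_2002102304 WHERE cell_id = 1234
    UNION ALL
    SELECT cell_id, call_start FROM idx_2002102305 WHERE cell_id = 1234;

UNION ALL is the cheaper form here: rows from different hourly tables cannot duplicate one another, so the duplicate-eliminating sort that a plain UNION performs is wasted work.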
On Wednesday 23 Oct 2002 4:39 pm, Kevin Old wrote:
> Thanks for your reply. I realize how confusing my post sounds, so I
> will attempt to clear things up.

Not confusing, but perhaps a little short. It's a tricky balancing act though - 8K of problem and people will skip past.

> On average I store 50,000 - 150,000 records per hour.
>
> The type of data I am storing is cellular telephone records. ... While
> I have this working, the search time for a few records over 2 hours is
> around 35 seconds.

Yuck - are these aggregate queries ("show me the totals per hour") or detail queries ("give me a list of all calls between 0800 and 1130")?

> We are running a Sun 450 with four 800MHz processors and 2GB of RAM.

Good - someone will know enough about that setup to comment usefully. Me, sorry, no.

> I agree that, in design terms, a table shouldn't reach 500+ columns.
> But they track a ton of data about every cell phone call. Yes, a lot
> of the fields are zeros, but they track things like electronics, etc.,
> and will only show data if something is wrong.

If 99% of your data is saying nothing, don't record it. Split it out to a separate table and only record a row where there's something useful. If you had to do it by hand, watching a big bank of lights and writing things down in a book, you'd only bother if they were red.

> I have done testing with placing the 20 columns in a separate table,
> and searching on that table is lightning fast on the same search as
> above that took 35 seconds.

Right - so you've got your solution then.

> 1) Copy the 3 "unique fields" into every table (as you mention above).
> My only question with this is how I am to do a join on a multicolumn
> primary key. ...
>
> It seems inefficient to me to do it this way because of the three join
> conditions. Maybe I'm wrong?

No - that's what you do. It's slower than joining on a single int4, but the big advantage is that each table then means something by itself. The "secondary" tables can be split out/exported/whatever, and you can see at a glance what they refer to. Having said that, 3 columns means more typing, and plenty of people replace multi-column keys with an auto-incrementing ID. Keep the unique index on the 3 columns though.
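A sketch of that layout, reusing Kevin's table names, with a single stand-in column in place of the real 20 and 480:

    CREATE TABLE data_index (
        rec_id  serial PRIMARY KEY,  -- surrogate key used for joins
        id1     int4 NOT NULL,
        id2     int4 NOT NULL,
        id3     int4 NOT NULL,
        cell_id int4,                -- stand-in for the ~20 searched columns
        UNIQUE (id1, id2, id3)       -- the natural key stays unique
    );

    CREATE TABLE data_tbl (
        rec_id int4 REFERENCES data_index (rec_id),
        extra1 int4                  -- stand-in for the ~480 other columns
    );

    -- the join then collapses to a single condition
    SELECT *
    FROM data_index i, data_tbl d
    WHERE d.rec_id = i.rec_id;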
> 2) Put a sequence on the data_index (20 fields) and the data_tbl (all
> other fields), storing one hour of data per table. ...
>
> 1) I can have a master index table where the 20 fields will be stored
> for all hours. ... It allows me to query one table for everything, but
> when data from the other 480 fields is needed I'd have to somehow link
> the data from the master index table to the appropriate data table...
> could get messy.

Not necessarily - if there are patterns to the queries, you can create a view so it looks like a simple query to the application (there's a sketch at the end of this message). SQL is forgiving about whitespace too, so laying out a query over several lines with indents helps a lot.

BTW - 480 fields still seems very wide. Why not split it into half a dozen tables based on what the fields mean?

> 2) I can have a table to keep up with the date, time, hourly index
> table, and hourly data table, and use it as an aid in building my
> unioned query. ... The results would be the index tables I need to
> query. I'd then generate a select for each table and union them
> together.

If this is historical data, I'd go for two tables - "recent" and "old". Most queries will be on the recent table (because that's how you're defining recent), and if you need to go further back there are only two tables to union. Of course, your situation may vary.

> 3) I can do the above, but store the data in tables by day instead of
> by hour.

You'll have to test. It's going to depend on whether people query for specific hours or for totals for January.

> On another note, how would I keep the sequences the same in both the
> index table and the data table? In a transaction, can I insert the 480
> fields using nextval() for the data table index field and then have
> another insert for the 20 fields using currval() for the index table
> index field? Any other suggestions on how I'd do this?

That's the way. Note that you could do several of these in one transaction, which would help with performance on inserts.

BTW - when I mentioned splitting the tables, I didn't mean into 20+480 instead of 500, I meant 20+40+23+56... etc. according to the *meaning* of the data. If much of the data really isn't worth recording most of the time, I'd probably have something like:

    CREATE TABLE foo_extras (     -- name it whatever fits
        id1        int4,
        id2        whatever,      -- whatever type your second key field is
        id3        int4,
        item_name  varchar(64),
        item_value varchar(16)
    );

That means you don't store any data you don't need.

PS - another idea you may want to look into is pre-calculating totals per hour (or whatever), if that's the sort of thing your users are interested in.
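The view Richard mentions might look something like this, reusing the invented names from the sketch earlier in the thread:

    CREATE VIEW call_full AS
        SELECT i.id1, i.id2, i.id3, i.cell_id, d.extra1
        FROM data_index i, data_tbl d
        WHERE d.rec_id = i.rec_id;

    -- the application then queries what looks like one simple table
    SELECT * FROM call_full WHERE cell_id = 1234;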
PPS - it might be that you don't want a database at all. If all you're doing is looking up a few rows based on simple criteria, a flat-file system might be better.

-- Richard Huxton
On 23 Oct 2002 at 12:39, Kevin Old wrote:
> 1) Copy the 3 "unique fields" into every table (as you mention above).
> My only question with this is how I am to do a join on a multicolumn
> primary key. The only thing I can think of would be something like:
>
>     SELECT *
>     FROM data_index, data_tbl
>     WHERE data_index.id1 = data_tbl.id1
>       AND data_index.id2 = data_tbl.id2
>       AND data_index.id3 = data_tbl.id3

Well, you can store the 3 fields in one table, declare the combination of them as the primary key, and add a fourth field that is sequence-generated. Then use that sequence number as the key in the rest of the tables. It will reduce the IO requirement by oodles.

HTH

Bye
Shridhar

--
ASCII: The control code for all beginning programmers and those who would
become computer literate. Etymologically, the term has come down as a
contraction of the often-repeated phrase "ascii and you shall receive."
-- Robb Russon
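A sketch of the layout Shridhar suggests, with invented names:

    -- the natural 3-column key is stored exactly once,
    -- next to a sequence-generated surrogate
    CREATE TABLE rec_keys (
        id1    int4 NOT NULL,
        id2    int4 NOT NULL,
        id3    int4 NOT NULL,
        rec_id serial UNIQUE,
        PRIMARY KEY (id1, id2, id3)
    );

    -- every other table carries only the narrow int4 surrogate
    CREATE TABLE rec_details (
        rec_id int4 REFERENCES rec_keys (rec_id),
        extra1 int4    -- stand-in for the wide payload
    );

Repeating a single int4 instead of three wider columns in every table is where the IO saving comes from.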