Thread: Query questions
I'm building a database containing key parameters for ~500,000 data
files. The design I found logical is two tables for each file:

1) Larger table with detailed key parameters
   (10-15 columns, ~1000 rows), call it large_table
2) Small table with file summary
   (~30 columns, 1 row), call it small_table

A major purpose is to gain the ability to identify all data in the
original archive (~3 terabyte) satisfying some criteria that can be
specified in the database queries. But I find the query options
surprisingly limited (or difficult - sure, things can be done if you
write enough code...). Here is what I would like to be possible:

SELECT <large_table columns> FROM <regular expression>
    WHERE <condition on large_table>
    IF <condition on corresponding small_table>;

In words: for each of many large_tables, check the corresponding
small_table and only process the large_table if some condition is met.

Two problems:

1) Can't use a regular expression to specify tables. Wouldn't that be
   nice? Is it possible to specify a lot of tables without having to
   write custom functions in e.g. plpgsql?
2) No IF statement. How do I issue commands (or redesign the database)
   so I can check the file summary before diving into the large table?

This ought to be simple... Any input will be highly appreciated, thanks
in advance!

Poul Jensen
On Sat, 2005-09-03 at 00:59 -0800, Poul Jensen wrote:
> I'm building a database containing key parameters for ~500,000 data
> files. The design I found logical is two tables for each file:
> 1) Larger table with detailed key parameters
>    (10-15 columns, ~1000 rows), call it large_table
> 2) Small table with file summary
>    (~30 columns, 1 row), call it small_table

you want to create 1 million tables, all with one of 2 schemas?

why not just 2 tables, each with the additional "file" column?

> ...
> SELECT <large_table columns> FROM <regular expression>
>     WHERE <condition on large_table>
>     IF <condition on corresponding small_table>;

this would then be something like:

SELECT <large_table columns> FROM large_table
    WHERE file ~ <regular expression>
    AND <condition on large_table>
    AND <subquery involving small_table>

gnari
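[A concrete sketch of the two-table layout and subquery shape gnari describes. Column names (stat_id, tmp, avg_tmp) are illustrative, borrowed from the meteorological example later in the thread; the EXISTS subquery is one way to express "condition on the corresponding small_table":]

```sql
-- One row per observation, keyed by the source file
CREATE TABLE large_table (
    file    text,       -- which data file the row came from
    stat_id smallint,
    tmp     real
    -- ... remaining detail columns
);

-- One row per file
CREATE TABLE small_table (
    file    text PRIMARY KEY,
    avg_tmp real
    -- ... remaining summary columns
);

-- "Only process large_table rows whose file summary meets a condition"
SELECT l.file, l.stat_id, l.tmp
FROM large_table l
WHERE l.file ~ '^78_'                       -- regex on the file name
  AND l.tmp > 15.0                          -- condition on the detail data
  AND EXISTS (SELECT 1 FROM small_table s
              WHERE s.file = l.file
                AND s.avg_tmp > 10.0);      -- condition on the summary
```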
Thank you very much for your response! It leads to another couple of
questions:

>> I'm building a database containing key parameters for ~500,000 data
>> files. The design I found logical is two tables for each file:
>> 1) Larger table with detailed key parameters
>>    (10-15 columns, ~1000 rows), call it large_table
>> 2) Small table with file summary
>>    (~30 columns, 1 row), call it small_table
>
> you want to create 1 million tables, all with one of 2 schemas?

I started out with a schema for each file, thinking I could utilize the
schema structure in queries, but I don't see how. Schemas are useful for
grouping tables according to users/owners. Other than that, do they add
anything but a dot in the table name?

> why not just 2 tables, each with the additional "file" column?
>
>> ...
>> SELECT <large_table columns> FROM <regular expression>
>>     WHERE <condition on large_table>
>>     IF <condition on corresponding small_table>;
>
> this would then be something like:
>
> SELECT <large_table columns> FROM large_table
>     WHERE file ~ <regular expression>
>     AND <condition on large_table>
>     AND <subquery involving small_table>

The large_table would have ~500 million rows, each of which would have
to be checked for the first condition (pattern matching above). With
separate tables there are "only" ~500,000 initial checks to do. Also, I
don't see how to construct that small_table subquery. If it is possible
I would love to know how! Can you (or anybody else) give an example?

Thank you,
Poul
Poul Jensen <flyvholm@gfy.ku.dk> writes:

>> you want to create 1 million tables, all with one of 2 schemas?
>
> I started out with a schema for each file, thinking I could utilize
> the schema structure in queries, but I don't see how. Schemas are
> useful for grouping tables according to users/owners. Other than that,
> do they add anything but a dot in the table name?

The word "schema" is a bit overloaded--here it means "table
specification" rather than "namespace".

A general rule of database design is: if you end up creating multiple
tables with the same schema, you're doing it wrong--they should all be
one table with an additional key (usually indexed) to distinguish the
data.

-Doug
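[Applied to this thread, the rule means one detail table with an indexed file key instead of 500,000 per-file tables. A minimal sketch; the column names (mn, tmp, wind) are taken from the fictive example later in the thread:]

```sql
CREATE TABLE observations (
    file text,      -- the distinguishing key: which data file
    mn   smallint,
    tmp  real,
    wind real
);

-- Index the key so per-file lookups don't scan all ~500M rows
CREATE INDEX observations_file_idx ON observations (file);

SELECT * FROM observations WHERE file = '78_f6548.dat';
```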
>> The word "schema" is a bit overloaded--here it means "table
>> specification" rather than "namespace".
>>
>> A general rule of database design is: if you end up creating multiple
>> tables with the same schema, you're doing it wrong--they should all
>> be one table with an additional key (usually indexed) to distinguish
>> the data.
>>
>> -Doug

Thank you for clarifying - that had me confused!

Would the reason for this general rule be that if you combine
same-schema tables, you can access them all with one query? At least
that would be the motivation in my case, but it isn't free. It would
cost space (500 million entries for that extra key, plus index) and
query time (the 1st condition now needs to be checked on 500 million
entries, of which "only" the 500,000 unique ones should have to be
checked - indexing helps, but there must be a loss).

If this is really necessary, it's because it's so unbelievably hard to
access many tables in one query that I'm thinking the developers
maintain restrictions on query options in order to force certain
database designs - your general rule above. If so, what is the
advantage? I see only disadvantages (a pain, to be honest).

Thank you,
Poul
Poul Jensen wrote:

>> The word "schema" is a bit overloaded--here it means "table
>> specification" rather than "namespace".
>>
>> A general rule of database design is: if you end up creating multiple
>> tables with the same schema, you're doing it wrong--they should all
>> be one table with an additional key (usually indexed) to distinguish
>> the data.
>
> Thank you for clarifying - that had me confused!
>
> Would the reason for this general rule be that if you combine
> same-schema tables, you can access them all with one query?

No. It is so that the single query becomes manageable. It also greatly
simplifies database maintenance.

> At least that would be the motivation in my case, but it isn't free.
> It would cost space (500 million entries for that extra key, plus
> index) and query time (the 1st condition now needs to be checked on
> 500 million entries, of which "only" the 500,000 unique ones should
> have to be checked - indexing helps, but there must be a loss).

I suppose that depends on how the data is distributed in the table. For
the record, I don't think that 500M records should be that problematic.

Now.... Suppose you need to further partition this (best to wait until
8.1 for performance reasons). You could partition it into, say, 10
subtables, and use inheritance and check constraints to make this
transparent to the application and perform reasonably well.

> If this is really necessary, it's because it's so unbelievably hard
> to access many tables in one query that I'm thinking the developers
> maintain restrictions on query options in order to force certain
> database designs - your general rule above. If so, what is the
> advantage? I see only disadvantages (a pain, to be honest).

Look into inheritance. It makes this easier. However, I don't care
which RDBMS you use, management of 1000 identical tables is going to be
a real pain and I think that everyone here will probably suggest that
it is not exactly a sane thing to do.

Best Wishes,
Chris Travers
Metatron Technology Consulting
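[A minimal sketch of the inheritance-plus-check-constraint partitioning Chris describes, with illustrative table and column names. The CHECK constraints are what lets the 8.1 planner skip partitions that cannot match (constraint exclusion, the "wait until 8.1" feature):]

```sql
-- Parent table: queries against it transparently see all children
CREATE TABLE observations (
    stat_id smallint,
    yr      smallint,
    tmp     real,
    wind    real
);

-- Child tables partitioned by year; the CHECK constraint tells the
-- planner which partition can hold which rows
CREATE TABLE observations_2000 (CHECK (yr = 2000)) INHERITS (observations);
CREATE TABLE observations_2001 (CHECK (yr = 2001)) INHERITS (observations);

-- With constraint exclusion enabled, this scans only observations_2001
SELECT avg(tmp) FROM observations WHERE yr = 2001;
```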
> Look into inheritance. It makes this easier. However, I don't care
> which RDBMS you use, management of 1000 identical tables is going to
> be a real pain and I think that everyone here will probably suggest
> that it is not exactly a sane thing to do.

Thank you, Chris. I have omitted two important points. One is: the
database needs no maintenance. Once created, nothing in it will be
modified - there'll only be SELECT queries to look up data satisfying
certain criteria.

To summarize the task: I have ~500,000 data files containing ~1,000
records each. The database should then contain:

1) A detailed table (~1,000 rows, ~15 columns) for each file
2) A small table with file summary (1 row, ~30 columns) for each file

The typical query should then check file summaries to identify which
detailed tables/subtables to check for further conditions.

The other important point I forgot about (much worse): the detailed
table for each file is created with a number of columns which, for most
files, have the same value for all records. I had planned to delete
columns containing only a single value and push that value to the file
summary instead. Hence, the 500,000 detailed tables wouldn't all be
identical. The initial check on the file summary would determine
whether the detailed table for a given file should be searched and, if
so, what columns are found in it.

I guess I could either:

1) Add a lot of redundant data to the database so tables can be
   combined, allowing single-query searches. Judging from the size of
   my test database I would end up with ~200 GB without redundant data,
   so I do consider this a problem.
2) Write code (e.g. in C) with a loop to do separate queries for every
   file - I imagine this would be terribly inefficient compared to the
   single-query option.

Question: If I create a table with all the columns needed to combine
the 500,000 tables and I simply omit the redundant columns when
entering the subtable for a given file, will PostgreSQL fill in values,
not allowing me to save any space? Would I have to make an inheritance
hierarchy to save the space? If so, all columns not in the parent table
can only be accessed via separate, child-table-specific queries, which
is troublesome... :-|

Help!

Poul
I have ~500,000 data files each containing ~1,000 records that I want
to put into a database for easy access. Fictive example for
illustration: file w. meteorological data from a given station.

 stat_id |  yr  | d_o_y | hr | mn | tmp  | wind
---------+------+-------+----+----+------+------
      78 | 2001 |   211 | 14 |  0 | 15.3 |  4.7
      78 | 2001 |   211 | 14 |  1 | 15.4 |  5.4
      78 | 2001 |   211 | 14 |  2 | 15.4 |  5.0
      .. | .... |   ... | .. |  . | .... |  ...
      78 | 2001 |   211 | 14 | 24 | 15.8 |  2.7

Some parameters are constant for any file, e.g. station id. Others are
constant for a varying percentage of the files (hr ~58%, d_o_y ~98%,
yr ~99.995%, due to the oddly chosen 25 min. per file).

It seems like a table with file summaries (1 row for each file) would
be very useful. Some of the columns/rows could be:

     file     | st_id |  yr  | d_o_y | avg_tmp | std_dev_tmp
--------------+-------+------+-------+---------+-------------
 78_f6548.dat |    78 | 2001 |   211 |    15.5 |        0.24
 57_f4296.dat |    57 | 2000 |    -1 |     8.3 |        0.11
 ...          |   ... |  ... |   ... |     ... |         ...

(-1 for day of year indicates an odd file spanning across midnight)

To store the detailed records, the SQL novice would construct one table
per file and exclude any constant columns, since these are redundant
(given in the summary table). These detailed tables would then have
different column schemas - (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind) for
~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and (yr,d_o_y,hr,mn,tmp,wind) for
~0.005%. SQL will discourage this novice. To enable simple queries,
data tables must be combined.

Say I want to find all summer data with high temperature standard
deviation and low wind. If I made the ultimate table for easy queries,
containing all columns for both detailed data and file summary, I could
find the data like this:

SELECT file, std_dev_tmp, tmp, wind FROM huge_table
    WHERE std_dev_tmp > 0.2
    AND d_o_y > 150 AND d_o_y < 240
    AND wind < 4.0

The ability to make this easy query comes with a ridiculous amount of
redundant data in huge_table.

In order to remove all the stuff that is already given by the summary
table, I must split the huge table into a number of tables; one for
each of the above-mentioned column schemas (in my case many more than
in the example above). Difficult to build, and no easy queries anyway -
I'd need to write a C program to launch the many queries that would in
effect return the same data as the single query above. Defeats the
purpose of easy, efficient access.

This novice must be missing a sneaky way to avoid massive redundancy
and still maintain easy access. I've been suggested to look at
inheritance and foreign keys. Foreign keys I don't see how to use, but
I could make an inheritance hierarchy of the split-up tables to make
the parent table columns (mn,tmp,wind in the above example) available
across tables. But accessing the rest of the columns still requires
child-table-specific queries, and my many child tables will not be
nicely nested as in the above example, so I still don't see how it can
get me there. I could really use some input.

One thought: PostgreSQL adds to every table a system column tableoid
containing a constant value. Is that value really stored 1,000 times
for a 1,000-row table? Or...?

(Apologies for the length of this post)
On Thu, Sep 08, 2005 at 19:58:55 -0800, Poul Jensen <flyvholm@gfy.ku.dk> wrote:
> This novice must be missing a sneaky way to avoid massive redundancy
> and still maintain easy access. I've been suggested to look at
> inheritance and foreign keys. Foreign keys I don't see how to use,
> but I could make an inheritance hierarchy of the split-up tables to
> make the parent table columns (mn,tmp,wind in above example)
> available across tables. But accessing the rest of the columns still
> requires child-table-specific queries, and my many child tables will
> not be nicely nested as in above example, so I still don't see how it
> can get me there. I could really use some input.

You can also use views. Updatable views can be created using the rule
system if you need them.
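[For instance, a view can hide the join against the summary table so queries read like the single "huge_table" query from the earlier post. A sketch with illustrative table names (observations, file_summary):]

```sql
CREATE VIEW obs_with_summary AS
SELECT o.file, o.mn, o.tmp, o.wind,
       s.st_id, s.yr, s.d_o_y, s.std_dev_tmp
FROM observations o
JOIN file_summary s USING (file);

-- The "easy query" from the earlier post, unchanged except the name:
SELECT file, std_dev_tmp, tmp, wind FROM obs_with_summary
    WHERE std_dev_tmp > 0.2
    AND d_o_y > 150 AND d_o_y < 240
    AND wind < 4.0;
```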
Poul Jensen <flyvholm@gfy.ku.dk> writes:
> One thought: PostgreSQL adds to every table a system column tableoid
> containing a constant value. Is that value really stored 1,000 times
> for a 1,000-row table? Or...?

No, tableoid is sort of a virtual column ... it doesn't exist on disk.
When you query it you get a value fetched from the internal data
structure representing the table.

There is plenty of *real* overhead per row, mind you, but tableoid
isn't part of it. You can read some details here:
http://developer.postgresql.org/docs/postgres/storage-page-layout.html

			regards, tom lane
> To store the detailed records, the SQL novice would construct one
> table per file and exclude any constant columns since these are
> redundant (given in the summary table). These detailed tables would
> then have different column schemas - (mn,tmp,wind) for ~58%,
> (hr,mn,tmp,wind) for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and
> (yr,d_o_y,hr,mn,tmp,wind) for ~0.005%.

I don't understand this at all - why would you have one table per file?
Why not effectively append all of your per-file tables into one huge
table, with an additional column indicating which file the data comes
from? Then you have only two tables: one with a row for each
observation in your data, one with a row for each file. Some queries
need to join the two tables, but that's not a big deal. There may be a
lot of redundancy in your data, but that's life. If it's not
deterministic redundancy (e.g., file ID => station ID), then I would
just live with it.

It also seems to me that you may be tying your schema design too
closely to the current way that the data is represented. Do you really
need to have the data file figure so prominently in your design? If you
do need to keep track of which file an observation came from, I might
have one huge table of observations, each with an observation ID, and
have another table that maps filenames to ranges of observation IDs
(filename, startObs, endObs). A range query could then quickly find
which file is responsible for an observation ID.

Hmm, in fact if the redundant values you're worried about come in long
stretches (e.g., temperature is the same for many observations in a
row), I suppose you could do the same thing - map a constant value to
the range of observation IDs for which it holds. This gets back to
having many tables, though.

- John Burger
  MITRE
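[A sketch of the filename-to-range mapping John describes; the table name and the sample observation ID are illustrative:]

```sql
CREATE TABLE obs_files (
    filename text PRIMARY KEY,
    startObs integer,   -- first observation ID loaded from this file
    endObs   integer    -- last observation ID loaded from this file
);

-- Which file does observation 123456 come from?
SELECT filename FROM obs_files
    WHERE 123456 BETWEEN startObs AND endObs;
```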
On Thu, Sep 08, 2005 at 07:58:55PM -0800, Poul Jensen wrote:
> This novice must be missing a sneaky way to avoid massive redundancy
> and still maintain easy access. I've been suggested to look at
> inheritance and foreign keys. Foreign keys I don't see how to use,
> but I could make

What you seem to be looking for is a form of partitioning. PostgreSQL
doesn't currently support partitioning of this form, but there's work
in progress to change that.

As someone else mentioned, you could do it with a union all view.
http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
of this.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
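[A union-all view over the split-up detail tables might look like the sketch below. The table names are illustrative; the idea is that each child omits its constant columns, and the view re-supplies them (here from the summary table) so everything presents one uniform row shape:]

```sql
-- Two of the split-up detail tables, each omitting its constant columns
CREATE TABLE det_a (file text, mn smallint, tmp real, wind real);
CREATE TABLE det_b (file text, hr smallint, mn smallint, tmp real, wind real);

-- det_a's constant hr lives in the summary table; the view fills it in
CREATE VIEW all_obs AS
    SELECT d.file, s.hr, d.mn, d.tmp, d.wind
      FROM det_a d JOIN file_summary s USING (file)
    UNION ALL
    SELECT file, hr, mn, tmp, wind FROM det_b;
```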
Thank you for your input! Individual replies follow below.

##############################

Chris Travers wrote:
> Ok. Imagine two huge huge tables:
>
> file_data and additional_data
>
> create table file_data (
>     file_id serial primary key,
>     station_id text,
>     ....
> );
>
> create table additional_data (
>     data_id bigserial primary key,
>     file_id int REFERENCES file_data (file_id),
>     temp,
>     ....
> );

So the 500 million entries in additional_data.file_id are now
restricted to belong to the same group of values found in
file_data.file_id. But I don't see how this can get me rid of redundant
data - I still have 500M entries? Consider the column "year". My 500M
records come from 13 years. Saving 500M values when only 13 are
distinct is very redundant.

> Also you can normalize your way out of the redundant data problem.

I can tell how to do this for filename and station name - I save
something by saving an integer rather than a text string. But all the
rest of my columns are already smallints. Do I save anything by storing
"5" rather than "1997"?

> I would *highly* recommend waiting until 8.1 goes live to implement
> this in production.

It will be built/rebuilt after 8.1 is released.

##############################

John D. Burger wrote:
> Why not effectively append all of your per-file tables into one huge
> table, with an additional column indicating which file the data comes
> from? Then you have only two tables, one with a row for each
> observation in your data, one with a row for each file. Some queries
> need to join the two tables, but that's not a big deal.

That big table with a row for each observation will have to include all
columns, such as year and station_id. For such columns I store 500M
values of which only ~15 are distinct in my case - the redundancy I'm
trying to minimize. By splitting into smaller tables, many columns will
contain only one distinct value. It *should* be possible to delete such
columns and instead store their corresponding values in a "table
header". If I had one table per file, the table you suggest with one
row per file could effectively serve as the collection of table
headers.

> It also seems to me that you may be tying your schema design too
> closely to the current way that the data is represented. Do you
> really need to have the data file figure so prominently in your
> design?

The archive of data files is shared by a number of research groups
around the world. There's a whole software package that people use for
data analysis, accessing the data files as they are. So I expect a lot
of file-specific queries.

> Hmm, in fact if the redundant values you're worried about come in
> long stretches (e.g., temperature is the same for many observations
> in a row), I suppose you could do the same thing - map a constant
> value to the range of observation IDs for which it holds. This gets
> back to having many tables, though.

This is in effect normalization? But if the observation ID takes just
as much storage as the original value, have we gained anything? BTW,
I'm not aiming at redundancy in the measurements - this is minimal
compared to e.g. year and station ID. I do have an idea of how it
*should* be possible to get rid of much of it - read the response to
Tom Lane below.

##############################

Tom Lane wrote:
> No, tableoid is sort of a virtual column ... it doesn't exist on
> disk. When you query it you get a value fetched from the internal
> data structure representing the table.

So virtual columns are possible - THIS is a way to clear redundant
data! Is it possible for a user to create a virtual column? If not,
this would make a big improvement. What I really need are "partial
virtual columns". I'm imagining an alternative version of VACUUM
ANALYZE that could do the following:

1) Order the rows in the table so that for each column, identical
   values are placed next to each other for as far as possible (the
   row order that optimizes one column will probably not be optimal
   for other columns).
2) For each column, identify the stretches that contain only one
   distinct value. Save that value together with the IDs of the start
   and end rows and delete the stretch.

It is not obvious how to do a perfect optimization process in 1), at
least not to me - I'm sure a skilled mathematician would know exactly
how to do it. But here's a simple approach that would get us part of
the way:

1.1) Grab the column w. most redundancy (fewest distinct values) and
     sort it into groups according to the distinct values.
1.2) For each of these groups, grab the column w. next most redundancy
     and sort into groups according to the distinct values.

And so on. Stop whenever groups become so small that there's nothing
to gain.

Such an analysis would make it much less expensive to combine
same-schema tables, and having everything in the same table is really
convenient. It would obviously save a lot of storage space, but I
imagine it would enable more efficient queries too - having to check
just 3 values instead of the thousands (or even millions) they may
replace must give a considerable gain.

What is the big benefit of not having ordered rows? I imagine it could
be a disadvantage for dynamic databases, but for a static database
like mine, which won't be modified except for maybe adding new data
once a year, I imagine an optimization including row ordering could be
highly beneficial.

##############################

Jim C. Nasby wrote:
> What you seem to be looking for is a form of partitioning. PostgreSQL
> doesn't currently support partitioning of this form, but there's work
> in progress to change that.

Any idea how far out in the future this is? Would it make the
optimization process described above (reply to Tom Lane) obsolete?
Well, maybe my ideas about an optimal solution just illustrate lack of
knowledge about SQL, but I'm hoping somebody can see what I'm trying
to suggest.

> As someone else mentioned, you could do it with a union all view.
> http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an
> example of this.

Thank you - it does look as if some union all views could come in
handy.

##############################

Thanks and best regards,
Poul Jensen
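[The simple grouping approach in 1.1)/1.2) above amounts to physically sorting the combined table by its most redundant columns first. In stock PostgreSQL the closest approximation is to rewrite the table in that order; nothing gets deleted, but each column's identical values then fall into long stretches. A sketch with illustrative names and distinct-value counts:]

```sql
-- Rewrite the combined table sorted by fewest-distinct-values columns
-- first (here: ~13 years, ~15 stations, 366 days, 24 hours)
CREATE TABLE observations_ordered AS
SELECT * FROM observations
ORDER BY yr, stat_id, d_o_y, hr;
```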
>> Hmm, in fact if the redundant values you're worried about come in
>> long stretches (e.g., temperature is the same for many observations
>> in a row), I suppose you could do the same thing - map a constant
>> value to the range of observation IDs for which it holds. This gets
>> back to having many tables, though.
>
> This is in effect normalization? But if the observation ID takes just
> as much storage as the original value, have we gained anything? BTW,
> I'm not aiming at redundancy in the measurements - this is minimal
> compared to e.g. year and station ID.

I don't think this is exactly what people meant by normalization, but
maybe. My basic thought was that since you have some redundancy in
your data, you might want to use some form of compression. One of the
simplest forms of compression is called run-length encoding
(http://en.wikipedia.org/wiki/Run_length_encoding). So you'd have most
of your data in a main table:

    create table observations (
        obsID integer primary key,   -- Maybe a BIGINT
        temperature float,
        etc.
    );

and some other "compressed" tables for those features that have long
runs of repetitive values:

    create table obsYears (
        startObs integer primary key references observations (obsID),
        endObs   integer references observations (obsID),
        year     integer);

    create table obsStations (
        startObs  integer primary key references observations (obsID),
        endObs    integer references observations (obsID),
        stationID integer);

(Caution, I haven't checked these for syntax.) I've introduced an
observation ID, and then I have "compressed" tables that map =ranges=
of these IDs to values that are constant for long stretches. Each year
occupies only one row, same with each station. (I think your reply to
Tom may have been getting at something like this.)

Now you can do queries like this, say, for temperature statistics in a
particular year:

    select avg(temperature), stddev(temperature)
    from observations, obsYears
    where obsID between startObs and endObs
    and year = 2001;

You could join in other compressed tables in the same way. In fact,
you could glue them all together with a VIEW, and you'd be able to
treat the whole thing like one giant table, with much of the
redundancy removed.

Note that if you define indexes on the startObs and endObs columns,
PostgreSQL might avoid scanning through the compressed tables every
time you do a query. You might also benefit from a composite index on
(startObs, endObs). For features like year, which are massively
repetitive, this might even be faster than storing the feature in the
main table, since the compressed table will easily fit in memory.

So the basic idea is run-length encoding for repetitive values. I
think I can use this in some of my own data - I don't know why I never
thought of it before.

- John D. Burger
  MITRE
Thank you, John! I misunderstood you the first time, but I now see we
have the same thing in mind.

> So you'd have most of your data in a main table:
>
>     create table observations (
>         obsID integer primary key,   -- Maybe a BIGINT
>         temperature float,
>         etc.
>     );
>
> and some other "compressed" tables for those features that have long
> runs of repetitive values:
>
>     create table obsYears (
>         startObs integer primary key references observations (obsID),
>         endObs   integer references observations (obsID),
>         year     integer);
>
>     create table obsStations (
>         startObs  integer primary key references observations (obsID),
>         endObs    integer references observations (obsID),
>         stationID integer);
>
> (Caution, I haven't checked these for syntax.) I've introduced an
> observation ID, and then I have "compressed" tables that map =ranges=
> of these IDs to values that are constant for long stretches. Each
> year occupies only one row, same with each station.

One complication: applying the observation ID you're in effect
ordering the rows. If you order them chronologically, "year" is
perfectly lined up, giving you one row per value in your compressed
table, but e.g. "month" will be split up into n_years*12 stretches of
obsIDs, and "station_id" may not have any continuous stretches of
obsIDs at all. I don't see any solution to this, but better
compression can be achieved by ordering rows optimally when applying
the obsID. The reply to Tom Lane in my previous post suggested one way
to do this - it may not always be optimal, but at least it's simple.

> Now you can do queries like this, say, for temperature statistics in
> a particular year:
>
>     select avg(temperature), stddev(temperature)
>     from observations, obsYears
>     where obsID between startObs and endObs
>     and year = 2001;

This works! I had not yet realized how to make this connection between
two tables, so that was a major help - thank you.

> You could join in other compressed tables in the same way. In fact,
> you could glue them all together with a VIEW, and you'd be able to
> treat the whole thing like one giant table, with much of the
> redundancy removed.

That is exactly what I want, and now I finally see how to do it (I
think!). However, it is a considerable amount of work to set this up
manually, plus, it has been a headache realizing how to get there at
all. I'm hoping that one or more of the developers think it would be a
good idea for PostgreSQL to perform an internal table optimization
process using run-length encoding. Imagine you could just throw all
your data into one table, run OPTIMIZE TABLE and you'd be done. With
SQL being all about tables I'm surprised this idea (or something even
better) hasn't been implemented already.

Poul Jensen
> That is exactly what I want, and now I finally see how to do it (I
> think!). However, it is a considerable amount of work to set this up
> manually, plus, it has been a headache realizing how to get there at
> all. I'm hoping that one or more of the developers think it would be
> a good idea for PostgreSQL to perform an internal table optimization
> process using run-length encoding. Imagine you could just throw all
> your data into one table, run OPTIMIZE TABLE and you'd be done. With
> SQL being all about tables I'm surprised this idea (or something even
> better) hasn't been implemented already.

There was a recent brief thread here on storing timeseries data, where
the use of clustered indices for static tables was suggested. This
might also be useful in your situation...

Cheers,

Brent Wood
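[In PostgreSQL terms that means CLUSTER, which physically rewrites a table in the order of an index - worthwhile for a read-only table, since nothing later disturbs the order. A sketch with an illustrative index name (note the old, pre-8.3 syntax is CLUSTER indexname ON tablename):]

```sql
CREATE INDEX obs_time_idx ON observations (yr, d_o_y, hr, mn);

-- Physically rewrite the table to match the index order
CLUSTER obs_time_idx ON observations;

-- Range queries on the leading columns now touch contiguous pages:
SELECT avg(tmp) FROM observations
    WHERE yr = 2001 AND d_o_y BETWEEN 150 AND 240;
```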
On Sun, Sep 11, 2005 at 11:00:02PM -0800, Poul Jensen wrote:
> Tom Lane wrote:
>
> > No, tableoid is sort of a virtual column ... it doesn't exist on disk.
> > When you query it you get a value fetched from the internal data
> > structure representing the table.
>
> So virtual columns are possible - THIS is a way to clear redundant data!
> Is it possible for a user to create a virtual column? If not, this would
> make a big improvement.
>
> What I really need are "partial virtual columns". I'm imagining an
> alternative version of VACUUM ANALYZE that could do the following:
> 1) Order the rows in the table so that for each column, identical values
> are placed next to each other for as far as possible (the row order that
> optimizes one column will probably not be optimal for other columns).
> 2) For each column, identify the stretches that contain only one
> distinct value. Save that value together with the IDs of the start and
> end rows, and delete the stretch.
>
> It is not obvious how to do a perfect optimization in step 1), at
> least not to me - I'm sure a skilled mathematician would know exactly
> how to do it. But here's a simple approach that would get us part of the
> way:
> 1.1) Grab the column with the most redundancy (fewest distinct values)
> and sort it into groups according to the distinct values.
> 1.2) Within each of these groups, grab the column with the next most
> redundancy and sort into groups according to the distinct values.
> And so on. Stop whenever groups become so small that there's nothing to
> gain.
>
> Such an analysis would make it much less expensive to combine
> same-schema tables, and having everything in the same table is really
> convenient. It would obviously save a lot of storage space, and I
> imagine it would enable more efficient queries too - having to check
> just 3 values instead of the thousands (or even millions) they may
> replace must give a considerable gain.
>
> What is the big benefit of not having ordered rows? I imagine it could
> be a disadvantage for dynamic databases, but for a static database like
> mine, which won't be modified except for maybe adding new data once a
> year, I imagine an optimization including row ordering could be highly
> beneficial.

Oracle supports something akin to this. On an index-organized table you
can tell it to pull part of the index key out of individual rows, e.g.:

CREATE TABLE (a, b, c, d, e, f)
    PRIMARY KEY (a, b, c, d)
    INDEX ORGANIZED
    COMBINE (a, b);

In this case, every time the combination of (a, b) changes, Oracle
stores a special record of some kind that indicates the change, and it
doesn't store a or b with each row. (Note that I'm guessing at the
syntax; it's been a while since I've used Oracle.)

> ##############################
>
> Jim C. Nasby wrote:
>
> > What you seem to be looking for is a form of partitioning. PostgreSQL
> > doesn't currently support partitioning of this form, but there's work
> > in progress to change that.
>
> Any idea how far out in the future this is? Would it make the
> optimization process described above (reply to Tom Lane) obsolete? Well,
> maybe my ideas about an optimal solution just illustrate lack of
> knowledge about SQL, but I'm hoping somebody can see what I'm trying to
> suggest.

Well, the first important step is slated for 8.1: effective partition
elimination. You can search for more info, especially in the BizGres
archives. The next step is creating actual partitioning syntax to make
managing partitions easier.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf         cell: 512-569-9461
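The 8.1-era partition elimination mentioned above is built on table inheritance plus CHECK constraints; a minimal sketch, with hypothetical table and column names:

```sql
-- Hedged sketch of 8.1-style partitioning: one parent table plus
-- child tables that INHERIT from it, each carrying a CHECK constraint
-- describing the slice of data it holds.
CREATE TABLE beamdata (filenm text, yr int, val float8);

CREATE TABLE beamdata_2004 (CHECK (yr = 2004)) INHERITS (beamdata);
CREATE TABLE beamdata_2005 (CHECK (yr = 2005)) INHERITS (beamdata);

-- With constraint exclusion enabled, the planner skips any child whose
-- CHECK constraint contradicts the WHERE clause.
SET constraint_exclusion = on;
SELECT * FROM beamdata WHERE yr = 2005;  -- only beamdata_2005 is scanned
```

This gets the "check the small condition before diving into the large data" behavior without one table per file: the planner prunes whole partitions up front.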
I need to fetch strings from a database with ECPG and then sort them in
C. Here is one of my failed attempts:

###########################

int main(int argc, char *argv[])
{
  int maxlen = 20;
  long nrec;

  EXEC SQL BEGIN DECLARE SECTION;
  varchar filenms[][maxlen] = NULL;
  char dbnm[50];
  EXEC SQL END DECLARE SECTION;

  sprintf(dbnm, "%s", argv[1]);
  EXEC SQL CONNECT TO :dbnm;
  EXEC SQL SELECT filenm INTO :filenms FROM beamdata;
  nrec = sqlca.sqlerrd[2];   /* Returns number of rows retrieved */
  EXEC SQL COMMIT;
  EXEC SQL DISCONNECT;

  qsort(filenms, nrec, maxlen * sizeof(char), scmp);
  free(filenms);
  return 0;
}

static int scmp(const void *sp1, const void *sp2)
{
  return strcmp(*(char **)sp1, *(char **)sp2);
}

###########################

It compiles OK, but I get garbage in the variable filenms. If I change
the declaration of filenms to:

  char **filenms = NULL;

the SQL query returns strings OK. But the strings have variable length,
and I need to specify one length in qsort, so it won't work. Another
attempt, trying to specify the string length:

  char (*filenms)[maxlen] = NULL;

won't compile; ECPG doesn't accept this syntax. And strcmp crashes
(segmentation fault) in function scmp regardless of what I try. Not an
SQL error, I know, but if anybody can tell me why, I'd be grateful.

Any suggestions?

Thanks,
Poul Jensen
Please don't use "reply" to start a new thread, thanks.

On Fri, Sep 08, 2006 at 05:55:44AM -0800, Poul Jensen wrote:
> I need to fetch strings from a database with ECPG and then sort them in
> C. Here is one of my failed attempts:

<snip>

> varchar filenms[][maxlen]=NULL;

I think you need to reread the C manual as to the difference between
arrays and pointers. I don't know about ECPG, but I think your other
attempt, using "char **", was much closer to the mark.

> It compiles ok, but I get garbage in variable filenms. If I change the
> declaration of filenms to:
> char **filenms=NULL;
> the SQL query returns strings ok. But the strings have variable length,
> and I need to specify one length in qsort, so it won't work. Another

Since you're declaring an array of pointers to char, the width you have
to specify to qsort would be sizeof(char*). I think you can just use
the normal strcmp() function with qsort then.

Of course, you could get the database to sort them for you...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:

<snip>

> Since you're declaring an array of pointers to char, the width you have
> to specify to qsort would be sizeof(char*). I think you can just use
> the normal strcmp() function with qsort then.
>
> Of course, you could get the database to sort them for you...
>
> Hope this helps,

I'm afraid it didn't, for various reasons, but I appreciate you trying.

What I ended up doing was simply declaring an extra array outside the
SQL declare section:

  char (*tmp)[maxlen];

then allocating the required memory once it was known:

  if ((tmp = malloc(maxlen * nrec * sizeof(char))) == NULL) {
    fprintf(stderr, "Memory allocation failure\n");
    exit(-1);
  }

and then just copying the strings into it one by one. As for strcmp, it
*can* be used directly in qsort, but not without some manipulation:

  qsort(tmp, nrec, maxlen * sizeof(char),
        (int (*)(const void *, const void *)) &strcmp);

This resolved my issues. For now. ;-)