Re: improvement suggestions for performance design - Mailing list pgsql-performance
From:           Heikki Linnakangas
Subject:        Re: improvement suggestions for performance design
Msg-id:         468D3AF7.4070505@enterprisedb.com
In response to: Re: improvement suggestions for performance design (tfinneid@ifi.uio.no)
List:           pgsql-performance
tfinneid@ifi.uio.no wrote:
>> I would strongly suggest that you use a proper relational schema,
>> instead of storing everything in two tables. I don't know your
>> application, but a schema like that is called Entity-Attribute-Value
>> (though your entity seems to be just posx and posy) and it should raise
>> a big red flag in the mind of any database designer. In particular,
>> constructing queries against an EAV schema is a major pain in the ass.
>> This has been discussed before on the postgresql lists as well; you might
>> want to search and read the previous discussions.
>
> I get your point, but the thing is the attributes have no particular
> relation to each other, other than belonging to the same attribute groups.
> There are no specific rules that state that certain attributes are always
> used together, such as with an address record. It depends on what
> attributes the operator wants to study. This is why I don't find any
> reason to group the attributes into separate tables and columns.

ISTM that a properly normalized schema would look something like this:

create table position (
  posX int not null,
  posY int not null,
  primary key (posX, posY)
);

create table colour (
  posX int not null,
  posY int not null,
  colour varchar(50) not null,
  primary key (posX, posY),
  foreign key (posX, posY) references position (posX, posY)
);

create table population (
  posX int not null,
  posY int not null,
  population int not null,
  primary key (posX, posY),
  foreign key (posX, posY) references position (posX, posY)
);

where colour and population are examples of attributes you want to store.
If you have 100 different attributes, you'll have 100 tables like that.
That may sound like a lot, but it's not. This allows you to use proper
data types for the attributes, as well as constraints and all the other
goodies a good relational data model gives you. It also allows you to
build proper indexes on the attributes.
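As a rough sketch of what that buys you (index and query are hypothetical, assuming the population table above):

```sql
-- Hypothetical: a plain b-tree index on the integer attribute
create index population_value_idx on population (population);

-- A range query like this can then use an index scan
-- instead of scanning every row:
select posX, posY
from population
where population between 100 and 2000;
```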
For example, if you store populations as text, you're going to have a
hard time building an index that lets you efficiently query for positions
with a population between 100 and 2000.

These are all imaginary examples, but what I'm trying to point out here
is that a proper relational schema allows you to manage and query your
data much more easily and with more flexibility, and it allows for future
extensions.

A normalized schema will also take less space, which means less I/O and
more performance, because there's no need to store metadata like
data_type and attr_type on every row.

For performance reasons, you might actually want to not store the
position table at all in the above schema. An alternative design would be
to have a single table, with one column per attribute:

create table position (
  posX int not null,
  posY int not null,
  colour varchar(50),
  population int,
  ...
  primary key (posX, posY)
);

This is more space-efficient, especially if you have a lot of attributes
on the same coordinates. You can easily add and drop columns as needed,
using ALTER TABLE.

> I am still looking into the design of the tables, but I need to get a
> proper test harness running before I can start ruling things out. And a
> part of that is, for example, efficient ways of transferring the insert
> data from the client to the db, instead of just single-command inserts.
> This is where bulk transfer by arrays probably would be preferable.

Before you start fiddling with functions, I'd suggest that you try
batching the inserts with the JDBC PreparedStatement batch facility.

Splitting the inserts into multiple threads in your application sounds
messy. The inserts would have to be in separate transactions, for
example. Presumably your CORBA ORB will spawn multiple threads for you
when there are a lot of requests coming in, so the overall throughput
should be the same with a single thread per request.

BTW, I concur with Y Sidhu that with data volumes as high as yours,
partitioning is a good idea.
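A minimal sketch of inheritance-based partitioning, the scheme PostgreSQL currently supports (the choice of posX as the partitioning column and the range boundaries are made up for illustration; a standalone parent table without the constraints from the earlier schema keeps the example short):

```sql
-- Hypothetical parent table for one attribute
create table population (
  posX int not null,
  posY int not null,
  population int not null
);

-- Child tables each hold one range of posX
create table population_p0 (
  check (posX >= 0 and posX < 1000)
) inherits (population);

create table population_p1 (
  check (posX >= 1000 and posX < 2000)
) inherits (population);

-- With constraint exclusion enabled, a query restricted to one
-- range only scans the matching child table
set constraint_exclusion = on;
select * from population where posX = 1500;
```

Maintenance commands like VACUUM or CREATE INDEX can then be run against each child table individually.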
It's a lot easier to manage twenty 10 GB table partitions than one
200 GB table. For example, VACUUM, CLUSTER and CREATE INDEX can be done
partition by partition, instead of as a single huge operation that runs
for hours. Though if you choose to have just one table per attribute
type, each table might be conveniently small by nature, so that no
partitioning is required.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com