Thread: Design advice requested
Apologies for the long email.
I would appreciate some advice on designing a database that could
become very complicated. The purpose of the database is to serve as a
source for bibliometric research.
Some background: I work with various sources:
* The largest source consists of tagged delimited text files. The
  database built from these files currently takes up about 230 GB. I
  receive weekly updates of new and/or corrected data, and I use a
  Python script to parse the files and insert the data into the
  database (a rough loading sketch follows below the list).
* The other sources are much smaller (up to about 300 000 records in
  CSV files or Microsoft Access databases) and come in various
  formats. They are also updated from time to time.
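For the weekly updates my current idea is to load each delivery into
a staging table and merge it into the level-0 table inside a
transaction, so that corrected records replace the old rows and new
records get added. Roughly like this (the table and column names are
just illustrative, not my actual schema):

    -- hypothetical level-0 table keyed on the source's record id
    CREATE TABLE level0_article (
        ut       text PRIMARY KEY,  -- source record identifier
        title    text,
        abstract text,
        raw_rec  text               -- remaining tagged fields, kept as-is
    );

    -- staging table filled by the Python parser (e.g. via COPY)
    CREATE TEMP TABLE staging_article (LIKE level0_article);

    -- merge one weekly delivery: corrected records replace old ones
    BEGIN;
    DELETE FROM level0_article
     WHERE ut IN (SELECT ut FROM staging_article);
    INSERT INTO level0_article
    SELECT * FROM staging_article;
    COMMIT;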
My aim is a database that can easily accommodate the raw data while
also providing a user-friendly structure for the researchers who use
it.
I am working on a multilevel database:
* level 0: where the data from the various sources enters. The
  structure mirrors that of the source. No editing of data takes
  place on this level other than the updates coming from the source.
* level 1: where the data from level 0 is made available in a common
  structure that is more suitable for the type of research we are
  doing, but still kept separate for each source. Not all of the data
  is important at this level, so not all of the fields from level 0
  are represented here. Data can be added or modified and value can
  be added (some extra, source-specific tables hold the value-added
  data). This level is populated from level 0 in such a way that
  value-added data does not get overwritten (a rough sketch follows
  below the list).
  I found a valuable basis for the design of this level in:
  Mallig, Nicolai: "A relational database for bibliometric analysis",
  Journal of Informetrics 4(4), pp. 564-580, October 2010.
  DOI: 10.1016/j.joi.2010.06.007
* level 2: an abstraction layer where queries spanning the various
  sources run. This is a read-only layer.
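The way I plan to keep value-added data safe on level 1 is to put it
in its own source-specific tables, keyed on the level-0 identifier,
so that refreshing the base level-1 tables from level 0 never touches
it. A rough sketch (names are illustrative only):

    -- level-1 base table, refreshed from level 0 (not truncated)
    CREATE TABLE level1_article (
        ut       text PRIMARY KEY,  -- carries the level-0 identifier
        title    text,
        abstract text,
        pub_year integer
    );

    -- value-added data lives in a separate table, so a refresh of
    -- the base table can never overwrite it
    CREATE TABLE level1_article_classification (
        ut          text PRIMARY KEY REFERENCES level1_article,
        field_code  text,
        verified_by text
    );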
I have considered using views for a large part of level 1, but that
does not seem practical: queries involving many millions of records
(there are at present about 43 000 000 articles in the article table)
take a long time to execute, especially when aggregates are involved.
I want to conserve space on the one hand and speed up queries on the
other hand. Sometimes I think it is not possible to have both.
Instead of views, a series of tables created from standard queries
may speed up the process; I have thought of creating tables that are
repopulated periodically (each night, or every few hours) using those
standard queries.
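This is essentially what materialized views provide; where they are
not available, an ordinary table refilled by a scheduled job gives
the same effect. For example (the names are again just illustrative):

    -- precomputed aggregate over the article data, refreshed off-peak
    CREATE MATERIALIZED VIEW level2_articles_per_author AS
    SELECT author_id, count(*) AS article_count
      FROM level1_article_author
     GROUP BY author_id;

    CREATE INDEX ON level2_articles_per_author (author_id);

    -- run each night or every few hours, e.g. from cron:
    REFRESH MATERIALIZED VIEW level2_articles_per_author;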
Some of the queries will make use of full-text search, as they
involve large fields like article abstracts.
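My plan for the abstracts is PostgreSQL's built-in full-text search
with a GIN expression index, so those searches do not have to scan
the whole article table. Something like this (names illustrative):

    -- GIN expression index over the abstracts
    CREATE INDEX article_abstract_fts_idx
        ON level1_article
     USING gin (to_tsvector('english', abstract));

    -- a query must use the same expression for the index to be used
    SELECT ut, title
      FROM level1_article
     WHERE to_tsvector('english', abstract)
           @@ to_tsquery('english', 'citation & network');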
Basically, my request is for advice on how to make this database as
fast as possible, with as little duplicated data as possible, while
providing both for the updates on level 0 and for the value-added
editing on level 1.
Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)
On 08/05/13 21:21, Johann Spies wrote:
> Basically, my request is for advice on how to make this database as
> fast as possible, with as little duplicated data as possible, while
> providing both for the updates on level 0 and for the value-added
> editing on level 1.
>
> Regards
> Johann

Hi Johann.

Firstly, don't worry too much about speed in the design phase. There
may be differences of opinion here, but mine is that even with
database design the first fundamental layer is the relational model.
That is, regardless of how the raw data is presented to you (CSV, raw
text, other relational models or ideas), the same back-to-basics
problem must be solved: what is the most effective and efficient way
of storing this data that allows for database flexibility and
scalability (future adaptation of new relational models)?

Secondly, assuming the CSV and other raw data is in a flat ("fat")
table format (containing columns of duplicate data), it's your job to
determine how to break it down into separate sections (tables) of
data and how they relate to each other (normalization): one-to-many,
many-to-many, etc. There are also other things to consider (e.g. data
history and revision), but those are the basics.

Thirdly, it's the queries and the relationships they reveal (joins)
between sections of data (tables) that make the data presentable, and
you can always utilize caches later on for blocks of data in the
database itself (temp tables, MVs, etc.).

TIP: whether it's temp tables, views, or materialized views, it's a
good idea to be consistent with the naming, e.g. "some_view". This
provides a level of abstraction and is handy in the design phase. It
doesn't matter if you are dealing with petabytes of data.

That's all I can suggest without actually looking at a sample of the
data (problem) you are dealing with. It's a matter of breaking it
down into logical steps and having some fun.

Regards,
Julian.
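P.S. As a concrete example of the normalization step: a flat CSV row
like (article title, journal, author name) might break down into
something like this, with a many-to-many link between articles and
authors (the names are purely for illustration):

    CREATE TABLE author (
        author_id serial PRIMARY KEY,
        full_name text NOT NULL
    );

    CREATE TABLE article (
        article_id serial PRIMARY KEY,
        title      text NOT NULL,
        journal    text
    );

    -- many-to-many: an article has many authors, an author many articles
    CREATE TABLE article_author (
        article_id integer REFERENCES article,
        author_id  integer REFERENCES author,
        PRIMARY KEY (article_id, author_id)
    );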
On 09/05/13 17:42, Johann Spies wrote:
> Hallo Julian,
>
> Thanks for your reply.
>
>> Firstly, don't worry too much about speed in the design phase.
>> There may be differences of opinion here, but mine is that even
>> with database design the first fundamental layer is the relational
>> model.
>
> It is good to hear, but when a simple query requesting 20 records
> takes nearly 7 minutes to complete, it becomes nearly unusable.

Hi, can you reply to the list? This is a performance question now;
you might want to start a new thread on your performance issues.

Have you utilized indexes? Refer to the Hash Cond:

Hash Cond: ((rauthor.rart_id)::text = (ritem.ut)::text)

Look at "actual time=start..finish" on the planner process blocks and
also at the finish time of the preceding block.

Buffers: shared hit=104662 read=4745067, temp read=1006508 written=1006446

Give the specs of your hardware; more experienced people will be able
to advise on getting the best out of this query.

Once you are reasonably happy with your schema you might want to
consider partitioning the larger datasets. You can also cache (temp)
blocks of data as you see appropriate (snapshots).

http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html

Regards,
Julian.
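P.S. If those join columns are not indexed yet, that is the first
thing I would try: with only 20 rows wanted, the planner may then be
able to use an index-driven plan instead of hashing both tables in
full. Assuming the column names in the quoted plan are the real ones:

    -- btree indexes on the join columns from the quoted Hash Cond
    CREATE INDEX rauthor_rart_id_idx ON rauthor (rart_id);
    CREATE INDEX ritem_ut_idx ON ritem (ut);

    -- then re-run the slow query under EXPLAIN (ANALYZE, BUFFERS)
    -- and compare the actual times and buffer counts with the old plan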