Re: Maximum number of tables - Mailing list pgsql-general

From: Manos Karpathiotakis
Subject: Re: Maximum number of tables
Msg-id: AANLkTikuwoiC34W54VDVX4SZgN2PQHHQrJ-aKSvYLN0x@mail.gmail.com
In response to: Re: Maximum number of tables (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Maximum number of tables (Dmitriy Igrishin <dmitigr@gmail.com>)
List: pgsql-general
Let me explain a few things about our dataset. We are using a system named Sesame [1] that stores and queries RDF data. In our case, it uses Postgres as a relational backend. In RDF, data are triples. Here is an example of an RDF triple: 

ex:Postgres rdf:type ex:RDBMS 

Triples are stored in Postgres, and we can use one of two storage schemes.

A monolithic scheme stores every triple in one huge table, Triple(int subject, int predicate, int object); some additional information is stored in other tables of the form dictionary(int id, string original_value). This results in a schema with approximately 10 tables.
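
A minimal sketch of this schema in SQL (table and column names follow the description above; the actual DDL generated by Sesame may differ):

    -- One table holds every triple, with values integer-encoded
    -- through the dictionary tables.
    CREATE TABLE triple (
        subject   int NOT NULL,
        predicate int NOT NULL,
        object    int NOT NULL
    );

    -- Dictionary tables map encoded ids back to their original values.
    CREATE TABLE dictionary (
        id             int PRIMARY KEY,
        original_value text NOT NULL
    );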

A per-predicate scheme can also be used. This storage scheme creates a table for every distinct predicate. For example, to store the aforementioned triple, we would create a table type(int subject, int object) and insert a tuple with the encoded values for ex:Postgres and ex:RDBMS.
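
As an illustrative sketch (the encoded ids are made up for the example):

    -- One table per distinct predicate; this one backs rdf:type.
    CREATE TABLE type (
        subject int NOT NULL,
        object  int NOT NULL
    );

    -- Store ex:Postgres rdf:type ex:RDBMS, assuming the dictionary
    -- maps ex:Postgres -> 1 and ex:RDBMS -> 2.
    INSERT INTO type (subject, object) VALUES (1, 2);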

Queries over RDF data can be expressed in the SPARQL query language. Sesame translates SPARQL queries to SQL queries according to the storage scheme in use. So, you can imagine that under the monolithic storage scheme, queries are translated to SQL queries with many self-joins on one huge triple table, whereas under the per-predicate scheme they are translated to many joins between smaller tables.
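
For instance, a SPARQL query with two triple patterns, such as

    SELECT ?s ?l WHERE { ?s rdf:type ex:RDBMS . ?s ex:label ?l }

would translate roughly as follows under each scheme (a simplified illustration with made-up predicate ids and a hypothetical ex:label predicate; Sesame's generated SQL is more involved):

    -- Monolithic scheme: self-joins on the huge triple table
    -- (10 = rdf:type, 11 = ex:label, 2 = ex:RDBMS).
    SELECT t1.subject, t2.object
    FROM triple t1
    JOIN triple t2 ON t2.subject = t1.subject
    WHERE t1.predicate = 10 AND t1.object = 2
      AND t2.predicate = 11;

    -- Per-predicate scheme: joins between smaller predicate tables.
    SELECT ty.subject, la.object
    FROM type ty
    JOIN label la ON la.subject = ty.subject
    WHERE ty.object = 2;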

In our case, we want to store 111M triples (and we would like to experiment with even larger datasets) that contain approximately 10,000 distinct predicates. This means that we would strongly prefer the per-predicate storage scheme, to get faster query execution. However, we are not experienced in tuning Postgres for this kind of data.

Until now, we have disabled WAL, disabled autocommit, and increased shared_buffers to 512MB and temp_buffers to 64MB, but we haven't tried dropping indexes and foreign key constraints, because that would require some code restructuring (I understand, however, the performance increase we would get by disabling them).
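
For reference, the relevant part of our postgresql.conf currently looks roughly like this (shared_buffers and temp_buffers are the values we have actually set; the commented parameters are shown at their 9.0 defaults as candidates for tuning, not as recommendations):

    shared_buffers = 512MB          # raised from the default
    temp_buffers = 64MB             # raised from the default
    #maintenance_work_mem = 16MB    # default; larger helps index builds
    #checkpoint_segments = 3        # default; larger spaces out checkpoints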

We are currently loading the dataset in chunks, each of which performs 24,000,000 insertions into the database.
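
Conceptually, each chunk is one large transaction of INSERTs issued through JDBC; a COPY-based equivalent along the lines of the bulk-loading documentation (which we have not tried yet; the file path below is hypothetical) would be:

    -- Current pattern: one chunk per transaction, autocommit disabled.
    BEGIN;
    -- ... 24,000,000 INSERTs into the predicate tables ...
    COMMIT;

    -- COPY-based alternative for a single predicate table, assuming
    -- the chunk can be staged as a CSV file first:
    COPY type (subject, object) FROM '/path/to/chunk.csv' WITH CSV;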

Could you suggest some values for shared_buffers, temp_buffers, maintenance_work_mem, checkpoint_segments, or other relevant parameters that we could use as a starting point?

Best Regards,
Manos Karpathiotakis

[1] http://www.openrdf.org/

On Tue, Mar 15, 2011 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Manos Karpathiotakis <mk@di.uoa.gr> writes:
> Hi all,
> I am using postgres 9.0 under CentOS 5.5 (Dual Xeon Quad Core @ 2,44 GHz,
> 64GB RAM, 2TB RAID 5). In my case, postgres is used as a backend for the RDF
> store Sesame.
> I am trying to store a dataset that consists of approximately 900.000.000
> insertions (organized in either 10 tables, or in an arbitrary number of
> tables exceeding 2000). Loading the data in 10 tables takes about two days,
> while in the 2nd case, an org.postgresql.util.PSQLException: ERROR: out of
> shared memory error is thrown.

I'm guessing that you're seeing this through some client-side code that
helpfully suppresses the HINT about raising max_locks_per_transaction
:-(

If your application tries to touch all 2000 tables in one transaction,
you will need to raise that parameter to avoid running out of locktable
space for the AccessShareLocks it needs on all those tables.
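
For example, in postgresql.conf (the value is illustrative; size it to the
number of tables one transaction touches):

    max_locks_per_transaction = 256   # default is 64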

However ... generally speaking, I'd guess that whatever you did to
refactor 10 tables into 2000 was a bad idea --- one table with an extra
key column is almost always a better design than N basically-identical
tables.  The latter will be a *lot* harder to use, and probably won't
fix your performance problem anyway.  You should attack the performance
problem in a different way.  Have you read
http://www.postgresql.org/docs/9.0/static/populate.html ?

                       regards, tom lane



--
===================================================
Manos Karpathiotakis
National & Kapodistrian University of Athens
Department of Informatics & Telecommunications, Office B25
Management of Data & Information Knowledge Group
Panepistimioupolis, Ilissia
GR-15784 Athens, Greece
Tel: +30.210.727.5159
Fax: +30.210.727.5214
e-mail: mk@di.uoa.gr
===================================================
