Re: Maximum number of tables - Mailing list pgsql-general

From: Manos Karpathiotakis
Subject: Re: Maximum number of tables
Msg-id: AANLkTikuwoiC34W54VDVX4SZgN2PQHHQrJ-aKSvYLN0x@mail.gmail.com
In response to: Re: Maximum number of tables (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Maximum number of tables (Dmitriy Igrishin <dmitigr@gmail.com>)
List: pgsql-general
Let me explain a few things about our dataset. We are using a system named Sesame [1] that stores and queries RDF data. In our case, it uses Postgres as a relational backend. In RDF, data are triples. Here is an example of an RDF triple: 

ex:Postgres rdf:type ex:RDBMS 

Triples are stored in Postgres, and we can use one of two storage schemes.

A monolithic scheme stores every triple in one huge table, Triple(int subject, int predicate, int object); some additional information is stored in other tables of the form dictionary(int id, string original_value). This results in a schema with approximately 10 tables.
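
A minimal sketch of this schema in SQL (table and column names follow the description above; the actual DDL generated by Sesame may differ):

    -- One table holds every triple, with values integer-encoded
    -- through the dictionary tables.
    CREATE TABLE triple (
        subject   int NOT NULL,
        predicate int NOT NULL,
        object    int NOT NULL
    );

    -- Dictionary tables map encoded ids back to their original values.
    CREATE TABLE dictionary (
        id             int PRIMARY KEY,
        original_value text NOT NULL
    );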

A per-predicate scheme can also be used. This storage scheme creates a table for every distinct predicate. For example, to store the aforementioned triple, we would create a table type(int subject, int object) and insert a tuple with the encoded values for ex:Postgres and ex:RDBMS.
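
As an illustrative sketch (the encoded ids are made up for the example):

    -- One table per distinct predicate; this one backs rdf:type.
    CREATE TABLE type (
        subject int NOT NULL,
        object  int NOT NULL
    );

    -- Store ex:Postgres rdf:type ex:RDBMS, assuming the dictionary
    -- maps ex:Postgres -> 1 and ex:RDBMS -> 2.
    INSERT INTO type (subject, object) VALUES (1, 2);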

Queries over RDF data can be expressed in the SPARQL query language. Sesame translates SPARQL queries to SQL queries according to the storage scheme in use. So, you can imagine that under the monolithic storage scheme, queries are translated to SQL queries with many self-joins on one huge triple table, whereas under the per-predicate scheme they are translated to many joins between smaller tables.
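
For instance, a SPARQL query with two triple patterns, such as

    SELECT ?s ?l WHERE { ?s rdf:type ex:RDBMS . ?s ex:label ?l }

would translate roughly as follows under each scheme (a simplified illustration with made-up predicate ids and a hypothetical ex:label predicate; Sesame's generated SQL is more involved):

    -- Monolithic scheme: self-joins on the huge triple table
    -- (10 = rdf:type, 11 = ex:label, 2 = ex:RDBMS).
    SELECT t1.subject, t2.object
    FROM triple t1
    JOIN triple t2 ON t2.subject = t1.subject
    WHERE t1.predicate = 10 AND t1.object = 2
      AND t2.predicate = 11;

    -- Per-predicate scheme: joins between smaller predicate tables.
    SELECT ty.subject, la.object
    FROM type ty
    JOIN label la ON la.subject = ty.subject
    WHERE ty.object = 2;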

In our case, we want to store 111M triples (and we would like to experiment with even larger datasets) that contain approximately 10,000 distinct predicates. This means that we would strongly prefer the per-predicate storage scheme, to get faster query execution. However, we are not experienced in tuning Postgres for this kind of data.

Until now, we have disabled WAL, disabled autocommit, and increased shared_buffers to 512MB and temp_buffers to 64MB, but we haven't tried dropping indexes and foreign key constraints, because that would require some code restructuring (I understand, however, the performance increase we would get by disabling them).
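
For reference, the relevant part of our postgresql.conf currently looks roughly like this (shared_buffers and temp_buffers are the values we have actually set; the commented parameters are shown at their 9.0 defaults as candidates for tuning, not as recommendations):

    shared_buffers = 512MB          # raised from the default
    temp_buffers = 64MB             # raised from the default
    #maintenance_work_mem = 16MB    # default; larger helps index builds
    #checkpoint_segments = 3        # default; larger spaces out checkpoints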

We are currently loading the dataset in chunks, each of which performs 24,000,000 insertions into the database.
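
Conceptually, each chunk is one large transaction of INSERTs issued through JDBC; a COPY-based equivalent along the lines of the bulk-loading documentation (which we have not tried yet; the file path below is hypothetical) would be:

    -- Current pattern: one chunk per transaction, autocommit disabled.
    BEGIN;
    -- ... 24,000,000 INSERTs into the predicate tables ...
    COMMIT;

    -- COPY-based alternative for a single predicate table, assuming
    -- the chunk can be staged as a CSV file first:
    COPY type (subject, object) FROM '/path/to/chunk.csv' WITH CSV;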

Could you suggest some values for shared_buffers, temp_buffers, maintenance_work_mem, checkpoint_segments, or other relevant parameters that we could use as a starting point?

Best Regards,
Manos Karpathiotakis

[1] http://www.openrdf.org/

On Tue, Mar 15, 2011 at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Manos Karpathiotakis <mk@di.uoa.gr> writes:
> Hi all,
> I am using postgres 9.0 under CentOS 5.5 (Dual Xeon Quad Core @ 2,44 GHz,
> 64GB RAM, 2TB RAID 5). In my case, postgres is used as a backend for the RDF
> store Sesame.
> I am trying to store a dataset that consists of approximately 900.000.000
> insertions (organized in either 10 tables, or in an arbitrary number of
> tables exceeding 2000). Loading the data in 10 tables takes about two days,
> while in the 2nd case, an org.postgresql.util.PSQLException: ERROR: out of
> shared memory error is thrown.

I'm guessing that you're seeing this through some client-side code that
helpfully suppresses the HINT about raising max_locks_per_transaction
:-(

If your application tries to touch all 2000 tables in one transaction,
you will need to raise that parameter to avoid running out of locktable
space for the AccessShareLocks it needs on all those tables.
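
For example, in postgresql.conf (the value is illustrative; size it to the
number of tables one transaction touches):

    max_locks_per_transaction = 256   # default is 64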

However ... generally speaking, I'd guess that whatever you did to
refactor 10 tables into 2000 was a bad idea --- one table with an extra
key column is almost always a better design than N basically-identical
tables.  The latter will be a *lot* harder to use, and probably won't
fix your performance problem anyway.  You should attack the performance
problem in a different way.  Have you read
http://www.postgresql.org/docs/9.0/static/populate.html ?

                       regards, tom lane



--
===================================================
Manos Karpathiotakis
National & Kapodistrian University of Athens
Department of Informatics & Telecommunications, Office B25
Management of Data & Information Knowledge Group
Panepistimioupolis, Ilissia
GR-15784 Athens, Greece
Tel: +30.210.727.5159
Fax: +30.210.727.5214
e-mail: mk@di.uoa.gr
===================================================
