Home > mailing lists

Re: WIP: store additional info in GIN index - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: WIP: store additional info in GIN index
Date	December 23, 2012 23:48:52
Msg-id	50D79861.3030303@fuzzy.cz Whole thread Raw
In response to	Re: WIP: store additional info in GIN index (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: WIP: store additional info in GIN index
List	pgsql-hackers

Tree view

Hi!

On 22.12.2012 17:15, Alexander Korotkov wrote:
>     I'm not saying this is a perfect benchmark, but the differences (of
>     querying) are pretty huge. Not sure where this difference comes from,
>     but it seems to be quite consistent (I usually get +-10% results, which
>     is negligible considering the huge difference).
> 
>     Is this an expected behaviour that will be fixed by another patch?
> 
>  
> Another patches which significantly accelerate index search will be
> provided. This patch changes only GIN posting lists/trees storage.
> However, it wasn't expected that this patch significantly changes index
> scan speed in any direction.

That was exactly my expectation - probably not an improvement, but
definitely not a worse performance.

> 
>     The database contains ~680k messages from the mailing list archives,
>     i.e. about 900 MB of data (in the table), and the GIN index on tsvector
>     is about 900MB too. So the whole dataset nicely fits into memory (8GB
>     RAM), and it seems to be completely CPU bound (no I/O activity at all).
> 
>     The configuration was exactly the same in both cases
> 
>         shared buffers = 1GB
>         work mem = 64 MB
>         maintenance work mem = 256 MB
> 
>     I can either upload the database somewhere, or provide the benchmarking
>     script if needed.
> 
> 
> Unfortunately, I can't reproduce such huge slowdown on my testcases.
> Could you share both database and benchmarking script?

It's strange, but no matter what I do I can't reproduce those results
(with the significant performance decrease). So either I've done some
strange mistake when running those tests, or there was something wrong
with my system, or whatever :-(

But when running the benchmarks now (double-checked everything, properly
repeated the tests, ...), I've noticed a different behaviour. But first
some info about the scripts I use for testing.

All the scripts are available here:
 https://bitbucket.org/tvondra/archie

It's my "hobby project" implementing fulltext mbox archive. It should be
usable but it's still a bit WIP so let me know in case of any issues.

The README should give you all the instructions on how to setup and load
the database. I'm using ~1700 mbox files downloaded from
http://archives.postgresql.org/ for these lists (until 2012/11):
  pgadmin-hackers  pgsql-advocacy  pgsql-announce  pgsql-bugs  pgsql-general  pgsql-hackers  pgsql-jdbc  pgsql-novice
pgsql-odbc pgsql-patches  pgsql-sql

which in the end gives ~677k rows in the 'messages' table, occupying
~5.5GB disk space (including all the indexes etc).

Once you have the data loaded, you need to warmup the database and then
start benchmarking it - I'm using the warmup.py script to both things.
The script is quite simple, it basically just

To warmup the DB, just run this
   ./warmup.py --db archie --duration 300

until the %util drops near 0 (assuming you have enough RAM to fit the
whole database into memory). Then I usually do this as a benchmarking
   ./warmup.py --db archie --duration 300 --no-hash \                           --no-thread --words 1
   ./warmup.py --db archie --duration 300 --no-hash \                           --no-thread --words 2

which runs 60-second tests and outputs one line for worker (by default
equal to the number of CPUs).

The script itself is very simple, it fetches a random message and uses
the tsvector column as a source of words for the actual benchmark. It
takes N words from the tsvector, splits them into groups and performs a
simple fulltext query using plainto_tsquery('word1 word2 ...'). At the
end it prints info including the number of queries per second.

I've run the tests on the current master with and without the v3 patch.
I've tested it with 1GB or 2GB shared buffers, and 32MB or 64MB work mem.

The tests were run for 1, 2, 3, 4 and 5 words, and I've repeated it five
times for each configuration. Duration of each run was 5-minutes.

These are the averages (from the 5 runs) of queries per second for each
combination of parameters:
                         1     2     3     4     5
----------------------------------------------------master     1GB/32MB     19   179   165   127    99patched
1GB/32MB    19   175   163   124    96

master     1GB/64MB     20   181   165   127    99patched    1GB/64MB     19   174   159   120    95
master     2GB/32MB     27   181   165   127    98patched    2GB/32MB     25   176   156   120    93
master     2GB/64MB     27   180   166   128   102patched    2GB/64MB     40   402   364   245   176

There's no significant difference in performance, except for the
2GB/64MB combination. And in that case it's actually the opposite
direction than I've reported before - i.e. this time it's up to 100%
faster than the unpatched master. The results are pretty consistent
(very small variance across the repeated runs), so I'm not sure about
the previous results.

Any idea what might cause such behavior? Why should it happen only with
this particular combination of shared_buffers and work_mem?

kind regards
Tomas

pgsql-hackers by date:

From: Jeff Janes
Date: 23 December 2012, 23:11:27
Subject: initdb and share/postgresql.conf.sample

From: Greg Stark
Date: 24 December 2012, 00:02:24
Subject: Re: initdb and share/postgresql.conf.sample

Re: WIP: store additional info in GIN index - Mailing list pgsql-hackers

Previous

Next