Thread: Splitting Postgres into Separate Clusters?
Hello,
I'm running a specialized search engine that indexes a few tens of millions of web pages, keeping everything in Postgres, and one problem I'm starting to see is poor cache hit rates. My database has two or three tables just for the text of the scraped pages, with one row every time a page is scraped and a `text` column for the HTML. These tables are almost-but-not-quite write-only. They are only read by one job, which uses them to create other objects in the system. I'd like the rest of the database to be in-memory all the time, but I don't really care if I have to read these tables from disk.

To keep my problem tables from dragging down performance on the rest of the system, I'm thinking of splitting them out. I suppose I could save the HTML outside of Postgres entirely, but I'm wondering if a nicer solution would be to keep these tables in a separate "cluster" (i.e. /var/lib/postgresql/9.1/something_else -- I hope this is the right terminology). Then I could tune that cluster differently from the main cluster, or even put it on a different machine. And I could still use dblink to query both clusters conveniently (I think; this isn't really that important).

Does this seem like a worthwhile approach? Is there a better way to deal with a few out-sized tables?
Thanks,
Paul
--
_________________________________
Pulchritudo splendor veritatis.
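
For concreteness, the dblink arrangement described above would look roughly like this from the main cluster. This is only a sketch: the second cluster's port, the connection string, and the scraped_pages(page_id, html) table are all assumptions for illustration, not details from the original setup.

CREATE EXTENSION dblink;

-- Pull one scraped page from the other cluster (assumed to listen on port 5433).
SELECT page_id, html
FROM dblink('host=localhost port=5433 dbname=scraper',
            'SELECT page_id, html FROM scraped_pages WHERE page_id = 42')
     AS t(page_id integer, html text);

dblink materializes the whole remote result set on the calling cluster before any further processing, so it suits occasional lookups better than hot-path joins.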
On Mar 9, 2013, at 11:54 AM, Paul Jungwirth wrote:
> Hello,
>
> I'm running a specialized search engine that indexes a few tens of millions of web pages, keeping everything in Postgres, and one problem I'm starting to see is poor cache hit rates. My database has two or three tables just for the text of the scraped pages, with one row every time a page is scraped and a `text` column for the HTML. These tables are almost-but-not-quite write-only. They are only read by one job, which uses them to create other objects in the system. I'd like the rest of the database to be in-memory all the time, but I don't really care if I have to read these tables from disk. To keep my problem tables from dragging down performance on the rest of the system, I'm thinking of splitting them out. I suppose I could save the HTML outside of Postgres entirely, but I'm wondering if a nicer solution would be to keep these tables in a separate "cluster" (i.e. /var/lib/postgresql/9.1/something_else -- I hope this is the right terminology). Then I could tune that cluster differently from the main cluster, or even put it on a different machine. And I could still use dblink to query both clusters conveniently (I think; this isn't really that important). Does this seem like a worthwhile approach? Is there a better way to deal with a few out-sized tables?
Well, what problem exactly are you trying to solve? Having large tables itself isn't a problem, but it often tends to imply other things that might be problematic:
- large vacuum or analyze times. (Probably fixable in your case with per-table autovacuum thresholds.)
- large disk usage. (Probably fixable in your case with tablespaces.)
- slow selects or updates. (Probably fixable in your case with partitioning.)
- blowing out your buffer cache with useless dirty pages - this is where you might want to look into separate servers.
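
The first three of these are per-table SQL settings rather than server-wide changes. A minimal sketch, again using a hypothetical scraped_pages table and a made-up tablespace path:

-- Per-table autovacuum thresholds: tune the big, append-mostly table separately.
ALTER TABLE scraped_pages SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);

-- Tablespaces: move the bulk HTML to separate storage while staying in the same cluster.
-- The directory must already exist and be owned by the postgres OS user;
-- SET TABLESPACE rewrites the table, so it holds an exclusive lock while copying.
CREATE TABLESPACE bulk_html LOCATION '/mnt/slow_disk/pg_bulk';
ALTER TABLE scraped_pages SET TABLESPACE bulk_html;

(Partitioning on 9.1 would be the inheritance-plus-triggers pattern; there is no declarative syntax to sketch in a single statement.)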
On Mon, Mar 11, 2013 at 12:17 PM, Ben Chobot <bench@silentmedia.com> wrote:
> Well, what problem exactly are you trying to solve?
> Having large tables itself isn't a problem, but it often
> tends to imply other things that might be problematic:

I'm trying to troubleshoot a very low cache hit rate as returned by this query:

SELECT sum(heap_blks_read) as heap_read,
       sum(heap_blks_hit) as heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM pg_statio_user_tables;

So I think that's your fourth option:

> - blowing out your buffer cache with useless dirty pages - this is where you might want to look into separate servers.

So it sounds like I'm on the right track. But a separate cluster/server seems like a drastic solution.

Thanks,
Paul
--
_________________________________
Pulchritudo splendor veritatis.
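
One way to check whether it really is the scrape-text tables dragging that ratio down is to break the same counters out per table. A sketch against the same pg_statio_user_tables view used above:

-- Tables responsible for the most block reads that missed shared_buffers.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;

Note that these counters only see PostgreSQL's shared_buffers: a block served from the operating system's page cache still counts as a read here, so the effective hit rate is somewhat better than the ratio suggests.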