Thread: Merging Data from Multiple DB
Hi, I am working on an application and I need some advice on the data warehousing aspect of it. The situation is this: I have an application that will be used at each one of my branch offices. Each branch office is autonomous, and each has its own instance of the db. There is currently no interaction between branches. However, I need to do centralized reporting, which means I need a way to merge all the data from the branches into the central data warehouse at HQ. Obviously, there is a huge problem with duplicated primary keys.

For performance reasons, each branch must have its own database, and a centralized transactional system is not an option. I was considering just centralizing primary key generation, but that seems very slow too. Segmenting primary keys among the branches is doable, but it is too much of a maintenance nightmare.

What do you suggest? Thanks.
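To make the duplicate-key problem concrete, here is a hypothetical example (the orders table is illustrative, not my actual schema):

-- Every branch runs the same schema, so every branch's sequence starts at 1.
CREATE TABLE orders (
    id     serial PRIMARY KEY,
    placed timestamptz NOT NULL DEFAULT now()
);

-- Branch A's first order gets id = 1, and so does branch B's first order.
-- Copying both rows into one central table violates the primary key, and
-- nothing in the row says which branch it came from.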
> What do you suggest?

In your reporting tool, I suggest using (branch_id, local_primary_key) as the primary key. This way the rows are guaranteed unique across branches.
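For example, the corresponding warehouse table at HQ could look like this (table and column names are hypothetical):

-- The branch id disambiguates rows that share a branch-local key.
CREATE TABLE hq_orders (
    branch_id integer NOT NULL,
    order_id  integer NOT NULL,         -- the branch-local primary key
    placed    timestamptz NOT NULL,
    PRIMARY KEY (branch_id, order_id)   -- unique across all branches
);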
anon permutation wrote:
> For performance reasons, each branch must have its own database, and a
> centralized transactional system is not an option.
>
> I was considering just centralizing primary key generation, but that
> seems very slow too.
>
> Segmenting primary keys among the branches is doable, but it is too much
> of a maintenance nightmare.
>
> What do you suggest?

We have a similar application. What we did is this:

1. Each database instance is assigned a unique identifier, stored in a one-row, one-column table (with a trigger to ensure it stays that way).

2. Write a function that takes two integers, converts them to text, and concatenates them. In our case we convert to hex and concatenate with a delimiter character.

3. Write another function, called something like 'nextrowid', that takes a sequence name as its argument. Use the sequence name to get the next value from the sequence, look up the local unique identifier from the table defined in #1, and pass them both to the function defined in #2.

4. Use nextrowid('seq_name') to generate your primary keys (a sketch of the whole scheme follows below).

HTH,

Joe
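A minimal PostgreSQL sketch of the above (all names are illustrative; the original functions were not posted):

-- 1. One-row table holding this instance's unique identifier.
CREATE TABLE instance_id (id integer NOT NULL);
INSERT INTO instance_id VALUES (42);    -- assigned once per branch

-- The guard trigger rejects INSERT and DELETE, so the table keeps
-- exactly the one row inserted above.
CREATE FUNCTION instance_id_guard() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'instance_id must contain exactly one row';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER instance_id_protect
    BEFORE INSERT OR DELETE ON instance_id
    FOR EACH ROW EXECUTE FUNCTION instance_id_guard();

-- 2. Convert two integers to hex and concatenate with a delimiter.
CREATE FUNCTION make_rowid(integer, bigint) RETURNS text AS $$
    SELECT to_hex($1) || '-' || to_hex($2);
$$ LANGUAGE sql;

-- 3. Combine the local instance id with the sequence's next value.
CREATE FUNCTION nextrowid(seqname text) RETURNS text AS $$
    SELECT make_rowid((SELECT id FROM instance_id), nextval($1));
$$ LANGUAGE sql;

-- 4. Use it as the default for a text primary key.
CREATE SEQUENCE orders_seq;
CREATE TABLE orders (
    id     text PRIMARY KEY DEFAULT nextrowid('orders_seq'),
    placed timestamptz NOT NULL DEFAULT now()
);

With instance id 42, the first row gets id '2a-1', the next '2a-2', and no key can ever collide with one generated at another instance.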
You could also say you can have 2^N databases and 2^(63-N) records in each database: use a BIGSERIAL with the N high bits holding the database number and the 63-N low bits being the actual serial. That's faster than strings for indexing, and you just initialize the sequence to start at the first value for this database and to stop at the last value for this database... just another way to generate GUIDs.

On Mon, 03 Jan 2005 07:20:54 -0800, Joe Conway <mail@joeconway.com> wrote:
> 2. Write a function that takes two integers, converts them to text,
> and concatenates them. In our case we convert to hex and concatenate
> with a delimiter character.
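A sketch of that layout, assuming N = 16 (so up to 65536 databases, each with 2^47 keys; branch number 5 is an arbitrary example). BIGSERIAL would create its own unbounded sequence, so the bounded sequence is created by hand:

-- Branch 5's id range is [5 * 2^47, 6 * 2^47 - 1].
CREATE SEQUENCE orders_id_seq
    MINVALUE   703687441776640      -- 5 * 2^47, first id for branch 5
    MAXVALUE   844424930131967      -- 6 * 2^47 - 1, last id for branch 5
    START WITH 703687441776640
    NO CYCLE;                       -- error out rather than reuse ids

CREATE TABLE orders (
    id     bigint PRIMARY KEY DEFAULT nextval('orders_id_seq'),
    placed timestamptz NOT NULL DEFAULT now()
);

-- The branch number can be recovered from any id with a shift:
SELECT 703687441776640 >> 47;       -- returns 5

Because the ranges never overlap, rows from every branch merge cleanly into one bigint-keyed table at HQ, and a bigint index is cheaper than the equivalent text key.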
Thank you for the feedback. I am considering just centralizing pk generation at HQ. Basically, every so often, each branch will connect back to HQ to get a block of pks for each table. This way the performance impact is limited and maintenance is simple (a sketch of the HQ side follows below).

thanks.
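A minimal sketch of the HQ side, assuming a simple allocator table (all names here are hypothetical, not an established scheme):

-- At HQ: one row per replicated table, recording the next unissued key.
CREATE TABLE pk_block_allocator (
    table_name text PRIMARY KEY,
    next_start bigint NOT NULL DEFAULT 1
);
INSERT INTO pk_block_allocator (table_name) VALUES ('orders');

-- Hand out the block [returned value, returned value + p_block_size - 1].
-- The UPDATE's row lock serializes concurrent requests from branches.
CREATE FUNCTION allocate_pk_block(p_table text, p_block_size bigint)
RETURNS bigint AS $$
    UPDATE pk_block_allocator
       SET next_start = next_start + p_block_size
     WHERE table_name = p_table
    RETURNING next_start - p_block_size;
$$ LANGUAGE sql;

-- A branch connects and fetches, say, 10000 keys for its orders table:
SELECT allocate_pk_block('orders', 10000);   -- returns e.g. 1

The branch then restricts its local sequence to that range (ALTER SEQUENCE ... MINVALUE / MAXVALUE / RESTART) and requests a new block before the current one runs out.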