I've got multiple identical databases (distributed across several servers) and need to gather them into one single point for data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database, merge them, and put the result into one single big database.
To be able to write queries, and to know which database each row came from, we will add a single DatabaseID column to each target table.
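To make that concrete, here's a minimal sketch of building such a query (every identifier below is a placeholder, not a real name): tagging rows only requires selecting a constant next to the source columns.

```fsharp
// Sketch only: the per-source query just selects a constant DatabaseID
// alongside the source columns. All identifiers here are placeholders.
let selectWithDatabaseId (databaseId: int) (table: string) =
    sprintf "SELECT %d AS DatabaseID, t.* FROM %s AS t" databaseId table

// selectWithDatabaseId 17 "[Server17].[SourceDb].[dbo].[Table1]"
// => "SELECT 17 AS DatabaseID, t.* FROM [Server17].[SourceDb].[dbo].[Table1] AS t"
```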
Editing the source tables is not an option; they belong to proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
- Easy to set up
- Easy to maintain
- Preferably easy to adjust if the database schema changes
- Reliable, logging/alarm if something fails
- Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not to modify it.
Maybe we could copy all the data into separate databases, and then run a local job on the target server to merge the tables?
It also seems like a lot of work if we need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
The last option (?) is to write a custom application tailored to our needs. A bigger time investment, but it would at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000. We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program that opens a connection to the SQL Server where we want the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from certain tables, adding an extra DatabaseID column to each row to say which database it came from. Which servers to fetch from, and which tables and columns, is a combination of text-file configuration and hard-coded values (heh :D). It's not super fast (sequential fetching so far), but it's absolutely manageable, and the data processing we do afterwards takes far longer.
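For the curious, here's a minimal sketch of the fetch step, assuming the target server has the sources set up as linked servers. All names in it (target tables, SourceDb, linked server names, columns) are hypothetical, not our real configuration.

```fsharp
// Minimal sketch of the per-table copy over linked servers.
// Every server/database/table/column name below is a placeholder.
open System.Data.SqlClient

let fetchTable (conn: SqlConnection) (databaseId: int) (linkedServer: string)
               (table: string) (columns: string) =
    // Server-side INSERT ... SELECT; the constant DatabaseID tags each row.
    let sql =
        sprintf "INSERT INTO dbo.%s (DatabaseID, %s) SELECT %d, %s FROM [%s].[SourceDb].[dbo].[%s]"
            table columns databaseId columns linkedServer table
    use cmd = new SqlCommand(sql, conn)
    cmd.CommandTimeout <- 600  // some of the tables are large
    cmd.ExecuteNonQuery() |> ignore

// Sequential version: one pass over every (DatabaseID, linked server) pair.
let fetchAll (connStr: string) (sources: (int * string) list)
             (tables: (string * string) list) =
    use conn = new SqlConnection(connStr)
    conn.Open()
    for (dbId, server) in sources do
        for (table, columns) in tables do
            fetchTable conn dbId server table columns
```

Whether you run the copy as a server-side INSERT ... SELECT (as in this sketch) or fetch the rows client-side and bulk insert them is mostly a throughput trade-off; both fit the same per-table loop.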
Future improvements could be to:
- improve error handling, if it turns out to be a problem (e.g. if a server isn't online).
- implement parallel fetching, to reduce the total fetch time (see the sketch below).
- figure out whether it's enough to fetch only some of the rows, e.g. only what's been added or updated since the last run.
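A rough sketch of the parallel variant, reusing the hypothetical fetchTable from the sketch above. Each source server gets its own async and its own connection, since a SqlConnection must not be shared across threads.

```fsharp
// Sketch of the parallel variant: one async (and one connection) per source
// server, run with Async.Parallel. Reuses the hypothetical fetchTable above.
let fetchAllParallel (connStr: string) (sources: (int * string) list)
                     (tables: (string * string) list) =
    sources
    |> List.map (fun (dbId, server) ->
        async {
            // One connection per source; SqlConnection is not thread-safe.
            use conn = new SqlConnection(connStr)
            conn.Open()
            for (table, columns) in tables do
                fetchTable conn dbId server table columns
        })
    |> Async.Parallel
    |> Async.RunSynchronously
    |> ignore
```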
All in all it turned out to be quite simple, with no dependencies on other products, and it works well in practice.