I have a merge replication set up between one Publisher and 3 Subscribers, replicating a single database with around a 100 different tables of various sizes. It's "real-time" replication running continually 24x7, and it's worked fine for the last year or more: the volume of changes is small but constant throughout the day.
However, last night it all started to go wrong. Several tables in the database were "locked": applications couldn't access them. I did the usual reboot, and everything seemed fine, and I didn't think much of it. However, this morning the same thing was happening: vital tables were locked, and digging into sp_who2
showed that replmerg
was chewing on something.
I tried to reinitialise the subscription; this seems to have made things worse: the replmerge
log is telling me that the snapshot is out of date. When I try to run the snapshot agent - which I've done a few times in recent memory and it's never taken more than a couple of minutes - it gets 46% of the way through before telling getting stuck at:
[46%] Setting up the publication for filtered articles
... and eventually timing out. I've dug into the queries that are running when it's stuck, and it appears to be chewing on:
select @changes_in_tombstone = count(*)
from dbo.MSmerge_tombstone with (rowlock)
where generation = @g
I've looked at the MSMerge_tombstone
table, and it's got 21 million rows in it (a 14 day retention time): I've no idea if this is good or bad.
So I don't know how to proceed: I really don't want to have to delete and recreate the publication - there are changes at the subscribers that I don't want to lose - but it appears I'm stuck in a Catch 22: I need to create a new snapshot in order to get the replication going again, but I can't create a snapshot as it times out.
Should I resign myself to recreating the publication, or is there a way to fix this?