I have a scenario where several sets of clustered source servers (let's call them A1/A2, B1/B2, etc.) are all connected to a single subscriber (call it S0) via SQL Server 2012 Enterprise transactional replication. The source servers are in clusters and use Always-On. The subscriber, S0, is a standalone SQL box. S0 has a separate DB for each publication from each publisher cluster. These are push subscriptions, and the distributor lives on the source clusters (each cluster has its own distributor).
This works very well for the most part, with billions of transactions replicated every day without errors. Occasionally, though, when either a source cluster fails over or the subscriber is rebooted, a subscription breaks with an error about a PK violation.
Here is a sample series of log entries from the publisher (in this scenario, the subscriber was rebooted for maintenance):
- 6:19 Repl Dist Subsystem: agent sched for retry: Named pipes provider: No process on other end of the pipe
- 6:21 Repl Dist Subsystem: Violation of PK constraint
- 6:24 Repl Dist Subsystem: Violation of PK constraint
- 6:27 repeats until we reinitialize the subscription...
Why is this happening? Shouldn't the Distribution Agent know what it has already delivered?
Notes:
- Nothing else is writing to the subscriber DB
- The tables don't have identity columns or anything else odd on them
- The subscriber is read-only and is essentially a reporting replica of the source clusters
- The issue doesn't happen on every reboot, nor does it happen for every publication. On average, 1 publication out of 38 will have an issue on every third reboot. The publication (and specific table) that errors is random, and I have not seen any correlation with specific source clusters, times of day, table structures, etc.
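For anyone who wants to see what the subscriber has recorded as delivered, here is a minimal sketch of a query that could be run on S0 after a failure. It reads the `MSreplication_subscriptions` table that replication keeps in each subscription database. The database name is hypothetical, and I am assuming the `transaction_timestamp` column is the position the Distribution Agent resumes from.

```sql
-- Run on the subscriber (S0). The database name is a placeholder, not a real one.
USE [SubscriberDb_A_Pub1];

-- One row per subscription in this database.
-- transaction_timestamp: the last transaction sequence number the
-- Distribution Agent recorded as applied here (assumed resume point).
SELECT publisher,
       publisher_db,
       publication,
       [time]                AS last_agent_update,
       transaction_timestamp AS last_applied_xact_seqno
FROM dbo.MSreplication_subscriptions;
```

If that value turns out to be older than the rows actually present in the subscriber tables after a reboot, the agent would be re-sending commands that were already applied, which would be consistent with the PK violations.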
Any help is appreciated!