Having a merge publication for replicating BLOBs (type image
), got very high tempdb disk I/O for my size of data. Publication is download-only and has no filters.
High disk I/O is caused by synchronization (when no subscribers are synchronizing, everything is ok), strongly correlated with number of subscribers. It happens even when no data is changed at Publisher between synchronizations, and that bothers me.
- Size of replicated table: 7MB (total count of rows is about 100)
- tempdb I/O : ~30 MB/sec for write (log and data files)
- Number of subscribers: slightly over 100, each synchronizing every 30 minutes (more or less evenly).
- Retention period set to 14 days
Using SQL Server 2008 at Publisher, SQL Server 2005-2008R2 at Subscribers. All subscribers use Web Synchronization.
Additionally, synchronization at subscriber takes a lot of time, with multiple occurrences in replmerg.log
like these:
DatabaseReconciler, 2015/04/21 12:13:40.348, 3604, 25088, S2, INFO: [WEBSYNC_PROTOCOL] Sending client ReconcilerPhase WebSyncReconcilerPhase_RegularDownload DatabaseReconciler, 2015/04/21 12:13:47.063, 3604, 25194, S2, INFO: [WEBSYNC_PROTOCOL] Received server ReconcilerPhase WebSyncReconcilerPhase_LastRegularDownload
Tried setting @stream_blob_columns
on and off with no effect.
The question is: Is it a good idea to use merge replication to send these blobs to subscribers? We have other publications (though they have no BLOB columns) with a lot of data without tempdb problem. Is it an SQL Server flaw, or bad setup?
Publisher and Distributor are on the same instance, SQL Server 2008 SP4, cannot be sure about Subscribers, some of them maybe not up-to-date. Anyway, I can prepare a test environment with few subscribers having controlled versions, if it seems to help.
Confirmed, that excessive tempdb usage caused by execution of sys.sp_MSenumgenerations90
. Checked MSMerge_genhistory
table, found over 1.2 millions of records where pubid
is null.
Found this conversation with replication guru:
Executed sp_mergemetadataretentioncleanup
with no effect.
Found a remark on a case like this (too much rows in MSmerge_genhistory
) where deleting rows where pubid
is null and genstatus
=1 helped to fix replication.
Deleted rows where pubid
is null (implying that all Subscribers are synchronized, and which are not - will be reinitialized in "manual mode") and disk IO is back to normal again!
I have a feeling, that this situation could be caused by the fact, that all of my Subscribers are anonymous via WebSync and most of them have the same hostname. I'll try to check, if -hostname
key helps not to multiply records in MSmerge_genhistory
.