We have a stateful service that is unable to fully "go green": its partitions keep reshuffling, and we see no indication of errors in our own logs. After a lot of digging we found something suspicious in the event logs on one of the VMs (pasted below).
Application: (hidden).exe
Framework Version: v4.0.30319
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: ((copyMode & CopyMode.FalseProgress) == 0) || (sourceStartingLsn < targetStartingLsn). Source starting lsn : 2018, target starting lsn :2018
Stack:
   at System.Environment.FailFast(System.String)
   at Microsoft.ServiceFabric.Replicator.Utility.CodingError(System.String, System.Object[])
   at Microsoft.ServiceFabric.Replicator.Utility.Assert(Boolean, System.String, ...)
   at Microsoft.ServiceFabric.Replicator.LoggingReplicator.GetLogRecordsToCopy(Microsoft.ServiceFabric.Replicator.ProgressVector, System.Fabric.Epoch, Microsoft.ServiceFabric.Replicator.LogicalSequenceNumber, Microsoft.ServiceFabric.Replicator.LogicalSequenceNumber, Int64, Int64, Microsoft.ServiceFabric.Replicator.LogicalSequenceNumber ByRef, Microsoft.ServiceFabric.Replicator.LogicalSequenceNumber ByRef, Microsoft.ServiceFabric.Data.IAsyncEnumerator`1 ByRef, Microsoft.ServiceFabric.Replicator.BeginCheckpointLogRecord ByRef)
   at Microsoft.ServiceFabric.Replicator.LoggingReplicatorCopyStream+d__3.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Microsoft.ServiceFabric.Replicator.LoggingReplicatorCopyStream+d__3, Microsoft.ServiceFabric.Data.Impl, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]](d__3 ByRef)
   at Microsoft.ServiceFabric.Replicator.LoggingReplicatorCopyStream.GetNextAsync(System.Threading.CancellationToken)
   at System.Fabric.StateProviderBroker+AsyncEnumerateOperationDataBroker.b__8(System.Threading.CancellationToken)
   at System.Fabric.Interop.Utility.WrapNativeAsyncMethodImplementation(System.Func`2, IFabricAsyncOperationCallback, System.String, System.Fabric.Interop.InteropApi)
We are not sure what to make of this. It seems related to state replication, but we don't think we've changed anything related to the state of the service. Since the service is exiting with FailFast, we don't get a chance to do anything in our code to remedy this, so we are basically stuck in this loop right now (luckily in a non-live environment, but still...).
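The only thing we can read directly from the assertion message is its boolean condition: it can only fail when the FalseProgress bit is set in copyMode and the source starting LSN is not strictly below the target starting LSN. Here is a minimal sketch of that reading with the values from our log. The flag value and the function name are our own placeholders, not Service Fabric's actual definitions, and this says nothing about why the replicator got into that state.

```python
# Hypothetical stand-in for CopyMode.FalseProgress; the real flag value is not
# shown in the event log, so 0x4 here is purely illustrative.
FALSE_PROGRESS = 0x4


def copy_assert_holds(copy_mode: int, source_starting_lsn: int, target_starting_lsn: int) -> bool:
    """Mirror the boolean condition quoted in the FailFast message."""
    return (copy_mode & FALSE_PROGRESS) == 0 or source_starting_lsn < target_starting_lsn


# Values from our event log: both starting LSNs are 2018.
print(copy_assert_holds(FALSE_PROGRESS, 2018, 2018))  # False: the assert fires
print(copy_assert_holds(0, 2018, 2018))               # True: no FalseProgress flag, no failure
```

So with equal LSNs (2018 and 2018) the strict `<` comparison is false, and the assertion trips as soon as the copy is flagged as false progress.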
Does anyone have any idea what this is related to specifically and how we can recover the service and the data?