Increase in lost lock after upgrading to .NET 7, EF Core 7 and Microsoft.Data.SqlClient 5.0.1 #160

Closed
OskarKlintrot opened this issue Jan 17, 2023 · 21 comments

Comments

@OskarKlintrot

We've seen a dramatic increase in the number of lost locks since we upgraded to .NET 7 and EF Core 7. EF Core 7 uses Microsoft.Data.SqlClient (>= 5.0.1), but EF Core 6 used Microsoft.Data.SqlClient (>= 2.1.4) (DistributedLock.SqlServer has a dependency on Microsoft.Data.SqlClient (>= 2.1.1)).

We started upgrading our services before Christmas, and around the same time the number of lost locks started to increase from a few per week to 10-50 per day.

[Graph: lost locks over time, showing the increase after the upgrade]

The only change we have made that we figure might have an impact is upgrading from .NET 6 to .NET 7 and from Microsoft.Data.SqlClient 4.1.0 to 5.0.1.

@hoerup

hoerup commented Jan 17, 2023

Have you tried Microsoft.Data.SqlClient 5.1.0-preview2?

https://github.com/dotnet/SqlClient/releases/tag/v5.1.0-preview2

@OskarKlintrot
Author

No, I haven't. For some reason we don't lose locks in our test env, only in our acceptance test env (and production, which is where the graph is from). I'll deploy the two worst offenders with the preview version to the acceptance test env, and we'll see in a few days if it makes any difference.

@madelson
Owner

Can you share more detail?

  • Confirm that you are using the latest DistributedLock.SqlServer
  • How are you creating the lock?
  • How are you acquiring the lock?
  • How are you detecting lost locks? HandleLostToken I assume?
  • What exceptions are you seeing?

@OskarKlintrot
Author

Confirm that you are using the latest DistributedLock.SqlServer
Yes, version 1.0.2.

How are you creating the lock?
How are you acquiring the lock?

private async Task<IDistributedSynchronizationHandle?> GetHandleAsync(CancellationToken cancellationToken)
{
    using var _ = _logger.BeginScope(_state);

    while (!cancellationToken.IsCancellationRequested)
    {
        var failure = true;

        try
        {
            _logger.LogInformation("Trying to acquire lock for {Job Name}", _jobName);

            var handle = await _distributedLockProvider.AcquireLockAsync(
                _lockName,
                timeout: null,
                cancellationToken: cancellationToken);

            failure = false;

            return handle;
        }
        // Multiple catch clauses
        finally
        {
            if (failure)
            {
                await Delay.TryWait(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
    }

    return null;
}

How are you detecting lost locks? HandleLostToken I assume?
That's correct:

cancellationTokenRegistration = handle.HandleLostToken.Register(() =>
{
    // HandleLostToken.Register is too slow to use for anything other than logging.
    _logger.LogError("Lost lock for job {Job Name}.", _jobName);

    _telemetryClient.TrackEvent(new($"Lost lock for {_jobName}")
    {
        Properties =
        {
            { "Job Name", _jobname },
            { "Lock State", "Lost" }
        },
    });
});

What exceptions are you seeing?
I can't find any logged exceptions. I don't think we are suppressing any. I think we might be, actually, doh... Once you've got the handle, are there any exceptions being thrown?

I made this repo for a previous bug but that's pretty much still how we are using the library.

@OskarKlintrot
Author

I noticed something really interesting/weird. Two thirds of the losses for the two services that lose the most locks occur around midnight. In one env it's 23:45-00:00 and in another it's 00:00-00:05. I have no clue whether it's related to this issue, though; probably not. I wonder if Azure is doing something around midnight...

@madelson
Owner

madelson commented Jan 18, 2023

Once you got the handle, is there any exceptions being thrown?

If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?

In one env it's 23:45-00:00 and in another it's 00:00-00:05

Are you able to repro this at all locally? I assume not since you said just in acceptance test...

How are you detecting lost locks? HandleLostToken I assume?
That's correct:

Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.

@OskarKlintrot
Author

If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?

I made a mistake and didn't realize that there could be an exception thrown from there. I accidentally suppress any exceptions thrown during dispose. I'll have to fix that and deploy the services.
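
A minimal sketch of what that fix might look like, assuming the same _logger/_jobName fields as in the snippets above. Since disposing the handle runs the release sproc, a broken connection should surface as an exception here:

// Sketch: log (rather than suppress) exceptions thrown while releasing the lock.
try
{
    await handle.DisposeAsync();
}
catch (Exception ex)
{
    // Previously swallowed; now recorded before deciding whether to rethrow.
    _logger.LogError(ex, "Failed to release lock for job {Job Name}.", _jobName);
    throw;
}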

Are you able to repro this at all locally? I assume not since you said just in acceptance test...

Unfortunately, no. Maybe if I leave a job running for a few days on a computer, but I'm hoping I can get some exceptions that make sense from production instead.

Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.

Sure, I think I could just use .WasLost() instead so I can still log when the lock has been lost. I'll start with logging the exceptions to get a picture of what is happening before I change anything.
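
(For context, WasLost() is a small helper from the repro repo rather than a DistributedLock API. A rough sketch of what such an extension might look like is below; note that merely reading HandleLostToken may itself start handle monitoring, so this might not fully take monitoring out of the picture.)

using Medallion.Threading;

// Hypothetical WasLost-style helper (not part of the library's API): checks whether
// the handle's lost token has already fired, without registering a callback.
public static class DistributedLockHandleExtensions
{
    public static bool WasLost(this IDistributedSynchronizationHandle handle)
        => handle.HandleLostToken.IsCancellationRequested;
}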

@madelson
Owner

madelson commented Jan 18, 2023

Sounds good, @OskarKlintrot. FWIW I ran the unit test suite with the 5.0.1 upgrade and everything passes, so no luck there.

Another thing that would be interesting to test on your end is to try different versions of Microsoft.Data.SqlClient and see which one introduced the behavior.

There are a lot of different suggestions flying around here so I thought it might make sense to track with a list:

  • Try latest prerelease version of Microsoft.Data.SqlClient
  • Try different versions of Microsoft.Data.SqlClient to see which one causes the issue (see the csproj sketch below)
  • Log exceptions from Dispose() and report back
  • Try removing HandleLostToken subscription (but still logging Dispose exceptions to get a delayed sense of lock losses)
  • Try to create local repro (perhaps based on this one).
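
For the version experiments, a rough sketch of how a specific client version could be pinned (versions shown are illustrative): a direct PackageReference overrides the version that EF Core brings in transitively, though NuGet won't allow downgrading below the 5.0.1 floor required by EF Core 7.

<!-- Sketch: pin Microsoft.Data.SqlClient explicitly for testing; the direct
     reference overrides the transitive version from EF Core. Versions are illustrative. -->
<ItemGroup>
  <PackageReference Include="Microsoft.EntityFrameworkCore.SqlServer" Version="7.0.2" />
  <PackageReference Include="Microsoft.Data.SqlClient" Version="5.0.1" />
</ItemGroup>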

@hoerup

hoerup commented Jan 20, 2023

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7.
We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

@OskarKlintrot
Author

  • Try latest prerelease version of Microsoft.Data.SqlClient

I let it run for two days and on day 2 the behaviour came back. So no luck there, I would say.

  • Try different versions of Microsoft.Data.SqlClient to see which one causes the issue

Since I can't go any lower than the version required by EF Core 7, I don't think this is feasible, unfortunately.

  • Log exceptions from Dispose() and report back

There don't seem to be any exceptions from Dispose(). On the same instance that lost its locks, and at the same time, I now see this exception:

BeginExecuteNonQuery requires an open and available Connection. The connection's current state is closed.

[Screenshot of the logged exception]

Stack trace:

System.InvalidOperationException:
   at Microsoft.Data.SqlClient.SqlCommand.ValidateCommand (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.DatabaseCommand+<ExecuteAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/DatabaseCommand.cs:151)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter.GetResult (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.SqlServer.SqlApplicationLock+<ExecuteReleaseCommandAsync>d__16.MoveNext (DistributedLock.SqlServer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.SqlServer/SqlApplicationLock.cs:71)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:159)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:173)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at My code+<ExecuteAsync>d__9.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/Lock/NonConcurrent/NonConcurrentLock.cs:115)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at My Code+<ExecuteContinuously>d__15.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/CronJob.cs:324)
  • Try removing HandleLostToken subscription (but still logging Dispose exceptions to get a delayed sense of lock losses)

Do you still think this is worth doing? (see below)

  • Try to create local repro (perhaps based on this one).

On hold (see below)


Given that the exceptions point to some hiccup with the DB connection (there are some others as well, from another occasion, all related to DB connection issues; Azure admits the DB became unavailable at that time), I doubt that this issue has anything to do with DistributedLock; it's rather Microsoft.Data.SqlClient and Azure SQL Server. Do you (@hoerup and @madelson) agree? If so, I think we can close this issue. I can always re-open it if I find something new.

@OskarKlintrot
Author

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

@hoerup Oh, that's very interesting! Are you also using Azure SQL, perhaps? When I upgraded to the latest preview everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does, we will probably also upgrade to 5.1.0.

@OskarKlintrot
Author

OskarKlintrot commented Jan 20, 2023

This issue seems somewhat similar to what I am seeing; not sure if it is related, though. Is there any retry logic in DistributedLock.SqlServer, btw?

@hoerup

hoerup commented Jan 20, 2023

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

@hoerup Oh, that's very interesting! Are you also using Azure SQL, perhaps? When I upgraded to the latest preview everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does, we will probably also upgrade to 5.1.0.

I might have been too fast there - just saw some connection errors again :(

We are using SQL Server on-prem.

@OskarKlintrot
Author

I might have been too fast there - just saw some connection errors again :(

Too bad!

We are using SQL Server on-prem.

Sounds like the issue is with Microsoft.Data.SqlClient then. Are you seeing the issues from places other than DistributedLock.SqlServer as well? We do; we see them from EF as well.

@madelson
Owner

Is there any retry-logic in DistributedLock.SqlServer btw?

There isn't any retry logic. In particular, once we've grabbed the lock, if the connection dies we can't really retry at that point: even if we reacquired the lock, someone else could still have held it during the blip, so we want to report it as a lost handle.
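
To illustrate the intended division of responsibility, a rough sketch of application-level handling (field names follow the earlier snippets; RunJobAsync and stoppingToken are hypothetical): the library only reports the loss, and the caller decides whether to stop the protected work and go back to acquiring.

// Sketch: the library does not retry; the application stops the protected work
// when the handle is lost and loops back to acquisition.
while (!stoppingToken.IsCancellationRequested)
{
    await using var handle = await _distributedLockProvider.AcquireLockAsync(
        _lockName, timeout: null, cancellationToken: stoppingToken);

    // Cancel the work if either the host is stopping or the lock is lost.
    using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
        stoppingToken, handle.HandleLostToken);

    try
    {
        await RunJobAsync(linkedCts.Token); // hypothetical job body
    }
    catch (OperationCanceledException) when (handle.HandleLostToken.IsCancellationRequested)
    {
        _logger.LogWarning("Lost lock for job {Job Name}; re-acquiring.", _jobName);
        // Loop back around to AcquireLockAsync.
    }
}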

I doubt that this issue has anything to do with DistributedLock but rather Microsoft.Data.SqlClient and Azure SQL Server

we see them from EF as well.

@OskarKlintrot That does seem likely if you're encountering a spike across uses of Microsoft.Data.SqlClient and not just in DistributedLock.

As a next step, I think it makes sense to report an issue on the Microsoft.Data.SqlClient GitHub and link it here. Do you agree? Then we can close this for now while that plays out. The issue you already linked does seem related, but it was closed without action. If we were able to pin down the appearance of the problem to a specific release of the client, that might help it gain traction, as would creating a standalone repro that can run against Azure.

@madelson
Owner

@OskarKlintrot any updates here? Did you end up filing an issue with Microsoft?

@OskarKlintrot
Author

Still working on it, got sidetracked with a man cold :)

@madelson
Owner

madelson commented Feb 1, 2023

@OskarKlintrot sorry to hear that; hope you feel better soon :-)

@OskarKlintrot
Author

Sorry, got buried in other work. Closing this for now and tracking the issue here. I might re-open this issue if the Microsoft.Data.SqlClient team doesn't think it has anything to do with them.

OskarKlintrot closed this as not planned on Feb 24, 2023
@madelson
Owner

Thanks!

@JRahnama

JRahnama commented Mar 3, 2023

@hoerup will you be able to provide a minimal sample that could repro the issue?
