Intermittent Unknown error 258 with no obvious cause #1530
Comments
As we have seen before in issue #647, the underlying cause may come from several different problems, not all of which are the driver's fault. Any interruption in connectivity, a missed port, or a socket failure could lead to this issue. We cannot say more without more context about the application. The most helpful step would be a minimal repro, which is usually impossible to create. However, we can capture EventSource traces and see what has gone wrong. I would suggest closing this issue and following #647.
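For reference, a minimal in-process EventSource listener along these lines could look like the sketch below. This is not code from the thread; it assumes the documented source name Microsoft.Data.SqlClient.EventSource, and the level and keyword mask will need tuning to keep the volume manageable.

using System;
using System.Diagnostics.Tracing;

// Sketch: subscribe to SqlClient's EventSource and write events to the console (or your logger).
public sealed class SqlClientListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Assumed source name, per the Microsoft.Data.SqlClient event tracing documentation.
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            // EventKeywords.All is very verbose; narrow the keyword mask for production use.
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? string.Empty : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventName}: {payload}");
    }
}

// Usage: create one instance and keep it alive for the lifetime of the process, e.g.
// private static readonly SqlClientListener Listener = new SqlClientListener();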
@JRahnama happy to close the issue and add my information to that ticket, but I want to be absolutely sure that the ticket referenced (which I've also included in the original report) is not just for the

In the meantime, we'll hook up the EventSource you linked to gather some more information. Did not realise that was available!

EDIT: Any suggestions on which EventSource traces to enable?
@dantheman999301 if you use PerfView it will capture all events. Make sure you filter them by
@JRahnama unfortunately, due to the nature of this issue (it only shows up in our production environment, which is Linux Docker in AKS), I think PerfView is not going to work for us, as nice as it would be. We've managed to wrangle the
I can happily point you to PerfCollect for Unix machines, but I'm not sure how it works on Docker. It captures all events, and you can transfer the files to a Windows machine and investigate them.
@JRahnama the good news is we managed to get the logs (we think). The bad news is there are 200,000 of them. Unfortunately it didn't filter like I thought it would, and as it was happening during a busy period and we had all keywords on, there was a lot to collect. We need to work out how we're going to export them from Kibana, and obviously it's not as good as a full dump. PerfCollect might work, but we'll need to work out how to hook it up and run it over what is usually a period of an hour without impacting our production systems. We might also be able to use

Let me know if it's of any use.
We got the same issue 5 days ago, even though our code was working and hadn't changed in the days before.
@dantheman999301 sorry for the late response, we got busy with the preview release. Any kind of log that shows where it happens, or helps us understand it, would be helpful.
The same questions apply to your case as well. Any repro or tracing logs would be helpful. I would also suggest investigating network traces; that could clarify some of the underlying issues.
A 258 timeout is a common exception when the DTU limit is reached on Azure SQL. If you are on Azure, you can try monitoring the Max DTU percentage and see whether it hits the limit.
@oyvost In the issue we are facing, we can see that the connection is not even made to SQL Server, so it's not a throttling issue or similar. For example, we had an error about 20 minutes ago and the database was utilising 4% of the available DTUs.

@JRahnama just to get back to you, the logs we thought we had turned out not to be any good; there was a lot of duplication due to the way we wrote the event listener. The closest we've got to any answer is that we think it's timing out trying to get a connection from the connection pool, even though, from what we could tell from the event counters, there seemed to be plenty of available connections in the pool. We're not overly confident this is the cause, but it's something to go on. We did see this error once in one of the services that is prone to showing the other error I posted.
Hello,
@DLS201 can you post the stack trace of the exception? Have you checked the TCP and socket events/logs?
Hi,
The database is an Azure SQL instance; there are no errors on that side.
@DLS201 Hello! Any progress on this? I'm encountering the same issue, but on three different occasions. I've documented them as a question on Stack Overflow.
@mekk1t out of interest, what infrastructure have you got? We've only ever been able to narrow it down (we think) to a timeout waiting for a connection from the connection pool. That said, upping the connection pool size and the timeout seems to make no difference.
Hello, Regards,
@DLS201 any chance you have more information on this? Maybe a link to a bug or something? We are investigating this issue currently and would like to know what MSFT knows...
I am having the exact same issue with this setup: Microsoft.Data.SqlClient version: 4.1.0
@DLS201 do you have a scope? Does it happen for any .NET client code running anywhere in Azure? In my case I'm having tons of problems getting a .NET solution to run correctly in Spark for Synapse. I'm opening a support case as well, but you could save us all many, many days of time with CSS if you would point us in the right direction. Generally CSS seems to not be highly technical, and they seem just as far removed from the Azure product groups as I am!

Interestingly, I can run my Spark cluster locally, outside of Azure, and I have no issues connecting to the same SQL database over ExpressRoute. It is only when my .NET code also runs in Azure that I have trouble connecting to Azure SQL. This seems counter-intuitive. (Another odd data point: the exact same code of mine works fine when running on Azure Databricks VMs, which means it may be a problem related specifically to how Synapse interacts with Azure private resources.)

I suspect the problem is with the newer/proprietary layers that Microsoft has been putting on top of their VNETs these days (i.e. "managed vnets" and "managed private endpoints"). I have noticed other Azure services are extremely flaky as well (e.g. Power BI's managed-vnet gateway and ADF pipelines are pretty unpredictable; I have opened support tickets for both of these as well).
@DLS201 Please elaborate! We're seeing this issue from time to time too, in Azure AKS running against Azure SQL Elastic Pool DBs.
Would appreciate more details about the driver namespace + version + target framework in use, to help us identify, track, and unblock you on known issues.
You mean like:
OK, so going further down the troubleshooting rabbit hole and reading up on some things (especially in connection with AKS), it seems our issue was actually SNAT port exhaustion on the outbound load balancer for the AKS cluster. Diagnose port exhaustion: https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection

We have increased the ports per backend instance (to a ridiculously high number :) ), so we hope to never see this again.
Actually, it happened again, and this time without hitting the SNAT limit :(
Since you're able to reproduce the issue in AKS, does it also reproduce in Docker containers or a local Unix environment?
Came across this website while searching for a solution: could it be related to ThreadPool.MaxThreads? I have not tried to adjust MaxThreads myself yet.
Same here, this is my config... it still happens :( Why???
I'm having the same issue. Any solutions?
I encountered a similar issue where my application was running on Docker within a Red Hat VM. I faced several weeks of challenges while trying to resolve it. Eventually, I decided to switch to a Windows environment, and to my surprise, I encountered no further issues. I suspect there might have been a network issue between my Linux VM and the Windows SQL Server database.
I am having the same problem.
The table has 10 columns. It occurs in single-threaded console applications, even when accessed from a single instance.
@SharePointX have you tried Microsoft.Data.SqlClient? Does this happen constantly in your application? Can you share a sample repro?
I swapped to Microsoft.Data.SqlClient to check and still got the same error.
The other thing I found out is that the error occurs only when querying a table with a very large number of records in the same DB instance.
@SharePointX have you tried increasing the connection timeout, to make sure it isn't timing out while retrieving data from the server, and seeing how that affects the execution? I'd suggest retrieving data in smaller batches if it fails with large result sets, to find the right balance. There are some thread-starvation scenarios involving the thread pool and connection pool, which I believe don't apply to your case with a single-threaded app. I don't have any idea about warming up the database; you may want to ask the SQL Server dev team.
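For illustration, the batching suggestion might look roughly like the sketch below; the table, columns, connection string, and batch size are all hypothetical, and each batch gets its own command so that no single execution has to fit the entire result set inside the timeout.

using System;
using Microsoft.Data.SqlClient;

// Sketch: page through a large table in fixed-size batches so each command stays well inside its timeout.
const int batchSize = 10_000;
var offset = 0;

using var connection = new SqlConnection("Server=localhost;Integrated Security=True;TrustServerCertificate=True");
connection.Open();

while (true)
{
    using var command = new SqlCommand(
        "SELECT Id, Payload FROM dbo.BigTable ORDER BY Id OFFSET @offset ROWS FETCH NEXT @batch ROWS ONLY",
        connection);
    command.Parameters.AddWithValue("@offset", offset);
    command.Parameters.AddWithValue("@batch", batchSize);
    command.CommandTimeout = 60; // seconds per batch, not for the whole result set

    var rows = 0;
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            rows++;
            // process reader.GetInt32(0), reader.GetString(1), ...
        }
    }

    if (rows < batchSize) break; // last (partial) batch reached
    offset += batchSize;
}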
I revisited this problem and stumbled upon something intriguing while using EF Core 7.0. I discovered that both

Here's how I configured the database service:
When calling the stored procedure, I had to include the following code:
I initially assumed that
@oyvost You can set it in the connection string.
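As a sketch of that suggestion: Connect Timeout controls how long establishing the connection may take, while the Command Timeout keyword (assumed here to be available in Microsoft.Data.SqlClient 3.0 and later) sets the default execution timeout for commands created from the connection. Server and database names below are placeholders.

using Microsoft.Data.SqlClient;

var builder = new SqlConnectionStringBuilder
{
    DataSource = "localhost",          // placeholder server
    InitialCatalog = "ExampleDb",      // placeholder database
    IntegratedSecurity = true,
    ConnectTimeout = 30,               // seconds allowed to establish the connection
    CommandTimeout = 60                // assumed: default seconds per command (Microsoft.Data.SqlClient 3.0+)
};

using var connection = new SqlConnection(builder.ConnectionString);
connection.Open();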
Just want to mention that we experienced a similar issue and converting to Windows seems to have solved the problem.
I can't switch to Windows to solve this.
Would it be possible to hear from the SqlClient team as to whether any fixes in the latest release (5.1.2) may help with this issue? https://github.com/dotnet/SqlClient/blob/main/release-notes/5.1/5.1.2.md
If you have chronic problems or outages, please open a support ticket with Microsoft and share the outcome here. I think that type of information will allow us to proceed more productively.

I'm a "big-data" developer in Azure, and I have suffered from chronic socket exceptions in multiple Azure platforms, just as everyone else is describing. The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM). The problems are related to bugs in Microsoft's software-defined-networking components (SDN proxies like "private endpoints" or "managed private endpoints"). I will typically experience these SDN bugs in a large "wave" that impacts me for an hour and then goes away. The problems have been trending for the worse over the past year, and I've opened many tickets (Power BI, ADF, and Synapse Spark). The problems are not likely to happen on normal TCP (public internet ports), nor are they common on VNET traffic when "peering" is used. It seems to be a problem that is specific to the funky private endpoints in Azure. I would be interested to know if others can relate to this pattern. Also, the socket exceptions are not specific to SQL. They impact MSAL, ADLS, REST, etc. The underlying TCP exceptions are generally the same in all cases ("connection reset" or "connection timeout").

The Microsoft support cases I have already opened have not gone well. We are a small customer and Microsoft has not taken our problems seriously. I think they are already well aware of the bugs in their private VNET technology and are trying to keep them hidden. They undermine the severity of these problems by repeating the word "transient" over and over, like it was some sort of magical incantation that makes the bugs more acceptable. To me it seems totally unacceptable to have issues (transient or otherwise) that trigger dozens and dozens of socket exceptions a day.

I find that lots of Microsoft's own products are impacted by these network bugs, notably ADF pipelines. They have had a "VNET IR" that suffers from a massive number of socket exceptions when interacting with resources in the local network (East US). The support teams actually say they have a formalized "policy" requiring customers to implement retries on all pipeline activities (they suggest at least three retries with two-minute delays in between them). This "policy" is simply a way to avoid the well-known bugs. ADF deployed the VNET IR several years ago, and they have probably handled hundreds of CSS tickets about the fragile private endpoints in their managed vnet.

IMPORTANT: Based on a three-year support case with another team (not ADF), I do have bug numbers that apply to private endpoints, and these bugs are supposed to be fixed soon (not holding my breath). The following changes are for the "Azure Network stack":

Feature ending with x509: Support map space deletion in VFP without breaking flows. The publicized ETA is late 2024.
Bug ending with x810: PE-NC flow mix-up.

Sorry for the long post. I think it is easy for everyone to prove that these problems are specific to Azure and don't affect on-premise scenarios with SqlClient. The challenge is getting Microsoft to help with them in Azure. I get the sense that there is a substantial lack of honesty and transparency when it comes to networking bugs in Azure.
There was only one support scenario where Microsoft was willing to admit to the bugs I've experienced in their SDN technology. In every other Azure platform they are refusing to investigate or discuss the so-called "transient" network errors that cause TCP connections to break between resources that are in direct proximity to each other (0 ms latency). Here is a discussion of ADF's networking bugs. Hope this helps.

I would appreciate it if those impacted by the socket errors would at least confirm that their scenarios involve VNETs, e.g. private endpoints in particular, and possibly managed vnets. If you open a support case with Microsoft, please keep a count of the number of times they use the word "transient" (and they always say "communication failure" rather than socket exception).
We did this. It was punted around between teams with no resolution, so we closed it on our side because we were tired of spending time answering the same questions over and over. The workarounds posted here have helped but not eliminated the issue. We decided to stick with *nix App Service instances, although the time spent futzing around on this issue has cost us more than the savings from switching away from Windows instances.
@dbeavon We have a multi-tenant application running as a Docker container on an Azure VM, connecting to Azure SQL over a "private endpoint" (which presents itself as a NIC on our VNET). So yes, I think this is a good match. I have not tried deleting the private endpoint and going via the public IP; I will give that a shot. It's too painful to go through support; we are a small team and the level of time required to get this over the line appears too much.
@dazinator I certainly agree about the pain of support. But if Microsoft isn't working on these sorts of cases, then they will have less motivation to fix the underlying bugs. I doubt there is any telemetry captured for the so-called "transient" socket exceptions in their virtualized network technology. If only 1% of the impacted customers are making the effort to report these problems, then Microsoft probably thinks the problems aren't very severe. And if Microsoft won't prioritize or fix these network bugs, they will impact us forever.

Just as a case in point: the managed-vnet IR in ADF has been buggy for ~4 years. And the managed vnet gateway in Power BI has been impacted by the same private endpoint bugs for ~3 years (the only difference here is that the Power BI team wisely allowed their component to remain in a "preview" state rather than pushing ahead with a premature GA). I have had a case open with the PBI team for the entire time this component has been in preview.

The biggest pain for me has been with the Synapse platform. Hosting my Spark workloads on Synapse has been a real nightmare. I see dozens and dozens of socket exceptions each day, in every possible type of remote connection. However, this did NOT happen to me when my Spark workloads were hosted on Azure Databricks (using injected VNETs rather than private endpoints), nor do I see these socket exceptions when running a standalone Spark cluster on my own local hardware. It is pretty easy to distinguish the network environments that work fine from the ones that are bug-ridden.

There is no excuse for the unreliable nature of Microsoft VNETs. The TCP transport protocol is already designed to be fault-tolerant. Microsoft has really changed the whole equation when it comes to the reliability of TCP connections... It is especially annoying to think about the fact that these failed connections are taking place between servers that are sitting a hundred yards from each other (or less). I will have to find the Azure East US region on Google Earth to see how big it is. It is certainly not big enough to suffer from "random Internet glitches". I would have better network reliability running my Spark clusters on the WiFi at an airport than I would in Azure!
@dbeavon - I know how painful and impossible it can be to track down transient network errors. I bet the MS team does too, hence the run-around. Transient errors stink; I feel your pain. I do want to point out that this particular situation could or could NOT be related to network errors. We experienced that the

We'll all need to work together to get this one sorted.
It doesn't help that 258 seems to be used for a wide range of problems. Your query is slow? Unknown error 258. The network is congested? Unknown error 258. The DBA didn't like the way you looked at him? UNKNOWN ERROR 258.
We went the other way, from public IP to private endpoint. If my memory serves me, it was after some advice we received from Azure support when we asked them to investigate it. It did not help much.
It would be very easy to send a repro to one of these Azure teams, but they are always unwilling to pursue that option. I have a couple of tickets open at the moment where I offered to give the CSS engineers a repro so they could experience these problems for themselves, and they are totally uninterested in taking that path. (It is strange to behold a vendor that doesn't care about fixing their bugs. With on-prem software, vendors are typically VERY appreciative if you can provide a consistent repro that allows them to independently review a bug on demand.)

In my experience this problem isn't specific to SQL. In my Synapse Spark clusters I have chronic socket failures across a number of remotely accessed services in East US (MSAL, REST, SQL, etc.). I just finished up with a support case involving the ADF team, whereby they were experiencing a similar problem in their own component (the ADF "Linked Service Resolver"). Their component was very unreliable as it was making connections to MSAL from the managed vnet. This particular MS team (responsible for the ADF "Linked Service Resolver") says the solution is that every TCP socket opened in Azure needs to be wrapped in retry loops that are repeated five times with two-minute delays between them. I'm not sure about you folks, but that seems absurd. The TCP protocol is supposed to be resilient and fault tolerant, so that application developers don't need to re-invent the wheel and wrap a ton of custom retry logic around every remote TCP connection.

In the end, the ADF team "mitigated" their own component ("Linked Service Resolver") in some way, presumably by using a custom-written retry loop. Then they closed the case without giving any explanation for the elevated number of socket exceptions that were affecting us in the first place. I had hoped to learn from the experience and apply the information to the other types of failing sockets, but this ADF team really didn't give me anything useful as a take-away.

Aside from my chronic socket exceptions in ADF/Synapse, I also have frequent outages that hit my service health dashboard. The outages last thirty minutes at a time and come up on a monthly basis, and these outage notifications often use the phrase "transient communication failures" to explain the RCA. In other words, I think Microsoft product teams are suffering from the network problems themselves, even to the point of publicizing them when they become extreme. Hopefully they will eventually fix these problems for their own sake, if not for ours.
For those moving to PostgreSQL to mitigate these issues, what are you using for the migration from MS SQL to PostgreSQL?
Are you running into the Azure Load Balancer issue where, if the connection goes idle for more than 5 seconds, the connection is terminated? The normal SQL keep-alive is 30 seconds. Lowering it to 3 seconds avoids the issue by ensuring there is never a gap of 5 seconds without some sort of traffic, even if it is only a keep-alive packet. A cold SQL Server may be prone to periodic pauses in getting/returning data, whereas a warmed-up server should have the data in cache. This may explain the issues you are running into. You should be able to see this using a network trace.
The issue can be easily reproduced by running an ASP.NET Core Web API project in a Linux container and making multiple concurrent calls to an endpoint that queries a SQL Server database immediately after the service starts. The cost of opening each new thread on Linux appears to be very high, and making 30 concurrent queries with no available threads in the thread pool can take more than 60 seconds. If you do the same test running the service on Windows, it works perfectly. The only workaround for us to get this to work on Linux is to set ThreadPool.MinThreads to a high number.
@asierpn @dazinator @MichelZ
It depends on your application; you have to test with different values to find the best one for you. We noticed an improvement when we set it to 20, but eventually we increased it to 100.
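A minimal sketch of that workaround, run once at startup; the value 100 is illustrative only and needs tuning per workload (the minimums can also be set through runtime configuration rather than code).

using System;
using System.Threading;

ThreadPool.GetMinThreads(out var workerThreads, out var completionPortThreads);
Console.WriteLine($"Current minimums: {workerThreads} worker, {completionPortThreads} IOCP");

// Only ever raise the minimums, never lower them.
ThreadPool.SetMinThreads(Math.Max(workerThreads, 100), Math.Max(completionPortThreads, 100));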
@Malcolm-Stewart I'll try to answer some of your questions; I'm going from memory here as I raised this issue some time ago.
I don't believe we were going through an Azure Load Balancer for this. Given what others in this thread have suggested, I don't think this would be what I was seeing, but it might be what others have experienced (see also the port exhaustion suggestions from early on).
I can say with some confidence this was not the issue. These errors happened sporadically, on hot or cold servers.
Certainly would be nice, but the issue we always had was that these were running in containers in K8s and the errors were not easily reproduced. We'd have to leave network traces running for hours or days at a time to try to catch one in the act. @asierpn above recently said this can be fairly easily reproduced just by running in a Linux container and querying (I assume?) any SQL Server instance just as the service is starting.

With that said, we witnessed this on both busy and fairly dormant services that had been up and running for days at a time, so it might be the same error message with a different root cause; very hard to say, though. If there is anything we (the community) could do here to help nail this down, I'm all ears. Happy to have something running on a machine over a weekend to try to catch it along with some tracing, but I'm not sure of the best way to set this up.
Just want to clarify one misconception.
While the cost of creating new threads is higher on Linux than on Windows, it's not that cost that slows things down. The system thread pool implementation in .NET is designed to spin up new threads slowly, in a very conservative manner. If a library or application is spawning many tasks that hold threads for any length of time, it will take a while for the system thread pool to add threads to service the waiting queue of tasks.
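To illustrate that ramp-up behaviour, the self-contained sketch below saturates the pool with blocking work and prints when each task actually starts; tasks beyond the current minimum typically begin only as the pool slowly injects new threads. The numbers are illustrative.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

ThreadPool.GetMinThreads(out var minWorker, out _);
Console.WriteLine($"Minimum worker threads: {minWorker}");

var sw = Stopwatch.StartNew();
var tasks = new Task[minWorker * 4];
for (var i = 0; i < tasks.Length; i++)
{
    var n = i;
    tasks[i] = Task.Run(() =>
    {
        Console.WriteLine($"Task {n} started after {sw.ElapsedMilliseconds} ms");
        Thread.Sleep(2000); // simulate blocking I/O (or sync-over-async) holding a pool thread
    });
}
Task.WaitAll(tasks);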
In our project, we have also hit this issue quite often recently. After weeks of debugging, I was able to deterministically reproduce it. I am not quite sure whether this is the cause of any of the incidents mentioned above, though. In our case the issue occurs with Entity Framework when the following criteria are met:
void Main()
{
const string longRunningQuery =
"""
WITH TenRows AS (
SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
),
ThousandRows AS (
SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
)
SELECT A.A
FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
""";
var context = new TestContext();
context.Database.SetCommandTimeout(1);
var values = context.Database.SqlQueryRaw<int>(longRunningQuery);
foreach (var value in values)
{
break;
}
}
// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
}
}

The reason is that the underlying data reader is being disposed (closed) when the enumerator is disposed. This does not cancel the underlying db command. See also: https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.sqlclient.sqldatareader.close?view=sqlclient-dotnet-standard-5.2#remarks

Interestingly, querying the data to completion may take longer than the configured command timeout in total. But as soon as the reader is closed, the timeout seems to kick in. In the example above, the instruction pointer will "hang" on the closing foreach bracket for the configured amount of time.

The code above can be rewritten to something similar to:

const string longRunningQuery =
"""
WITH TenRows AS (
SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
),
ThousandRows AS (
SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
)
SELECT A.A
FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
""";
var connection = new SqlConnection("Server=localhost; Integrated Security=True");
var command = connection.CreateCommand();
command.CommandText = longRunningQuery;
command.CommandTimeout = 1; // seconds
connection.Open();
var reader = command.ExecuteReader();
// Closing/disposing the reader will drain all pending records. If the time this takes,
// starting from the close attempt, exceeds the command timeout, error 258 will be thrown.
// If the associated command would have been canceled, the code would work fine:
//
//command.Cancel();
//
// Remarks: If we read the reader from user code regularly, the timeout won't have any
// effect. At least when the first record is being delivered in time:
//
//while (reader.Read());
reader.Close();

Thus, in my opinion, the issue is related to dotnet/efcore#24857. Although the underlying issue is ADO.NET related, OR mappers like EF Core IMO must address this somehow, as EF Core lacks an API to actually cancel the query.

Our scenarios, for example, are a little more complicated. We attempt to stream data out of the database over HTTP to a client while performing some transformation. When this kind of transformation fails for some reason, it leads to the error and even hides the original exception due to the underlying finally block:

async Task Main()
{
await StreamToClient(CancellationToken.None);
}
async Task StreamToClient(CancellationToken cancellationToken)
{
const string longRunningQuery =
"""
WITH TenRows AS (
SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
),
ThousandRows AS (
SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
)
SELECT A.A
FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
""";
var context = new TestContext();
context.Database.SetCommandTimeout(1);
var records = context.Database.SqlQueryRaw<int>(longRunningQuery).AsAsyncEnumerable();
await foreach (var record in records.WithCancellation(cancellationToken))
{
var transformedRecord = await Transform(record, cancellationToken);
// Send to client
}
}
async Task<string> Transform(int record, CancellationToken cancellationToken)
{
await Task.Yield();
throw new InvalidOperationException();
}
// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
}
}

Even cancellation triggers the error, although it is internally bound to cancelling the db command eventually:

async Task Main()
{
using var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(2));
await StreamToClient(cts.Token);
}
async Task StreamToClient(CancellationToken cancellationToken)
{
const string longRunningQuery =
"""
WITH TenRows AS (
SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
),
ThousandRows AS (
SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
)
SELECT A.A
FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
""";
var context = new TestContext();
context.Database.SetCommandTimeout(1);
var records = context.Database.SqlQueryRaw<int>(longRunningQuery).AsAsyncEnumerable();
await foreach (var record in records.WithCancellation(cancellationToken))
{
var transformedRecord = await Transform(record, cancellationToken);
// Send to client
}
}
async Task<string> Transform(int record, CancellationToken cancellationToken)
{
await Task.Yield();
return record.ToString();
}
// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
}
}

If we add an interceptor that cancels the command on reader close, the issue goes away immediately:

void Main()
{
const string longRunningQuery =
"""
WITH TenRows AS (
SELECT Value FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) TenRows (Value)
),
ThousandRows AS (
SELECT A.Value A, B.Value B, C.Value C from TenRows as A, TenRows as B, TenRows as C
)
SELECT A.A
FROM ThousandRows A, ThousandRows as B, ThousandRows as C;
""";
var context = new TestContext();
context.Database.SetCommandTimeout(1);
var values = context.Database.SqlQueryRaw<int>(longRunningQuery);
foreach (var value in values)
{
break;
}
}
// You can define other methods, fields, classes and namespaces here
public class TestContext : DbContext
{
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder.UseSqlServer("Server=localhost; Integrated Security=True");
optionsBuilder.AddInterceptors(new CommandCancelingInterceptor());
}
}
public class CommandCancelingInterceptor : DbCommandInterceptor
{
public override InterceptionResult DataReaderClosing(DbCommand command, DataReaderClosingEventData eventData, InterceptionResult result)
{
command.Cancel();
return base.DataReaderClosing(command, eventData, result);
}
}

I saw some people using Dapper. We're using Dapper in our project as well; however, I have not yet attempted to reproduce the issue with Dapper. I am pretty sure, though, that any OR mapper that behaves in the same way is affected.

I hope this helps some of you, and I'd be interested to know whether the other things you tried correlate with our observation. We also tried scaling up the database and adjusting some other knobs, and it helped. But I think that was because the query we were executing always "scratched" at the default command timeout of 30 seconds, exceeding it only occasionally, and scaling up the database led to reader disposal finishing in time.
@DerPeit glad you got to the bottom of your variant of the issue.
On this point, I think what you saw and what we saw were different issues. We never saw these queries make their way to the database, or at least not that we could find. The queries themselves would not have been long-running (e.g. selecting a single record by ID) and they could happen when the DB was using less than 10% of its DTUs.
The team should look into your comment; we got the 258 error when calling
Describe the bug
On occasions we will see the following error
However, SQL Server shows no long-running queries and is not using a lot of its resources during the periods when this happens.
It looks to be more of an intermittent connection issue but we're unable to find any sort of root cause.
To reproduce
We're not sure of the reproduction steps. I've been unable to reproduce this myself by simulating load. From what we can tell, this is more likely to happen when the pod is busy (not just through HTTP, but also handling events from an external source), but equally it can happen randomly when nothing much is happening on the pod, which has caused us a substantial amount of confusion.
Expected behavior
Either more information on what the cause might be, or some solution to the issue. I realise the driver might not actually know the cause, and from its point of view it may really be a timeout. We're not entirely sure where the problem lies yet, which is the biggest issue.
Further technical details
Microsoft.Data.SqlClient version: 3.0.1
.NET target: Core 3.1
SQL Server version: Microsoft SQL Azure (RTM) - 12.0.2000.8
Operating system: Docker Container - mcr.microsoft.com/dotnet/aspnet:3.1
Additional context
We don't always see a TimeoutEvent in the metrics that are collected from the pool. On the occasions when we do get them, the error_state will be different, e.g. 145. We don't know what this means and can find no information on what these values relate to. I've raised a ticket with the Azure Docs team to look at this. I'll add more to this when they happen, as we've not been keeping track of the error_state codes and we're not sure if they're even relevant. The stack trace of the exception mentions ReadSniSyncOverAsync.