After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases #2400
Comments
Please remove label "sqlite" - it was added by mistake, but I cannot change it anymore to "ms sqlclient" or similar. |
@JiriZidek Can you capture some EventSource logs? Also, can you confirm that the issue does not happen after reverting back to .NET 7? |
I recommend you contact Azure support and have the networking team look into what is going on during those periods of failures. |
I am afraid there are no "periods". The problem appears randomly, but always after restarting the container. So from this point of view it is reproducible. |
Yes - we tested the old version, which differs only in being .NET 7 with all the NuGet modules from December (EF, SqlClient, image aspnet:7.0), and when restarted it does not yield this error. |
I realized one important point - I run the very same code as an Azure service, using a Linux plan and .NET 8 - and in that case there is no such problem with SqlClient. |
This confirms that your best option would be contacting Azure support. |
But how does that square with .NET 7 working vs. .NET 8 not working, in the same place, the same AKS cluster? I can put two pods, one old and one new, in parallel to decide whether it is likely a network problem. I'll let you know. I'd like to avoid being ping-ponged back and forth by Azure support. |
Just a note to confirm we're seeing the exact same issue in almost an identical setup:
Exception message:
Note that the error occurs incidentally (less than 0.5% of the calls), and we're seeing the exceptions since the day that the .NET upgrade was deployed. |
So positively confirmed. AKS, same K8s namespace, one pod with .NET 8 and one with .NET 7. So for me this is NOT an Azure problem. It is a problem of .NET 8, or a problem of Debian 12 (which is the base of the image). But since I have observed the same bad behavior on Ubuntu-based images, I bet on .NET 8 or Microsoft.Data.SqlClient. I guess there is something wrong with the TLS handshake. |
I'm having the same issue, there are no concurrent connections to the DB and no async operations happening in parallel on the same context.
|
We experienced the same SqlException (0x80131904), however in a slightly different environment, and I believe I understand the source of this exception, at least in our environment.
TLDR

This exception occurs when there is a proxy between the client and the SQL Server AG Listener, and the proxy accepts TCP connections for both the primary and secondary SQL Server instances. The mechanism SqlClient uses to determine which host is primary is to open a Socket to all IPs returned by the DNS lookup; the first socket to open is considered the primary and the other connections are disposed (see ParallelConnectAsync and ParallelConnectHelper in SNITcpHandle.cs). Under normal circumstances, without the proxy, the secondary instance is not listening on <listener-ip>:1433 and therefore the client cannot open a socket. But when a proxy is between the client and server, the client can open the socket and even perform the TCP handshake with the proxy for both the primary and secondary SQL Server instances. If the socket is opened to the secondary first, then as soon as any data is sent this exception is thrown.

Environment

pod.yaml
service.yaml
Description

The first issue with this is that Istio will create a Virtual IP (VIP) for the ServiceEntry, e.g. 240.240.240.10, and any DNS queries performed within the pod for the listener name will resolve to this VIP. In addition to the VIP, the Istio proxy sidecar will do DNS resolution on that hostname, put both the primary and secondary IPs into a pool, and perform load balancing between all IPs in the pool. This means SqlClient connections will sometimes get sent to the secondary instance.

Disable the VIP and set the Service resolution to NONE

With the VIP disabled, the pod's DNS queries will now resolve to the 2 SQL Listener IPs, and with the ServiceEntry resolution set to NONE, connections will be routed/forwarded to the IP requested. This however does not solve the root of the issue. Istio Proxy is listening and accepting connections on 1433. SqlClient attempts to open 2 connections to the 2 IPs, and because Istio is listening, both sockets will succeed. If the primary is opened first then all is good. If the secondary is opened first then it will perform the TCP handshake and then the TLS / TDS handshake, which is where the exception is thrown.

Bypass Istio
This will cause any traffic on port 1433 to completely bypass Istio, and traffic will go directly to SQL Server without issue.

Possible Fix

In SqlClient, delay the decision about which IP/host is considered primary until after the TLS / TDS handshake is attempted, instead of using Socket.Connect(), and perform the handshake in parallel for all IPs. If a socket is opened to the secondary listener IP it will fail at the handshake; otherwise the Socket.Connect() will time out, and in the meantime the primary handshake can proceed, in which case the rest of the connections are discarded.
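To make the idea concrete, here is a rough sketch of that selection logic (not SqlClient internals; TryTdsHandshakeAsync is a hypothetical placeholder for the pre-login/TLS probe):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

internal static class PrimarySelection
{
    // Sketch only: race all candidate IPs, but let a completed handshake decide the
    // winner instead of the first successful Socket.Connect(). A proxy that merely
    // accepts TCP (Istio/Envoy/nginx stream) fails the handshake and drops out.
    public static async Task<Socket?> ConnectToPrimaryAsync(
        IEnumerable<IPAddress> candidates,
        int port,
        Func<Socket, CancellationToken, Task<bool>> tryTdsHandshakeAsync, // hypothetical pre-login/TLS probe
        CancellationToken cancellationToken)
    {
        var attempts = candidates.Select(async address =>
        {
            var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
            try
            {
                await socket.ConnectAsync(new IPEndPoint(address, port), cancellationToken);
                if (await tryTdsHandshakeAsync(socket, cancellationToken))
                {
                    return socket;
                }
            }
            catch
            {
                // A failed connect or handshake simply removes this candidate from the race.
            }
            socket.Dispose();
            return (Socket?)null;
        }).ToList();

        while (attempts.Count > 0)
        {
            var finished = await Task.WhenAny(attempts);
            attempts.Remove(finished);
            var winner = await finished;
            if (winner is not null)
            {
                return winner; // first candidate to pass the handshake; real code would also cancel and dispose the rest
            }
        }
        return null; // nobody completed the handshake
    }
}
```

In the proxy scenario above, the secondary's socket still opens, but its handshake fails, so the primary always wins the race. |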
👍 In my recent experiments this makes the app almost unusable with Azure SQL. I planned to use this in a POC project. |
The only reason I mentioned Pooling=False is because it is useful if anyone attempts to re-create the issue and wants to reliably reproduce it. It is unlikely you would want Pooling=False in a real-world application.
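For anyone attempting that, a minimal repro loop along these lines (connection string adapted from the one in the issue, credentials elided) forces a fresh physical connection and full login on every iteration, which is where the error shows up:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Pooling=False means every Open() performs the complete TCP + TLS + TDS login,
// so intermittent pre-login failures surface quickly instead of hiding behind the pool.
const string connectionString =
    "Server=tcp:SOME-sql.database.windows.net,1433;Initial Catalog=MyDB;" +
    "User ID=some-admin;Password=...;Encrypt=True;Connection Timeout=30;Pooling=False;";

for (var i = 0; i < 1000; i++)
{
    try
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand("SELECT 1", connection);
        await command.ExecuteScalarAsync();
    }
    catch (SqlException ex)
    {
        Console.WriteLine($"Attempt {i}: error {ex.Number} - {ex.Message}");
    }

    await Task.Delay(TimeSpan.FromSeconds(1)); // brief idle period between attempts
}
```

With pooling left on, the same failure tends to show up only after an idle period, when the pool has to open a new physical connection - which matches the "after a period of inactivity" pattern described in the issue. |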
I am also facing the same issue.
|
Ignore my comment from before, the issue I am facing is not related to this issue. |
Checking in, has someone tested with MDS 5.2.1? |
Yes - no change, the same problem. |
@kf-gonzalez2 @JRahnama could you please help prioritize the issue investigations? Thanks. |
Just some notes after a few months of living with this bug:
|
@JiriZidek Can you share some EventSource trace logs please? |
@JiriZidek Can you compare the OpenSSL versions in the Azure Service (Linux) one and the Debian one? |
We noticed that the TLS issue is resolved via a resource bump, but this error 35 still occurs: |
.NET 8, 5.2.2 client, completely impossible to do a long-running query due to this :( Settings in the conn string:
|
Update for this - we are using a proxy solution and it seems that this is somehow related to TCP forwarding. When we use a direct connection to the server, the error does not occur. |
Some proxies (such as Envoy and Istio) accept the TCP connection from the client themselves, then try to open their own TCP connection to the container/server being published when the client tries to read or write data. At present, MultiSubnetFailover picks the first socket to successfully establish a TCP connection, which can clash with the TCP proxy's behaviour if one of the backend servers/containers is unavailable. A DNS name which resolves to the IP addresses of all required servers should work (although each name may resolve to a maximum of 64 addresses.) Out of interest: have you encountered a specific use case for putting SQL Server behind a TCP-level proxy, or was it simply used because it's the default connectivity method for inter-container traffic? Simply proxying the TCP traffic to a server seems fairly dangerous unless there's something to keep the databases in sync between the servers being proxied to. This is (roughly) where I'd hope to see an availability group and read-only routing. |
@edwardneal thanks for the explanation. We do not host this server; the reason we use a TCP proxy is because it is IP-whitelist walled. Our use case is streaming multiple tables for changes, one container per table. Thus when connecting from a private network we have to NAT the traffic, which in this case is very expensive. Plus our private network is v6, while the target server is v4 only. So we have a set of nginx proxies that take a public IP from a pre-assigned pool, in a dual-stack network, and route the connections. |
Thanks @george-zubrienko. The scenario you've described makes sense, and doesn't have the danger I was initially thinking about (where the reverse proxy was publishing a number of different SQL Server instances without guaranteeing the consistency of their data.) Do you see the same problem if you route and NAT the traffic using a router or a firewall rather than an L4 proxy? I wouldn't immediately expect either of these to decouple the frontend and backend connection establishment as an L4 proxy does. |
@edwardneal we do use NAT gateway in AWS for low-traffic applications and I've never seen a connection loss there with this kind of error. It's not a 100% fair comparison since those use JDBC driver, but I'd presume it doesn't make any difference in this case. We also initially tested everything with NAT gateway and this issue never occurred. |
@edwardneal so we tried to tune buffers/timeouts in https://nginx.org/en/docs/stream/ngx_stream_proxy_module.html without any success. We'll switch to a NAT instance instead and I'm pretty sure this will solve the problem. However it is absolutely unclear why we see |
My best wishes! But in my AKS there is nothing special, only a clean Azure IP, and the problem really exists from .NET 8.0.0 - I have a theory that the problem is in the combination of Linux + .NET 8 interop for SSL/TLS and encryption. No proof yet. If you are able to get an ETL trace, that would help the guys. I'm not able to get it from a pod. |
@george-zubrienko thanks for the update. I'm not sure why you're seeing those errors though. If SQL Server is triggering the connection reset, only it is going to know why, and I don't know whether that type of logging is exposed. The SQL Server error log might contain an explanation, but it might also be triggered by anything along the network (or, as a long shot, nginx might not be sending TCP keepalives...) @JiriZidek are you able to add dotnet-trace to the pod? I can see one article which refers to building the image with these tools installed, and one article which installs them into a single pod (with all the caveats that approach implies!). If the image doesn't include curl, you should be able to use … If this isn't working, I can see a way to trace to a file using environment variables documented here. I've never tried this, but it might be easier to work with.
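If adding tools to the image is awkward, another option is an in-process listener that writes SqlClient's events to stdout, where kubectl logs will pick them up. A minimal sketch, assuming the provider name Microsoft.Data.SqlClient.EventSource:

```csharp
using System;
using System.Diagnostics.Tracing;

// Create one instance at startup, before the first SqlConnection is opened,
// and keep it alive (e.g. in a static field) for the lifetime of the app.
internal sealed class SqlClientEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Provider name assumed from the SqlClient event-tracing docs; verify for your version.
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            // Capture everything; narrow the level/keywords once the failing area is known.
            EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(" | ", eventData.Payload);
        Console.WriteLine($"{DateTime.UtcNow:O} [{eventData.EventName}] {payload}");
    }
}
```

It's noisier than dotnet-trace and has to be wired in before the first connection is opened, but it needs nothing beyond a redeploy. |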
@edwardneal so we were unable to identify a way to fix this with nginx. We tried HAProxy instead and the problem was gone. I do not know what's so special about the ngx stream proxy module, but the problem was definitely in how it handles upstream connections. |
I am currently investigating pretty much the same behavior as @JiriZidek in the OP, and I can confirm a downgrade fixes the issue (tested with .NET 6 in my case). Also, setting the MinPoolSize to 2 seems to reduce the errors somewhat. I want to note, however, that only downgrading the Microsoft.Data.SqlClient package does not help for me. I have to downgrade the dotnet version, and that includes a different base image (from Debian 12 to Debian 11). |
I think the issue is tied to the underlying infrastructure configuration. For me the issue is happening in lower environments with a smaller configuration, but there are absolutely no issues with the same code base and the same Docker image in a higher environment with a larger configuration. |
Yes, I have tested the 6.0.0-preview as well, and the problem seems rather orthogonal to the Microsoft.Data.SqlClient library itself. Since I was not able to capture ETW traces of the problem, I know the tracing is my debt towards solving it. Anyway, today I am waiting for .NET 9 in the hope it will solve it. If not, I'll bite the bullet and do all the ETW tracing on my own. Not happy with that. |
I do not know what you exactly mean by "higher configuration", but I take it as almost certain that the problem exists globally; on systems with steady traffic, when the number of SqlConnections in the pool is constant, you just do not perceive the problem. Try switching off connection pooling and the problem will surface! |
Since you mentioned having no problems in an Azure Function, I switched to the Azure Linux 3.0 base image in my test environment and have not seen the error for a few hours now. If that stays the same over the weekend I will do a test in prod. |
Lower config => AKS node size of Standard_D2s_v4. @JiriZidek I have very little traffic in one of the environments with the higher config and am not facing any issues. |
I have tested Debian- and Ubuntu-based .NET 8 images with the same (broken) result. This is a really interesting idea - what tag is it? |
I see. I use Standard_E2as_v5 for nodes in AKS and do observe the problem, to add a piece to the puzzle. |
8.0-azurelinux3.0 |
Firewall/network settings between the AKS node and SQL Server could also be a reason for this. I have observed the same error with .NET 6 as well, but very rarely (say once in 6 months or so). |
I decided to go forward with the prod deployment today and so far no errors in the past ~4h. In the 24h before that I saw the error 258 times for that service. I also tested the Debian-based image again in my dev environment and saw the error soon after. So at least for us I think it is safe to say SqlClient does not play well with the Debian 12 based .NET 8 image. |
After one day of testing, no errors so far! 🤞🤞🤞 |
This is interesting. For the record, is this true for ARM builds as well? |
Can't tell. Have only x64 software. |
I am only seeing this behavior with the SqlClient. If it was a generic SSL issue I would expect other connections showing similar issues. |
Also we had a handful of connection errors over the weekend, but these appear in the database log and the others never did. So the issue still seems to be fixed by changing the base image. Now my only fear is that whatever causes the issue in the Debian image will find its way into Azure Linux at some point. |
The Azure Linux 3.0 based container has hit only one error so far! "A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 0 - Success)" And this error looks a bit different (compare the original one at the top of this issue).
|
Story continues - even Azure Linux 3.0 has yielded the "35" error - but only once since 26-OCT (10 days) compared to several times a day with Debian 12. Today:
|
Looking forward to dotnet/efcore#33399 since it looks somehow related. |
@JiriZidek Did you try SqlClient 5.1.6? Is the issue still there? |
After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases
The error happens in two flavors:
We observed that this error usually happens after a longer (> minutes) period of inactivity (the Kestrel web app not being hit). After the first occurrence, the successive DB operations succeed. We use SqlClient in our code indirectly through EF.
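For context, a minimal illustration of the shape of that setup (hypothetical entity and context names, not the actual application code; the connection string is the one shown below):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// A typical call that hits the failure after a period of inactivity:
await using (var db = new ShopContext())
{
    Console.WriteLine(await db.Orders.CountAsync());
}

// Placeholder entity and context, for illustration only.
public class Order
{
    public int Id { get; set; }
    public string? Reference { get; set; }
}

public class ShopContext : DbContext
{
    public DbSet<Order> Orders => Set<Order>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlServer(
            "Server=tcp:SOME-sql.database.windows.net,1433;Initial Catalog=MyDB;" +
            "User ID=some-admin;Password=...;Encrypt=True;Connection Timeout=30;");
}
```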
Version information
Microsoft.Data.SqlClient version: 5.2.0 (but same observed for 5.1.5) - Microsoft.Data.SqlClient.dll 5.20.324059.2 dated 28-FEB-2024 size 911480 bytes
Target framework: .NET 8 - 8.0.3 (but same observed for 8.0.2, 8.0.1 & 8.0.0)
Operating system: Linux docker image - same result on mcr.microsoft.com/dotnet/aspnet:8.0 and mcr.microsoft.com/dotnet/aspnet:8.0-jammy - hosted in AKS, similar behavior for backend service using image runtime:8.0
Relevant code
Original code - in .NET 7 - now randomly failing in .NET 8:
Improved code - no better, same problems:
Failing place examples (they are random in fact, in various programs):
Connection string
Connection string looks similar to
Server=tcp:SOME-sql.database.windows.net,1433;Initial Catalog=MyDB;Persist Security Info=False;User ID=some-admin;Password=123456789;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;
Tried TrustServerCertificate=True with no effect.
Tried MultipleActiveResultSets=True with no effect.
Full stack traces
We hoped the problem would be solved by 5.2.0, but it does not look to be the case.