
Add async snapshot continue capability for multipacket fields #3161

Open
wants to merge 8 commits into base: main

Conversation

Wraith2
Contributor

@Wraith2 Wraith2 commented Feb 19, 2025

Split out from #2608 per discussion detailed in #2608 (comment)

Adds infrastructure to the async reader snapshot that allows the last packet of the snapshot to be captured as a continue point for the snapshot. On replay, if a continue point is available it is used, which skips redoing the previous work and improves speed.

The name and behaviour of the AppContext switch being used will need agreement and work. This continue capability requires the partial-packet capability, so enabling legacy mode for partial packets will force this feature to be disabled. I expect there to be some discussion around the names and exact details, so the current PR is in dev form: easy to see and change.
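As a rough conceptual model of the change (illustrative names only, not the PR's actual types): the snapshot records packets as they arrive, and a replay can resume from a captured continue point instead of the start.

```csharp
using System.Collections.Generic;

// Illustrative-only model of the continue-point idea; the real snapshot in
// TdsParserStateObject holds TDS packets and considerably more state.
sealed class PacketSnapshot
{
    private readonly List<string> _packets = new List<string>();
    private int _continueIndex = -1;

    public void Append(string packet) => _packets.Add(packet);

    // Called when an async read has fully consumed everything up to the
    // newest packet: the next replay can skip the packets before it.
    public void CaptureContinuePoint() => _continueIndex = _packets.Count - 1;

    // Replay starts from the continue point when one was captured,
    // otherwise from the beginning of the snapshot.
    public IEnumerable<string> Replay()
    {
        int start = _continueIndex >= 0 ? _continueIndex : 0;
        for (int i = start; i < _packets.Count; i++)
        {
            yield return _packets[i];
        }
    }
}
```

Without a continue point, every replay reprocesses all buffered packets, which is why replays get progressively more expensive as the packet count grows.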

Benchmarks for reading a 20 MiB string from a local SQL Server. As payload sizes and latency increase, the speed differences should become more noticeable. I invite and request people to do some benchmarking to make sure we're aware of the performance characteristics of this change.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
.NET SDK 9.0.200
  [Host]     : .NET 8.0.13 (8.0.1325.6609), X64 RyuJIT
  DefaultJob : .NET 8.0.13 (8.0.1325.6609), X64 RyuJIT


| Method | UseContinue | Mean      | Error     | StdDev    | Gen0      | Gen1      | Gen2      | Allocated |
|------- |------------ |----------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Async  | False       | 757.51 ms | 15.053 ms | 36.642 ms | 2000.0000 | 1000.0000 |         - | 101.49 MB |
| Sync   | False       |  39.40 ms |  0.543 ms |  0.508 ms | 2000.0000 |  888.8889 |  777.7778 |  80.14 MB |
| Async  | True        |  49.45 ms |  0.901 ms |  1.376 ms | 4333.3333 | 3555.5556 | 1111.1111 | 101.51 MB |
| Sync   | True        |  40.09 ms |  0.476 ms |  0.445 ms | 2000.0000 |  888.8889 |  777.7778 |  80.14 MB |

/cc @MichelZ and @ErikEJ for interest

@Wraith2
Contributor Author

Wraith2 commented Feb 19, 2025

/azp run


Commenter does not have sufficient privileges for PR 3161 in repo dotnet/SqlClient

@mdaigle
Contributor

mdaigle commented Feb 19, 2025

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@mdaigle
Contributor

mdaigle commented Feb 19, 2025

Commenter does not have sufficient privileges for PR 3161 in repo dotnet/SqlClient

Unfortunately, new security requirements restrict our ability to run CI for non-maintainers. Feel free to ping the team alias (@dotnet/sqlclientdevteam) or any of us directly and we'll do our best to trigger a pipeline run ASAP.

See also: #3152

@Wraith2
Contributor Author

Wraith2 commented Feb 23, 2025

Based on the feedback provided I have added a new compatibility switch called Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour which will disable the new behaviour. If CompatProcessSni is enabled, the new switch will automatically return true. The new async behaviour is now enabled by default in this PR. Can someone re-run the CI please.
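A minimal sketch of how such a switch might be read. The switch name comes from this PR; the coupling to CompatProcessSni is paraphrased from the comment above, and the method shape is hypothetical rather than the actual LocalAppContextSwitches code:

```csharp
using System;

static class AsyncBehaviourSwitch
{
    private const string Name =
        "Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour";

    // Returns true when the legacy (compatibility) async behaviour should be
    // used. If the older CompatProcessSni switch is on, the continue
    // capability cannot work, so compatibility mode is forced on as well.
    public static bool UseCompatibilityAsyncBehaviour(bool compatProcessSniEnabled)
    {
        if (compatProcessSniEnabled)
        {
            return true;
        }
        return AppContext.TryGetSwitch(Name, out bool enabled) && enabled;
    }
}
```

Applications would opt out of the new behaviour with `AppContext.SetSwitch(...)` before the first database operation.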

@ErikEJ
Contributor

ErikEJ commented Feb 23, 2025

@Wraith2 Is this the "final" PR ?? (I lost track a few years ago)

@Wraith2
Contributor Author

Wraith2 commented Feb 23, 2025

Yup, part 3 of 3 of #2608, so this build should give faster, stable async strings.

@paulmedynski
Contributor

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@paulmedynski paulmedynski requested a review from a team February 24, 2025 12:51
@paulmedynski (Contributor) left a comment

Some comments for discussion, and a couple of typos.

@cheenamalhotra
Member

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Feb 25, 2025

CI is broken. Can you re-kick it please.

@paulmedynski paulmedynski requested a review from a team February 25, 2025 18:41
@mdaigle
Contributor

mdaigle commented Feb 26, 2025

I restarted the failing jobs.

@Wraith2
Contributor Author

Wraith2 commented Feb 26, 2025

The jobs are failing because the manual tests are timing out. The tests are timing out because stream operations are hitting a path that I wasn't aware of (and didn't break the last two times I reimplemented this) so more work is needed.

@Wraith2
Contributor Author

Wraith2 commented Mar 2, 2025

I've made many changes to get streams working. Most things seem to work now, but it's hard to tell given what the test experience is like locally. Can someone on the @dotnet/sqlclientdevteam run the CI please.

@paulmedynski
Contributor

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@paulmedynski paulmedynski requested a review from a team March 3, 2025 12:09

codecov bot commented Mar 3, 2025

Codecov Report

Attention: Patch coverage is 92.24806% with 30 lines in your changes missing coverage. Please review.

Project coverage is 66.47%. Comparing base (d73bc16) to head (a2ad9d2).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|--------:|-------|
| ...c/Microsoft/Data/SqlClient/TdsParserStateObject.cs | 86.95% | 18 Missing ⚠️ |
| .../netcore/src/Microsoft/Data/SqlClient/TdsParser.cs | 96.42% | 4 Missing ⚠️ |
| ...nt/netfx/src/Microsoft/Data/SqlClient/TdsParser.cs | 96.42% | 4 Missing ⚠️ |
| ...icrosoft/Data/SqlClient/LocalAppContextSwitches.cs | 71.42% | 2 Missing ⚠️ |
| ...nt/src/Microsoft/Data/SqlClient/SqlCachedBuffer.cs | 88.88% | 2 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (d73bc16) and HEAD (a2ad9d2). Click for more details.

HEAD has 1 upload less than BASE

| Flag | BASE (d73bc16) | HEAD (a2ad9d2) |
|------|---------------:|---------------:|
| addons | 1 | 0 |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3161      +/-   ##
==========================================
- Coverage   72.93%   66.47%   -6.47%     
==========================================
  Files         287      282       -5     
  Lines       59173    59259      +86     
==========================================
- Hits        43160    39393    -3767     
- Misses      16013    19866    +3853     
| Flag | Coverage Δ |
|------|------------|
| addons | ? |
| netcore | 69.47% <90.47%> (-6.12%) ⬇️ |
| netfx | 65.05% <90.54%> (-6.20%) ⬇️ |

Flags with carried forward coverage won't be shown.

@Wraith2
Contributor Author

Wraith2 commented Mar 4, 2025

I'll try to look later. Rebasing this set of changes is not fun.

@Wraith2 Wraith2 force-pushed the operation-status-part3 branch from 3e8eb21 to 0fbae86 Compare March 4, 2025 20:13
@Wraith2
Contributor Author

Wraith2 commented Mar 4, 2025

Rebased, kick CI please.

use as instead of casts for safety
@Wraith2
Contributor Author

Wraith2 commented Mar 4, 2025

I recovered the perf for the compatibility path and kept the extra speed on the new path.

@paulmedynski
Contributor

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Mar 5, 2025

@ErikEJ
Contributor

ErikEJ commented Mar 5, 2025

@Wraith2 So I ran the EF Core 9 (release/9.0 branch) tests, and then updated to the package above.

The tests run in around the same amount of time as previously, but two tests (really one) are failing!

I have attached the results.

These are the queries against Northwind, you can find a .sql script with the database here in the EF Core repo:

test/EFCore.SqlServer.FunctionalTests/Northwind.sql

p0=NULL (Nullable = false)
SELECT * FROM "Employees" WHERE "ReportsTo" = @p0 OR ("ReportsTo" IS NULL AND @p0 IS NULL)

    [ConditionalTheory]
    [MemberData(nameof(IsAsyncData))]
    public virtual Task FromSqlRaw_queryable_with_null_parameter(bool async)
    {
        uint? reportsTo = null;

        return AssertQuery(
            async,
            ss => ((DbSet<Employee>)ss.Set<Employee>()).FromSqlRaw(
                NormalizeDelimitersInRawString(
                    // ReSharper disable once ExpressionIsAlwaysNull
                    "SELECT * FROM [Employees] WHERE [ReportsTo] = {0} OR ([ReportsTo] IS NULL AND {0} IS NULL)"), reportsTo),
            ss => ss.Set<Employee>().Where(x => x.ReportsTo == reportsTo));
    }

TestResults.zip

@Wraith2
Contributor Author

Wraith2 commented Mar 5, 2025

Thanks, it's a real new bug. I'll investigate.

@ErikEJ
Contributor

ErikEJ commented Mar 5, 2025

@Wraith2 Cool, let me know if you need additional repro info.

@Wraith2
Contributor Author

Wraith2 commented Mar 5, 2025

Fixed. It looks like the existing tests don't cover the case of a binary non-plp column, so the handling of the image column in the query needed tweaking to be able to deal with the continue state. I've added covering tests based on the test provided.

@dotnet/sqlclientdevteam can you kick the CI please.

@mdaigle
Contributor

mdaigle commented Mar 5, 2025

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Mar 5, 2025

All green; artifacts here: https://sqlclientdrivers.visualstudio.com/904996cc-6198-4d39-8540-eca72bdf0b7b/_apis/build/builds/110607/artifacts?artifactName=Artifacts&api-version=7.1&%24format=zip

If anyone can think of a way to get wider testing of the PR artifacts it might be interesting to try. It's clear that the CI doesn't cover all code paths, and there's no reason to assume that EF Core tests cover regions that SqlClient ones don't. The more we find and fix before it reaches the unwilling public, the better.

@cheenamalhotra cheenamalhotra added this to the 7.0-preview1 milestone Mar 5, 2025
@ErikEJ
Contributor

ErikEJ commented Mar 6, 2025

@Wraith2 All EF Core 9 tests passing with the latest package, and run duration is around the same as with 5.1.6.

@ErikEJ
Contributor

ErikEJ commented Mar 6, 2025

@Wraith2 A way to get wider testing is to include this in a 7.0 preview

@Wraith2
Contributor Author

Wraith2 commented Mar 6, 2025

All EF tests passing is good news. Unless the tests have a lot of long strings I wouldn't expect to see any large time differences; in terms of testing, reading 1 MiB is much the same as reading 100 MiB.

It's been marked for inclusion in the next preview, but I don't know how much uptake previews get. If previews are widely used then that will be useful.

@mdaigle
Contributor

mdaigle commented Mar 12, 2025

I was able to replicate your performance numbers with the new internal perf tests we're building out. Very exciting!
I'm actively working through this PR and hope to submit any comments today. Thank you!

@Wraith2
Contributor Author

Wraith2 commented Mar 12, 2025

Good news, thanks. Let me know if anything needs explaining; I can either try to write it up or arrange another call to go through it.

@@ -159,6 +159,16 @@ public static bool UseCompatibilityProcessSni
return (bool)switchesType.GetProperty(nameof(UseCompatibilityProcessSni), BindingFlags.Public | BindingFlags.Static).GetValue(null);
}
}

public static bool UseCompatibilityAsyncBehaviour
Contributor

Was this reflection layer added because LocalAppContextSwitches is internal?
cc @benrr101 another internal class testing challenge. We can't actually easily test our compatibility flags :(

Contributor Author

Yes. I'm re-using one of the files from the main project to make sure the test and real implementations can't get out of sync, but that means that various classes and namespaces have to be put in exactly the right places.

@@ -37,18 +37,34 @@ private SqlCachedBuffer(List<byte[]> cachedBytes)
/// </summary>
internal static TdsOperationStatus TryCreate(SqlMetaDataPriv metadata, TdsParser parser, TdsParserStateObject stateObj, out SqlCachedBuffer buffer)
Contributor

Note to self: this class is XML-specific. It's not used for other data types. Naming could use an update for clarity.

Contributor

@Wraith2 checking my understanding of this class:
PLP bytes are read from the network in chunks (when the overall length is unknown) or in their entirety (when the overall length is known). If chunked, each chunk ends up in a separate byte[]. Each byte[] is added to the overall list of byte[]s: _cachedBytes. Consumers can stream the PLP data by iterating through the byte arrays.

Contributor Author

That matches what I found. It's the first time I've encountered this particular class and it was a bit surprising.
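A tiny sketch of the chunked model described in this exchange (illustrative only; SqlCachedBuffer's real implementation differs in detail):

```csharp
using System.Collections.Generic;

// Chunked PLP accumulation: each network read yields one byte[] chunk, and
// the chunks are kept in a list rather than being copied into one large
// contiguous buffer.
sealed class ChunkedBuffer
{
    private readonly List<byte[]> _cachedBytes = new List<byte[]>();

    public void AddChunk(byte[] chunk) => _cachedBytes.Add(chunk);

    // Consumers can stream through the chunks without a single
    // large allocation.
    public IReadOnlyList<byte[]> Chunks => _cachedBytes;

    public long TotalLength
    {
        get
        {
            long total = 0;
            foreach (byte[] chunk in _cachedBytes)
            {
                total += chunk.Length;
            }
            return total;
        }
    }
}
```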

List<byte[]> cachedBytes = null;
if (isAvailable)
{
cachedBytes = stateObj.TryTakeSnapshotStorage() as List<byte[]>;
Contributor

In what situation would we need to restore the cached bytes list from the snapshot? Isn't the point of this class that they are already cached here? Is it because this TryCreate method may be called multiple times? And if so, is there some way we can avoid that?

It is confusing to me that the snapshot storage may contain either byte[] or List<byte[]>.

Contributor Author

The snapshot storage can contain byte[], char[] or List<byte[]> depending on whether we're reading binary, strings or XML. I hadn't realised that there was a List<byte[]> option until recently. The snapshot field used to be called _plpBuffer but I renamed it to be more general once I found that it needed to contain multiple types.
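A hypothetical miniature of that take-and-clear storage pattern (names assumed, not the driver's actual members):

```csharp
// The snapshot keeps a single object-typed storage slot; callers take it
// back with an `as` cast matching the data being read (byte[] for binary,
// char[] for strings, List<byte[]> for XML).
sealed class SnapshotStorageSlot
{
    private object _storage;

    public void SetStorage(object storage) => _storage = storage;

    // Takes ownership: the slot is cleared so a stale buffer can never be
    // observed by a later operation.
    public object TryTakeStorage()
    {
        object storage = _storage;
        _storage = null;
        return storage;
    }
}
```

A caller reading XML would then write `var chunks = slot.TryTakeStorage() as List<byte[]>;`, where the `as` cast harmlessly yields null if a different type was stored.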

byteArr = new byte[cb];
result = stateObj.TryReadPlpBytes(ref byteArr, 0, cb, out cb);
byte[] byteArr = new byte[cb];
// pass false for the writeDataSizeToSnapshot parameter because we want to only take data
Contributor

Is this required due to the snapshot behavior above (using the snapshot to store the List<byte[]>)?

Contributor Author

Yes. In this situation in the cached buffer each read is self-contained, so we don't want any of the continue logic inside the callee to kick in.

return result;
if (result == TdsOperationStatus.NeedMoreData && isAvailable && cb == byteArr.Length)
{
// succeeded in getting the data but failed to find the next plp length
Contributor

Do you have a concrete example of how this might happen? Is this if plpLength is greater than the max chunk size?

Contributor Author

It happens at the end of each packet. We've read the bytes that we know were in the packet; we need more data, but when we asked for another packet we got need-more-data back.
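The boundary case can be sketched like this (TdsOperationStatus is a real type in the driver, but the method and names here are illustrative): the output buffer is completely filled, yet the status is still NeedMoreData because the next PLP chunk length lives in a packet that has not arrived.

```csharp
using System;
using System.Collections.Generic;

enum TdsOperationStatus { Done, NeedMoreData }

static class PlpReader
{
    // Fills dest from the queued packets; reports NeedMoreData when either
    // the data itself or the *next* chunk length is in a missing packet.
    // (The remainder of a partially consumed packet is discarded for brevity.)
    public static TdsOperationStatus TryReadPlpBytes(
        Queue<byte[]> arrivedPackets, byte[] dest, out int filled)
    {
        filled = 0;
        while (filled < dest.Length)
        {
            if (arrivedPackets.Count == 0)
            {
                return TdsOperationStatus.NeedMoreData;
            }
            byte[] packet = arrivedPackets.Dequeue();
            int count = Math.Min(packet.Length, dest.Length - filled);
            Array.Copy(packet, 0, dest, filled, count);
            filled += count;
        }
        // All requested bytes were read, but the next PLP chunk length must
        // still be parsed; if no packet holds it yet we get the
        // "full buffer + NeedMoreData" combination checked in the PR.
        return arrivedPackets.Count == 0
            ? TdsOperationStatus.NeedMoreData
            : TdsOperationStatus.Done;
    }
}
```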

@@ -2063,7 +2157,10 @@ internal TdsOperationStatus TryReadNetworkPacket()
internal void PrepareReplaySnapshot()
{
_networkPacketTaskSource = null;
_snapshot.MoveToStart();
if (!_snapshot.MoveToContinue())
Contributor

This is the key line for the performance improvements. If we have a continuation point, we can start the packet replay from there instead of the start of the snapshot. This method is invoked whenever we continue an async read operation.

{
length = (ulong)stateObj.GetSnapshotStorageLength<byte>();
isNull = length == 0;
return TdsOperationStatus.Done;
Contributor

In this case, we can skip because we'll only have a continuing snapshot if we already read the header?

Contributor Author

Yes, sort of.
In this case we must skip it because the length is only present in the first packet, before we start reading data. If we aren't on the first packet in the snapshot then a read here is incorrect.
Reading the length from the storage array avoids needing to keep another field in the snapshot to store the length of non-plp fields.

(bool isAvailable, bool isStarting, bool isContinuing) = stateObj.GetSnapshotStatuses();
if (isAvailable)
{
if (isContinuing || isStarting)
Contributor

Lamenting the fact that we have to check the snapshot state in so many different places 😢
Not to be addressed in this PR.

Contributor Author

I can appreciate the sentiment.
Consider that there was the possibility of needing changes in the relationship between the SqlDataReader and the parser, which would have been much more complicated. I've done my best to minimize the amount of code that needs to change to accommodate the new functionality. There may be future opportunities for improvement.

I found that the complexity in this PR is mostly not in the code; it's in understanding the sequence of events during async callbacks. In particular, the nuance between the isAvailable, isStarting and isContinuing flags isn't big in terms of code, but it matters immensely for correctness and for the ability to turn off the continue capability.

if (isContinuing || isStarting)
{
temp = stateObj.TryTakeSnapshotStorage() as byte[];
Debug.Assert(bytes == null || bytes.Length == length, "stored buffer length must be null or must have been created with the correct length");
Contributor

Why do we expect the snapshot storage to have this length? It doesn't seem to me that this would be true in general.

Contributor Author

It's a coordination with the other part of the code, which makes sure that the array has been written back in the need-more-data state.

The only way to arrive at this line of code is if we are continuing, which means we have reached at least the second packet and have hit a packet transition where need-more-data was returned. In this situation the array that we created to copy the data into while in the isStarting state will have been written back to the snapshot storage.

It is also logically incorrect to end an async read with a non-null _storage buffer in the snapshot. That means you can never find a buffer that some previous operation left hanging around: an async operation must either succeed and have taken the array cleanly, or the cleanup process for snapshots will have nulled the array.

while (charsLeft > 0)
{
if (partialReadInProgress)
{
goto resumePartialRead;
Contributor

I'd prefer to avoid using goto even if it means another layer of nesting.

Contributor Author

Why? It's a supported and useful language construct that does exactly what we need in this case.
I know there are people who have ideological objections to goto, but in this case it's the cleanest way to get the logic we need. Unfortunately the language's definite-assignment rules prevent it being used to go directly to the place we really want.
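A toy example of the pattern under discussion (not the driver's actual read loop): on re-entry after an interrupted iteration, the goto jumps past the work that was already done instead of wrapping the whole body in another level of nesting.

```csharp
static class Resume
{
    // Sums "processed" items; when resumeInProgress is set, the first
    // iteration merges a previously saved partial result and skips the
    // fresh-work step, mirroring the goto-resume shape in the PR's loop.
    public static int SumWithResume(int[] values, bool resumeInProgress, int savedPartial)
    {
        int total = 0;
        foreach (int value in values)
        {
            int item = value;
            if (resumeInProgress)
            {
                item += savedPartial;      // merge the saved partial result
                resumeInProgress = false;
                goto resumePartialRead;    // skip the work already done
            }
            item *= 2;                     // "fresh" work, skipped on resume
        resumePartialRead:
            total += item;
        }
        return total;
    }
}
```

The alternative is an `else` block around the fresh-work step, which costs a nesting level for every resumable stage in the loop body.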
