Add packed SearchValues<string> specialization for two strings #122749
…tring> Co-authored-by: stephentoub <[email protected]>
Co-authored-by: stephentoub <[email protected]>
src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/StringSearchValues.cs
Co-authored-by: stephentoub <[email protected]>
src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/StringSearchValues.cs
Pull request overview
This PR adds a specialized SearchValues<string> implementation for exactly two strings, following the same vectorized approach used for single-string searches. The implementation picks 2 anchor characters per string and compares them simultaneously in the inner loop.
Key changes:
- New `TwoStringSearchValuesThreeChars<TCaseSensitivity>` class that vectorizes search for two strings
- Updated `StringSearchValues.cs` to use the specialization when exactly 2 values are provided
- Added `GetSecondCharacterOffset` helper methods to `CharacterFrequencyHelper.cs` with constrained offset support for different-length strings
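To illustrate the anchor idea from the overview, here is a minimal scalar sketch. It is not the PR's vectorized code: the class name, method name, and the `offsetA`/`offsetB` parameters are hypothetical, and it assumes non-empty needles with each offset smaller than its needle's length. Each position is first filtered by two anchor characters per needle (the first character plus one more), and only surviving candidates get the full `StartsWith` verification.

```csharp
using System;

static class TwoStringAnchorSketch
{
    // Returns the index of the first occurrence of either needle, or -1.
    // offsetA / offsetB are hypothetical second-anchor positions inside each needle.
    public static int IndexOfAny(ReadOnlySpan<char> haystack, string a, string b, int offsetA, int offsetB)
    {
        int minLength = Math.Min(a.Length, b.Length);

        for (int i = 0; i + minLength <= haystack.Length; i++)
        {
            // Cheap filter: compare only the two anchor characters of each needle.
            bool candidateA = i + a.Length <= haystack.Length
                && haystack[i] == a[0]
                && haystack[i + offsetA] == a[offsetA];
            bool candidateB = i + b.Length <= haystack.Length
                && haystack[i] == b[0]
                && haystack[i + offsetB] == b[offsetB];

            // Full verification runs only for positions that passed the filter.
            if (candidateA && haystack.Slice(i).StartsWith(a.AsSpan(), StringComparison.Ordinal))
            {
                return i;
            }

            if (candidateB && haystack.Slice(i).StartsWith(b.AsSpan(), StringComparison.Ordinal))
            {
                return i;
            }
        }

        return -1;
    }
}
```

A vectorized implementation would evaluate the same anchor comparisons for many positions at once and only fall back to per-position verification where the resulting mask is non-zero.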
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| TwoStringSearchValuesThreeChars.cs | New specialized implementation using vectorized two-anchor-per-string comparison with Vector128/256/512 support |
| StringSearchValues.cs | Added TryCreateForTwoValues method with logic to select the specialization based on string properties and case sensitivity |
| CharacterFrequencyHelper.cs | Added GetSecondCharacterOffset overloads to select optimal anchor character positions, with support for constraining offsets to shorter string length |
| System.Private.CoreLib.Shared.projitems | Registered the new TwoStringSearchValuesThreeChars.cs file in the build |
| StringSearchValues.cs (tests) | Added test cases covering two-string scenarios with same/different first characters |
...s/System.Private.CoreLib/src/System/SearchValues/Strings/Helpers/CharacterFrequencyHelper.cs
...es/System.Private.CoreLib/src/System/SearchValues/Strings/TwoStringSearchValuesThreeChars.cs
Co-authored-by: stephentoub <[email protected]>
|
@copilot, System.Memory.Tests are failing |
I've run the System.Memory tests locally and they all pass (52,123 passed, 1 skipped). Could you share which specific test is failing in CI? This would help me identify if it's a platform-specific issue (ARM64, WASM, etc.) or something else.
Warning: Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Implemented in bc75124. The inner loop now:
This avoids 2 additional |
|
@EgorBot -amd -intel -arm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(TwoStringSearchValuesBenchmark).Assembly).Run(args);
[MemoryDiagnoser]
public class TwoStringSearchValuesBenchmark
{
private string _haystack = default!;
private SearchValues<string> _twoStringsOrdinal = default!;
private SearchValues<string> _twoStringsIgnoreCase = default!;
[Params(100, 1000, 10000)]
public int HaystackLength { get; set; }
[Params("Early", "Middle", "Late", "None")]
public string MatchPosition { get; set; } = default!;
[GlobalSetup]
public void Setup()
{
// Create haystack with the match at various positions
char[] chars = new char[HaystackLength];
Random rand = new(42);
for (int i = 0; i < chars.Length; i++)
{
// Use characters that won't match our search strings
chars[i] = (char)('m' + rand.Next(10)); // m-v range
}
string needle1 = "hello";
string needle2 = "world";
int insertPos = MatchPosition switch
{
"Early" => 10,
"Middle" => HaystackLength / 2,
"Late" => HaystackLength - 10,
_ => -1
};
if (insertPos >= 0 && insertPos + needle1.Length <= HaystackLength)
{
needle1.AsSpan().CopyTo(chars.AsSpan(insertPos));
}
_haystack = new string(chars);
_twoStringsOrdinal = SearchValues.Create([needle1, needle2], StringComparison.Ordinal);
_twoStringsIgnoreCase = SearchValues.Create([needle1, needle2], StringComparison.OrdinalIgnoreCase);
}
[Benchmark]
public int TwoStrings_Ordinal() => _haystack.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int TwoStrings_IgnoreCase() => _haystack.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
} |
|
@copilot, the benchmark in the comments is effectively testing best case for this optimization. Can you tweak it to also test worst case? |
Here's an updated benchmark that tests both best case (infrequent first chars) and worst case (frequent first chars) for the early exit optimization:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(TwoStringSearchValuesBenchmark).Assembly).Run(args);
[MemoryDiagnoser]
public class TwoStringSearchValuesBenchmark
{
private string _haystackBestCase = default!;
private string _haystackWorstCase = default!;
private SearchValues<string> _twoStringsOrdinal = default!;
private SearchValues<string> _twoStringsIgnoreCase = default!;
[Params(100, 1000, 10000)]
public int HaystackLength { get; set; }
[Params("Early", "Middle", "Late", "None")]
public string MatchPosition { get; set; } = default!;
[GlobalSetup]
public void Setup()
{
string needle1 = "hello";
string needle2 = "world";
int insertPos = MatchPosition switch
{
"Early" => 10,
"Middle" => HaystackLength / 2,
"Late" => HaystackLength - 10,
_ => -1
};
// Best case: haystack uses chars that don't match first chars of needles (h, w)
char[] charsBest = new char[HaystackLength];
Random rand = new(42);
for (int i = 0; i < charsBest.Length; i++)
{
charsBest[i] = (char)('m' + rand.Next(10)); // m-v range, avoids 'h' and 'w'
}
if (insertPos >= 0 && insertPos + needle1.Length <= HaystackLength)
{
needle1.AsSpan().CopyTo(charsBest.AsSpan(insertPos));
}
_haystackBestCase = new string(charsBest);
// Worst case: haystack is full of 'h' and 'w' (first chars of needles)
// This forces the early exit branch to rarely trigger
char[] charsWorst = new char[HaystackLength];
for (int i = 0; i < charsWorst.Length; i++)
{
charsWorst[i] = (i % 2 == 0) ? 'h' : 'w';
}
if (insertPos >= 0 && insertPos + needle1.Length <= HaystackLength)
{
needle1.AsSpan().CopyTo(charsWorst.AsSpan(insertPos));
}
_haystackWorstCase = new string(charsWorst);
_twoStringsOrdinal = SearchValues.Create([needle1, needle2], StringComparison.Ordinal);
_twoStringsIgnoreCase = SearchValues.Create([needle1, needle2], StringComparison.OrdinalIgnoreCase);
}
[Benchmark]
public int BestCase_Ordinal() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int BestCase_IgnoreCase() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
[Benchmark]
public int WorstCase_Ordinal() => _haystackWorstCase.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int WorstCase_IgnoreCase() => _haystackWorstCase.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
}
The worst case fills the haystack with alternating 'h' and 'w' characters (the first characters of "hello" and "world"), which forces the early exit optimization to fail on every iteration and perform the full comparison including the second anchor character loads.
Here's an updated benchmark that tests both best case (no first char matches - rare 'h'/'w' in haystack) and worst case (frequent first char matches - 'a' appears often):
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(TwoStringSearchValuesBenchmark).Assembly).Run(args);
[MemoryDiagnoser]
public class TwoStringSearchValuesBenchmark
{
private string _haystackBestCase = default!;
private string _haystackWorstCase = default!;
private SearchValues<string> _twoStringsOrdinal = default!;
private SearchValues<string> _twoStringsIgnoreCase = default!;
private SearchValues<string> _worstCaseOrdinal = default!;
[Params(100, 1000, 10000)]
public int HaystackLength { get; set; }
[Params("Early", "Middle", "Late", "None")]
public string MatchPosition { get; set; } = default!;
[GlobalSetup]
public void Setup()
{
Random rand = new(42);
// Best case: haystack has no 'h' or 'w' (first chars of "hello"/"world")
char[] charsBest = new char[HaystackLength];
for (int i = 0; i < charsBest.Length; i++)
charsBest[i] = (char)('m' + rand.Next(10)); // m-v range, no h or w
// Worst case: haystack has many 'a's (first char of both needles)
char[] charsWorst = new char[HaystackLength];
for (int i = 0; i < charsWorst.Length; i++)
charsWorst[i] = rand.Next(4) == 0 ? 'a' : (char)('b' + rand.Next(10)); // 25% 'a's
string needle1Best = "hello";
string needle2Best = "world";
string needle1Worst = "abc";
string needle2Worst = "axy";
int insertPos = MatchPosition switch
{
"Early" => 10,
"Middle" => HaystackLength / 2,
"Late" => HaystackLength - 10,
_ => -1
};
if (insertPos >= 0 && insertPos + needle1Best.Length <= HaystackLength)
{
needle1Best.AsSpan().CopyTo(charsBest.AsSpan(insertPos));
needle1Worst.AsSpan().CopyTo(charsWorst.AsSpan(insertPos));
}
_haystackBestCase = new string(charsBest);
_haystackWorstCase = new string(charsWorst);
_twoStringsOrdinal = SearchValues.Create([needle1Best, needle2Best], StringComparison.Ordinal);
_twoStringsIgnoreCase = SearchValues.Create([needle1Best, needle2Best], StringComparison.OrdinalIgnoreCase);
_worstCaseOrdinal = SearchValues.Create([needle1Worst, needle2Worst], StringComparison.Ordinal);
}
[Benchmark]
public int BestCase_Ordinal() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int BestCase_IgnoreCase() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
[Benchmark]
public int WorstCase_Ordinal() => _haystackWorstCase.AsSpan().IndexOfAny(_worstCaseOrdinal);
}
Best case: Haystack uses 'm'-'v' range, so 'h' and 'w' (first chars of "hello"/"world") never appear. Early exit optimization skips second char loads on every iteration.
Worst case: Haystack has ~25% 'a' characters, and both needles start with 'a' ("abc", "axy"). Early exit optimization rarely helps since the branch is taken frequently. |
|
@EgorBot -amd -intel -arm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(TwoStringSearchValuesBenchmark).Assembly).Run(args);
[MemoryDiagnoser]
public class TwoStringSearchValuesBenchmark
{
private string _haystackBestCase = default!;
private string _haystackWorstCase = default!;
private SearchValues<string> _twoStringsOrdinal = default!;
private SearchValues<string> _twoStringsIgnoreCase = default!;
private SearchValues<string> _worstCaseOrdinal = default!;
[Params(100, 1000, 10000)]
public int HaystackLength { get; set; }
[Params("Early", "Middle", "Late", "None")]
public string MatchPosition { get; set; } = default!;
[GlobalSetup]
public void Setup()
{
Random rand = new(42);
// Best case: haystack has no 'h' or 'w' (first chars of "hello"/"world")
char[] charsBest = new char[HaystackLength];
for (int i = 0; i < charsBest.Length; i++)
charsBest[i] = (char)('m' + rand.Next(10)); // m-v range, no h or w
// Worst case: haystack has many 'a's (first char of both needles)
char[] charsWorst = new char[HaystackLength];
for (int i = 0; i < charsWorst.Length; i++)
charsWorst[i] = rand.Next(4) == 0 ? 'a' : (char)('b' + rand.Next(10)); // 25% 'a's
string needle1Best = "hello";
string needle2Best = "world";
string needle1Worst = "abc";
string needle2Worst = "axy";
int insertPos = MatchPosition switch
{
"Early" => 10,
"Middle" => HaystackLength / 2,
"Late" => HaystackLength - 10,
_ => -1
};
if (insertPos >= 0 && insertPos + needle1Best.Length <= HaystackLength)
{
needle1Best.AsSpan().CopyTo(charsBest.AsSpan(insertPos));
needle1Worst.AsSpan().CopyTo(charsWorst.AsSpan(insertPos));
}
_haystackBestCase = new string(charsBest);
_haystackWorstCase = new string(charsWorst);
_twoStringsOrdinal = SearchValues.Create([needle1Best, needle2Best], StringComparison.Ordinal);
_twoStringsIgnoreCase = SearchValues.Create([needle1Best, needle2Best], StringComparison.OrdinalIgnoreCase);
_worstCaseOrdinal = SearchValues.Create([needle1Worst, needle2Worst], StringComparison.Ordinal);
}
[Benchmark]
public int BestCase_Ordinal() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int BestCase_IgnoreCase() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
[Benchmark]
public int WorstCase_Ordinal() => _haystackWorstCase.AsSpan().IndexOfAny(_worstCaseOrdinal);
} |
|
@copilot, revert the last commit implementing (1) and instead implement (4), choosing the second offset to have the lowest combined frequency across both needles |
…ency (optimization 4) Co-authored-by: stephentoub <[email protected]>
Implemented in fcbb1be. Reverted the early exit optimization and instead implemented shared offset optimization (4):
|
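The "lowest combined frequency" selection requested above can be sketched roughly as follows. This is an illustration only: `SharedOffsetSketch`, `GetSharedSecondOffset`, and the `Frequency` weights are hypothetical stand-ins, not the helper or the frequency table used in the PR. The sketch assumes both needles have at least two characters and restricts the search to offsets valid in the shorter one.

```csharp
using System;

static class SharedOffsetSketch
{
    // Hypothetical frequency estimate (higher = more common in typical text).
    // These weights are made up for illustration; they are not the runtime's table.
    private static float Frequency(char c) => c switch
    {
        ' ' or 'e' => 10f,
        't' or 'a' or 'o' or 'i' or 'n' => 7f,
        's' or 'h' or 'r' or 'l' => 5f,
        >= 'a' and <= 'z' => 2f,
        _ => 1f,
    };

    // Picks one offset (>= 1) shared by both needles such that the characters
    // at that offset have the lowest summed frequency estimate.
    public static int GetSharedSecondOffset(string a, string b)
    {
        int minLength = Math.Min(a.Length, b.Length);
        if (minLength < 2)
        {
            throw new ArgumentException("Both needles need at least two characters.");
        }

        int bestOffset = 1;
        float bestScore = float.MaxValue;

        for (int i = 1; i < minLength; i++)
        {
            float score = Frequency(a[i]) + Frequency(b[i]);
            if (score < bestScore)
            {
                bestScore = score;
                bestOffset = i;
            }
        }

        return bestOffset;
    }
}
```

Sharing a single offset means one additional haystack load can serve both needles, at the cost of possibly picking a less selective character for one of them.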
|
@EgorBot -amd -intel -arm
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(TwoStringSearchValuesBenchmark).Assembly).Run(args);
[MemoryDiagnoser]
public class TwoStringSearchValuesBenchmark
{
private string _haystackBestCase = default!;
private string _haystackWorstCase = default!;
private SearchValues<string> _twoStringsOrdinal = default!;
private SearchValues<string> _twoStringsIgnoreCase = default!;
private SearchValues<string> _worstCaseOrdinal = default!;
[Params(100, 1000, 10000)]
public int HaystackLength { get; set; }
[Params("Early", "Middle", "Late", "None")]
public string MatchPosition { get; set; } = default!;
[GlobalSetup]
public void Setup()
{
Random rand = new(42);
// Best case: haystack has no 'h' or 'w' (first chars of "hello"/"world")
char[] charsBest = new char[HaystackLength];
for (int i = 0; i < charsBest.Length; i++)
charsBest[i] = (char)('m' + rand.Next(10)); // m-v range, no h or w
// Worst case: haystack has many 'a's (first char of both needles)
char[] charsWorst = new char[HaystackLength];
for (int i = 0; i < charsWorst.Length; i++)
charsWorst[i] = rand.Next(4) == 0 ? 'a' : (char)('b' + rand.Next(10)); // 25% 'a's
string needle1Best = "hello";
string needle2Best = "world";
string needle1Worst = "abc";
string needle2Worst = "axy";
int insertPos = MatchPosition switch
{
"Early" => 10,
"Middle" => HaystackLength / 2,
"Late" => HaystackLength - 10,
_ => -1
};
if (insertPos >= 0 && insertPos + needle1Best.Length <= HaystackLength)
{
needle1Best.AsSpan().CopyTo(charsBest.AsSpan(insertPos));
needle1Worst.AsSpan().CopyTo(charsWorst.AsSpan(insertPos));
}
_haystackBestCase = new string(charsBest);
_haystackWorstCase = new string(charsWorst);
_twoStringsOrdinal = SearchValues.Create([needle1Best, needle2Best], StringComparison.Ordinal);
_twoStringsIgnoreCase = SearchValues.Create([needle1Best, needle2Best], StringComparison.OrdinalIgnoreCase);
_worstCaseOrdinal = SearchValues.Create([needle1Worst, needle2Worst], StringComparison.Ordinal);
}
[Benchmark]
public int BestCase_Ordinal() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsOrdinal);
[Benchmark]
public int BestCase_IgnoreCase() => _haystackBestCase.AsSpan().IndexOfAny(_twoStringsIgnoreCase);
[Benchmark]
public int WorstCase_Ordinal() => _haystackWorstCase.AsSpan().IndexOfAny(_worstCaseOrdinal);
} |
...s/System.Private.CoreLib/src/System/SearchValues/Strings/Helpers/CharacterFrequencyHelper.cs
Co-authored-by: stephentoub <[email protected]>
Co-authored-by: stephentoub <[email protected]>
Enabled in 41b248e. Uncommented the case-sensitive prefix matching code block. All 31,470 regex tests pass (30,469 functional + 1,001 unit). |
|
@MihuBot benchmark Regex_Industry https://github.com/MihaZupan/performance/tree/compiled-regex-only -medium |
main PR
Description
Adds `TwoStringSearchValuesPackedThreeChars<TCaseSensitivity>` - a packed vectorized specialization for `SearchValues<string>` when exactly two strings are provided.
Approach: Uses packed byte comparisons similar to `SingleStringSearchValuesPackedThreeChars` to process twice as many characters per iteration:
- `LoadPacked128/256/512` to pack two `Vector<ushort>` inputs into one `Vector<byte>`
- `StartsWith` for both values
Changes:
- `TwoStringSearchValuesPackedThreeChars.cs` - New packed vectorized implementation (Vector128/256/512)
- `CharacterFrequencyHelper.cs` - Added `GetSharedSecondCharacterOffset` helper that finds the offset with lowest combined ASCII character frequency across both values
- `StringSearchValues.cs` - Added `TryCreateForTwoValues` factory method
- `RegexFindOptimizations.cs` - Enabled case-sensitive prefix matching for two-string alternations to use the new specialization
Constraints:
Customer Impact
N/A - This is a performance optimization, not a servicing fix.
Regression
No - new feature.
Testing
All 52,148 existing StringSearchValues tests pass. All 31,470 regex tests pass (30,469 functional + 1,001 unit). Added comprehensive test cases for two-string search patterns including:
Risk
Low - falls back to existing Teddy/Aho-Corasick for unsupported cases. Implementation follows established patterns from `SingleStringSearchValuesPackedThreeChars`.
Package authoring no longer needed in .NET 9
IMPORTANT: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version.
Keep in mind that we still need package authoring in .NET 8 and older versions.
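As a rough illustration of the packed comparison mentioned under Approach in the description, the sketch below narrows two `Vector128<ushort>` inputs into one `Vector128<byte>` and compares all 16 positions against two anchor bytes at once. `PackedCompareSketch` and `MatchMask` are hypothetical names, and `Vector128.Narrow` is used here as a stand-in; the PR's `LoadPacked128/256/512` helpers may pack differently. Because narrowing keeps only the low byte of each char, non-ASCII input can produce false positives, so every set bit is only a candidate that still needs full verification.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class PackedCompareSketch
{
    // Returns a 16-bit mask: bit i is set when the low byte of chars[i]
    // equals either anchor byte. Only the first 16 chars are inspected.
    public static uint MatchMask(ReadOnlySpan<char> chars, byte anchorA, byte anchorB)
    {
        if (chars.Length < 16)
        {
            throw new ArgumentException("Need at least 16 chars.", nameof(chars));
        }

        ReadOnlySpan<ushort> values = MemoryMarshal.Cast<char, ushort>(chars);

        // Two vectors of 8 UTF-16 code units each.
        Vector128<ushort> lower = Vector128.Create<ushort>(values.Slice(0, 8));
        Vector128<ushort> upper = Vector128.Create<ushort>(values.Slice(8, 8));

        // Pack 16 code units into 16 bytes (low byte of each unit).
        Vector128<byte> packed = Vector128.Narrow(lower, upper);

        // One comparison per anchor covers all 16 positions.
        Vector128<byte> matches =
            Vector128.Equals(packed, Vector128.Create(anchorA)) |
            Vector128.Equals(packed, Vector128.Create(anchorB));

        return matches.ExtractMostSignificantBits();
    }
}
```

Compared with comparing `ushort` lanes directly, each byte-wise comparison covers twice as many haystack positions per vector, which is the source of the "twice as many characters per iteration" claim.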
Original prompt
`SearchValues<string>` specialization for two or three input strings #91795