-
Notifications
You must be signed in to change notification settings - Fork 587
Mark multiples optimizations #1000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Implements an optimized prime sieve algorithm (solution_5) focused on ARM processor performance with enhanced bit manipulation and cache-friendly operations. The changes introduce significant algorithmic improvements to the mark_multiples function while maintaining cross-platform compatibility.
- Enhanced BitArray class with 64-byte aligned memory allocation and word-wise bit operations
- Optimized mark_multiples method with adaptive algorithms and unrolled loops
- Fast prime finding using 64-bit word scanning with compiler intrinsics
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
PrimeCPP/solution_5/run.sh | Build script with optimized compiler flags for ARM performance |
PrimeCPP/solution_5/PrimeCPP_array.cpp | Complete optimized prime sieve implementation with advanced bit manipulation |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
All ARM-dependent optimizations are removed, and this now works well on x86 and ARM. |
Now, tell me. You did actually time this PR so you could claim number 1,000, didn't you? |
No, I just noticed that, but I'll take it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I ran my own contributing requirements review using the same model, and this is what it came back with. Short story, some things do need to be changed.
Note that:
- I have made some minor edits, to make the suggestions more useful, and indicate what suggestions are optional.
- I did not review the code change suggestions that are included line-by-line, so these should be used with care.
- I'm posting this, so I take ownership of the feedback.
Compliance review: PrimeCPP/solution_5
This is a compliance review of PrimeCPP/solution_5 against the repository’s CONTRIBUTING.md.
Summary
- Present: PrimeCPP_array.cpp, run.sh
- Missing: README.md (required), Dockerfile (required)
- Classification:
- algorithm=base
- faithful=yes
- parallel=yes
- bits=1
- Output: Not compliant; extra text is printed to stdout alongside the required result line; label should be made solution-unique; default limit differs from 1,000,000.
Details
- Presence of necessary files
- Required but missing:
- README.md with description, run instructions, and sample output (badges optional if tags are in program output).
- Dockerfile that builds and runs the solution, emitting only the result line to stdout (auxiliary text to stderr).
- Placement is correct: PrimeCPP/solution_5.
- Optional architecture flags (arch-amd64/arch-arm64): not needed for this C++ solution.
- Characteristics classification
- Algorithm: base
- Odd-only Sieve of Eratosthenes; factors start at 3 up to floor(sqrt(n)); marking starts at factor^2; no wheel factorization.
- Faithfulness: yes
- prime_sieve encapsulates state; BitArray is dynamically allocated and constructed per pass; sieve is computed algorithmically; no precomputed primes; no external deps.
- Parallelism: yes
- Spawns multiple threads; each performs independent passes and totals are aggregated; thread count is reported.
- Bits per candidate: 1
- One bit per odd number; even numbers are implicit composites; flags are stored in a byte array.
- Output format and behavior
- Required single line to stdout:
- ;;<total_time>;<num_threads>;
- Tags must include at least: algorithm, faithful, bits (e.g., algorithm=base,faithful=yes,bits=1).
- Current issues:
- Multiple human-readable lines are also printed to stdout before the required line; these must go to stderr.
- Label is “davepl_array” (not solution-unique for this submission); include a unique solution label.
- With -q, output diverges from the required format.
- Duration:
- Default behavior runs timed loops; oneshot simulates a 5-second run. Ensure timed runs are at least 5 seconds by default and that any auxiliary mode still emits exactly one standardized line.
- Other notes
- Default limit:
- Rule says calculate primes up to 1,000,000. Default is currently 10,000,000. Recommend changing default to 1,000,000.
- CLI seconds (optional, but recommended):
- Honor the requested seconds value in the threaded loop. Default remains 5 seconds.
- Hadolint (optional):
- After adding Dockerfile, validate with hadolint.sh CPP 5
Recommended changes and patches
A) Make stdout/stderr compliant, fix label, honor default limit
- Route all non-result output (stats, prime listings, help) to stderr.
- Always print exactly one standardized result line to stdout.
- Change default limit to 1,000,000.
- Honor requested seconds in the threaded loop; keep oneshot consistent.
// ...existing code...
const uint64_t DEFAULT_UPPER_LIMIT = 1'000'000LLU; // was 10'000'000
// ...existing code...
void printResults(bool showResults, double duration, size_t passes, size_t threads) const
{
if (showResults)
cerr << "2, ";
size_t count = (Bits.size() >= 2); // Count 2 as prime if in range
for (uint64_t num = 3; num <= Bits.size(); num += 2)
{
if (Bits.get(num))
{
if (showResults)
cerr << num << ", ";
count++;
}
}
if (showResults)
cerr << "\n";
// Human-readable stats -> stderr
cerr << "Passes: " << passes << ", "
<< "Threads: " << threads << ", "
<< "Time: " << duration << ", "
<< "Average: " << duration/passes << ", "
<< "Limit: " << Bits.size() << ", "
<< "Counts: " << count << "/" << countPrimes() << ", "
<< "Valid: " << (validateResults() ? "Pass" : "FAIL!")
<< "\n";
// Single required result line -> stdout (adjust label to your username)
cout << "davepl_array_optimized;" << passes << ";" << duration << ";" << threads
<< ";algorithm=base,faithful=yes,bits=1\n";
}
// ...existing code...
int main(int argc, char **argv)
{
// ...existing code...
for (auto i = args.begin(); i != args.end(); ++i)
{
if (*i == "-h" || *i == "--help")
{
cerr << "Syntax: " << argv[0] << " [-t,--threads threads] [-s,--seconds seconds] [-l,--limit limit] [-1,--oneshot] [-q,--quiet] [-h]" << endl;
return 0;
}
// ...existing code handling args...
else
{
fprintf(stderr, "Unknown argument: %s\n", i->c_str());
return 0;
}
}
if (!bQuiet)
{
cerr << "Primes Benchmark (c) 2025 Dave's Garage - https://github.com/davepl/primes" << endl;
cerr << "--------------------------------------------------------------------------" << endl;
}
if (bOneshot)
cerr << "Oneshot is on. A single pass will be used to simulate a run of the requested duration." << endl;
if (bOneshot && (cSecondsRequested > 0 || cThreadsRequested > 1))
{
cerr << "Oneshot option cannot be mixed with second count or thread count." << endl;
return 0;
}
auto cPasses = 0;
auto cSeconds = (cSecondsRequested ? cSecondsRequested : 5);
auto cThreads = (cThreadsRequested ? cThreadsRequested : thread::hardware_concurrency());
auto llUpperLimit = (ullLimitRequested ? ullLimitRequested : DEFAULT_UPPER_LIMIT);
if (!bQuiet)
{
fprintf(stderr, "Computing primes to %llu on %d thread%s for %d second%s.\n",
(unsigned long long)llUpperLimit,
cThreads,
cThreads == 1 ? "" : "s",
cSeconds,
cSeconds == 1 ? "" : "s"
);
}
double duration;
if (bOneshot)
{
auto tStart = steady_clock::now();
prime_sieve(llUpperLimit).runSieve();
auto tEnd = steady_clock::now() - tStart;
duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
}
else
{
auto tStart = steady_clock::now();
std::vector<std::thread> threads(cThreads);
std::vector<uint64_t> l_passes(cThreads);
for (unsigned int i = 0; i < cThreads; i++)
{
threads[i] = std::thread([i, &l_passes, &tStart, cSeconds](uint64_t limit)
{
l_passes[i] = 0;
while (duration_cast<seconds>(steady_clock::now() - tStart).count() < cSeconds) {
prime_sieve(limit).runSieve();
++l_passes[i];
}
}, llUpperLimit);
}
for (auto i = 0; i < cThreads; i++)
{
threads[i].join();
cPasses += l_passes[i];
}
auto tEnd = steady_clock::now() - tStart;
duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
}
if (bOneshot)
{
cPasses = static_cast<size_t>(1.0 / duration * cSeconds);
duration = static_cast<double>(cSeconds);
}
prime_sieve checkSieve(llUpperLimit);
checkSieve.runSieve();
auto result = checkSieve.validateResults() ? checkSieve.countPrimes() : 0;
// Always emit the standardized line; all other text goes to stderr already
checkSieve.printResults(bPrintPrimes, duration , cPasses, cThreads);
return (int) result;
}
B) Add README.md
# C++ solution by <your-username>




Odd-only, bit-packed Sieve of Eratosthenes with word-wise multiple marking. Multi-threaded across independent passes.
## Run instructions
- Docker:
- docker build -t primes-cpp-solution-5 .
- docker run --rm primes-cpp-solution-5
- Local:
- g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
- ./primes
## Output
```
davepl_array_optimized;1234;5.00312;8;algorithm=base,faithful=yes,bits=1
```
C) Add Dockerfile
FROM ubuntu:22.04
RUN apt-get update \
&& apt-get install -y --no-install-recommends build-essential ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . /app
RUN g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
# Program must print exactly one result line to stdout; all other output goes to stderr.
CMD ["/app/primes"]
D) Keep run.sh simple and aligned (optional)
#!/usr/bin/env bash
set -euo pipefail
g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
./primes
Validation checklist
- Files:
- README.md present with required sections.
- Dockerfile builds and runs; hadolint passes: hadolint.sh CPP 5.
- Behavior:
- Default limit = 1,000,000.
- Runs for ≥ 5 seconds by default (optional: configurable via --seconds).
- Single standardized line to stdout: ;;<total_time>;<num_threads>;.
- Tags include algorithm=base,faithful=yes,bits=1.
- All other output goes to stderr.
- Characteristics:
- algorithm=base, faithful=yes, parallel=yes, bits=1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Optimizations, primarily focused on mark_multiples.
Solution_5 is a significantly optimized version of solution_2, specifically designed to improve performance on ARM processors while maintaining cross-platform compatibility. The changes focus on advanced bit manipulation, memory alignment, and cache-friendly algorithms.
Key Performance Improvements
Memory Alignment: Added 64-byte aligned memory allocation using posix_memalign() to optimize ARM SIMD operations
Dual Allocation Strategy: Graceful fallback to standard allocation if alignment fails
Compiler Hints: Added LIKELY/UNLIKELY branch prediction hints and attribute((always_inline)) for critical functions
Fast Prime Finding: New find_next_prime_bit() method uses 64-bit word scanning with __builtin_ctzll() for efficient prime discovery
Word-wise Processing: Complete rewrite to process 64-bit words instead of individual bits
Adaptive Algorithm: Switches between optimized scalar loop (for large steps) and word-wise processing (for small steps)
Unrolled Loops: 8-way unrolling in scalar path and 4-way unrolling in word processing for better instruction-level parallelism
Cache-friendly mask precomputation with step pattern optimization
Bulk Memory Operations: Process up to 4 words simultaneously for improved memory throughput
Optimized Sieve Algorithm
Efficient Prime Scanning: Uses bit-domain scanning
Reduced Function Call Overhead: Direct bit manipulation instead of repeated method calls
Configuration and Portability
Compile-time Tuning: BITSTEP_WORDWISE_THRESHOLD macro for algorithm selection tuning
Platform Detection: Compiler-specific optimizations with fallbacks for maximum portability
Updated Headers: Added for POSIX compliance
Technical Details Reviewers Should Focus On
Algorithm Correctness: Verify that the word-wise bit manipulation in mark_multiples() correctly handles:
Bit alignment across word boundaries
Tail byte processing for partial words
Step mask computation and application
Memory Safety: Check the aligned memory allocation and deallocation paths
Check that strip-only configurations aren't affected
drag-race
as the target branch.