Skip to content

Conversation

davepl
Copy link
Contributor

@davepl davepl commented Aug 31, 2025

Optimizations, primarily focused on mark_multiples.

Solution_5 is a significantly optimized version of solution_2, specifically designed to improve performance on ARM processors while maintaining cross-platform compatibility. The changes focus on advanced bit manipulation, memory alignment, and cache-friendly algorithms.

Key Performance Improvements

  1. Enhanced BitArray Class

Memory Alignment: Added 64-byte aligned memory allocation using posix_memalign() to optimize ARM SIMD operations
Dual Allocation Strategy: Graceful fallback to standard allocation if alignment fails
Compiler Hints: Added LIKELY/UNLIKELY branch prediction hints and attribute((always_inline)) for critical functions
Fast Prime Finding: New find_next_prime_bit() method uses 64-bit word scanning with __builtin_ctzll() for efficient prime discovery

  1. mark_multiples() Method

Word-wise Processing: Complete rewrite to process 64-bit words instead of individual bits
Adaptive Algorithm: Switches between optimized scalar loop (for large steps) and word-wise processing (for small steps)
Unrolled Loops: 8-way unrolling in scalar path and 4-way unrolling in word processing for better instruction-level parallelism
Cache-friendly mask precomputation with step pattern optimization
Bulk Memory Operations: Process up to 4 words simultaneously for improved memory throughput

  1. Optimized Sieve Algorithm
    Efficient Prime Scanning: Uses bit-domain scanning
    Reduced Function Call Overhead: Direct bit manipulation instead of repeated method calls

  2. Configuration and Portability

Compile-time Tuning: BITSTEP_WORDWISE_THRESHOLD macro for algorithm selection tuning
Platform Detection: Compiler-specific optimizations with fallbacks for maximum portability
Updated Headers: Added for POSIX compliance
Technical Details Reviewers Should Focus On
Algorithm Correctness: Verify that the word-wise bit manipulation in mark_multiples() correctly handles:

Bit alignment across word boundaries
Tail byte processing for partial words
Step mask computation and application
Memory Safety: Check the aligned memory allocation and deallocation paths

Check that strip-only configurations aren't affected

  • I read the contribution guidelines in CONTRIBUTING.md.
  • I placed my solution in the correct solution folder.
  • I added a README.md with the right badge(s).
  • I added a Dockerfile that builds and runs my solution.
  • I selected drag-race as the target branch.
  • All code herein is licensed compatible with BSD-3.

@davepl davepl requested review from Copilot and rbergen August 31, 2025 23:39
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Implements an optimized prime sieve algorithm (solution_5) focused on ARM processor performance with enhanced bit manipulation and cache-friendly operations. The changes introduce significant algorithmic improvements to the mark_multiples function while maintaining cross-platform compatibility.

  • Enhanced BitArray class with 64-byte aligned memory allocation and word-wise bit operations
  • Optimized mark_multiples method with adaptive algorithms and unrolled loops
  • Fast prime finding using 64-bit word scanning with compiler intrinsics

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
PrimeCPP/solution_5/run.sh Build script with optimized compiler flags for ARM performance
PrimeCPP/solution_5/PrimeCPP_array.cpp Complete optimized prime sieve implementation with advanced bit manipulation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@davepl
Copy link
Contributor Author

davepl commented Aug 31, 2025

All ARM-dependent optimizations are removed, and this now works well on x86 and ARM.

@rbergen
Copy link
Contributor

rbergen commented Sep 1, 2025

Now, tell me. You did actually time this PR so you could claim number 1,000, didn't you?

@davepl
Copy link
Contributor Author

davepl commented Sep 3, 2025

No, I just noticed that, but I'll take it!

Copy link
Contributor

@rbergen rbergen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I ran my own contributing requirements review using the same model, and this is what it came back with. Short story, some things do need to be changed.

Note that:

  • I have made some minor edits, to make the suggestions more useful, and indicate what suggestions are optional.
  • I did not review the code change suggestions that are included line-by-line, so these should be used with care.
  • I'm posting this, so I take ownership of the feedback.

Compliance review: PrimeCPP/solution_5

This is a compliance review of PrimeCPP/solution_5 against the repository’s CONTRIBUTING.md.

Summary

  • Present: PrimeCPP_array.cpp, run.sh
  • Missing: README.md (required), Dockerfile (required)
  • Classification:
    • algorithm=base
    • faithful=yes
    • parallel=yes
    • bits=1
  • Output: Not compliant; extra text is printed to stdout alongside the required result line; label should be made solution-unique; default limit differs from 1,000,000.

Details

  1. Presence of necessary files
  • Required but missing:
    • README.md with description, run instructions, and sample output (badges optional if tags are in program output).
    • Dockerfile that builds and runs the solution, emitting only the result line to stdout (auxiliary text to stderr).
  • Placement is correct: PrimeCPP/solution_5.
  • Optional architecture flags (arch-amd64/arch-arm64): not needed for this C++ solution.
  1. Characteristics classification
  • Algorithm: base
    • Odd-only Sieve of Eratosthenes; factors start at 3 up to floor(sqrt(n)); marking starts at factor^2; no wheel factorization.
  • Faithfulness: yes
    • prime_sieve encapsulates state; BitArray is dynamically allocated and constructed per pass; sieve is computed algorithmically; no precomputed primes; no external deps.
  • Parallelism: yes
    • Spawns multiple threads; each performs independent passes and totals are aggregated; thread count is reported.
  • Bits per candidate: 1
    • One bit per odd number; even numbers are implicit composites; flags are stored in a byte array.
  1. Output format and behavior
  • Required single line to stdout:
    • ;;<total_time>;<num_threads>;
    • Tags must include at least: algorithm, faithful, bits (e.g., algorithm=base,faithful=yes,bits=1).
  • Current issues:
    • Multiple human-readable lines are also printed to stdout before the required line; these must go to stderr.
    • Label is “davepl_array” (not solution-unique for this submission); include a unique solution label.
    • With -q, output diverges from the required format.
  • Duration:
    • Default behavior runs timed loops; oneshot simulates a 5-second run. Ensure timed runs are at least 5 seconds by default and that any auxiliary mode still emits exactly one standardized line.
  1. Other notes
  • Default limit:
    • Rule says calculate primes up to 1,000,000. Default is currently 10,000,000. Recommend changing default to 1,000,000.
  • CLI seconds (optional, but recommended):
    • Honor the requested seconds value in the threaded loop. Default remains 5 seconds.
  • Hadolint (optional):
    • After adding Dockerfile, validate with hadolint.sh CPP 5

Recommended changes and patches

A) Make stdout/stderr compliant, fix label, honor default limit

  • Route all non-result output (stats, prime listings, help) to stderr.
  • Always print exactly one standardized result line to stdout.
  • Change default limit to 1,000,000.
  • Honor requested seconds in the threaded loop; keep oneshot consistent.
// ...existing code...
const uint64_t DEFAULT_UPPER_LIMIT = 1'000'000LLU; // was 10'000'000
// ...existing code...

      void printResults(bool showResults, double duration, size_t passes, size_t threads) const
      {
          if (showResults)
              cerr << "2, ";

          size_t count = (Bits.size() >= 2); // Count 2 as prime if in range
          for (uint64_t num = 3; num <= Bits.size(); num += 2)
          {
              if (Bits.get(num))
              {
                  if (showResults)
                      cerr << num << ", ";
                  count++;
              }
          }

          if (showResults)
              cerr << "\n";

          // Human-readable stats -> stderr
          cerr << "Passes: "  << passes << ", "
               << "Threads: " << threads << ", "
               << "Time: "    << duration << ", "
               << "Average: " << duration/passes << ", "
               << "Limit: "   << Bits.size() << ", "
               << "Counts: "  << count << "/" << countPrimes() << ", "
               << "Valid: "   << (validateResults() ? "Pass" : "FAIL!")
               << "\n";

          // Single required result line -> stdout (adjust label to your username)
          cout << "davepl_array_optimized;" << passes << ";" << duration << ";" << threads
               << ";algorithm=base,faithful=yes,bits=1\n";
      }
// ...existing code...

int main(int argc, char **argv)
{
    // ...existing code...

    for (auto i = args.begin(); i != args.end(); ++i)
    {
        if (*i == "-h" || *i == "--help") 
        {
            cerr << "Syntax: " << argv[0] << " [-t,--threads threads] [-s,--seconds seconds] [-l,--limit limit] [-1,--oneshot] [-q,--quiet] [-h]" << endl;
            return 0;
        }
        // ...existing code handling args...
        else
        {
            fprintf(stderr, "Unknown argument: %s\n", i->c_str());
            return 0;
        }
    }

    if (!bQuiet)
    {
        cerr << "Primes Benchmark (c) 2025 Dave's Garage - https://github.com/davepl/primes" << endl;
        cerr << "--------------------------------------------------------------------------" << endl;
    }

    if (bOneshot)
        cerr << "Oneshot is on. A single pass will be used to simulate a run of the requested duration." << endl;

    if (bOneshot && (cSecondsRequested > 0 || cThreadsRequested > 1))
    {
        cerr << "Oneshot option cannot be mixed with second count or thread count." << endl;
        return 0;
    }

    auto cPasses      = 0;
    auto cSeconds     = (cSecondsRequested ? cSecondsRequested : 5);
    auto cThreads     = (cThreadsRequested ? cThreadsRequested : thread::hardware_concurrency());
    auto llUpperLimit = (ullLimitRequested ? ullLimitRequested : DEFAULT_UPPER_LIMIT);

    if (!bQuiet)
    {
        fprintf(stderr, "Computing primes to %llu on %d thread%s for %d second%s.\n",
            (unsigned long long)llUpperLimit,
            cThreads,
            cThreads == 1 ? "" : "s",
            cSeconds,
            cSeconds == 1 ? "" : "s"
        );
    }
    double duration;

    if (bOneshot)
    {
        auto tStart = steady_clock::now();
        prime_sieve(llUpperLimit).runSieve();
        auto tEnd = steady_clock::now() - tStart;
        duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
    }
    else
    {
        auto tStart = steady_clock::now();
        std::vector<std::thread> threads(cThreads);
        std::vector<uint64_t> l_passes(cThreads);
        for (unsigned int i = 0; i < cThreads; i++)
        {
            threads[i] = std::thread([i, &l_passes, &tStart, cSeconds](uint64_t limit)
            {
                l_passes[i] = 0;
                while (duration_cast<seconds>(steady_clock::now() - tStart).count() < cSeconds) {
                    prime_sieve(limit).runSieve();
                    ++l_passes[i];
                }
            }, llUpperLimit);
        }
        for (auto i = 0; i < cThreads; i++) 
        {
            threads[i].join();
            cPasses += l_passes[i];
        }
        auto tEnd = steady_clock::now() - tStart;
        duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
    }

    if (bOneshot)
    {
        cPasses = static_cast<size_t>(1.0 / duration * cSeconds);
        duration = static_cast<double>(cSeconds);
    }

    prime_sieve checkSieve(llUpperLimit);
    checkSieve.runSieve();
    auto result = checkSieve.validateResults() ? checkSieve.countPrimes() : 0;

    // Always emit the standardized line; all other text goes to stderr already
    checkSieve.printResults(bPrintPrimes, duration , cPasses, cThreads);

    return (int) result;
}

B) Add README.md

# C++ solution by <your-username>

![Algorithm](https://img.shields.io/badge/Algorithm-base-green)
![Faithfulness](https://img.shields.io/badge/Faithful-yes-green)
![Parallelism](https://img.shields.io/badge/Parallel-yes-green)
![Bit%20count](https://img.shields.io/badge/Bits-1-green)

Odd-only, bit-packed Sieve of Eratosthenes with word-wise multiple marking. Multi-threaded across independent passes.

## Run instructions
- Docker:
  - docker build -t primes-cpp-solution-5 .
  - docker run --rm primes-cpp-solution-5
- Local:
  - g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
  - ./primes

## Output
```
davepl_array_optimized;1234;5.00312;8;algorithm=base,faithful=yes,bits=1
```

C) Add Dockerfile

FROM ubuntu:22.04

RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential ca-certificates \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . /app

RUN g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes

# Program must print exactly one result line to stdout; all other output goes to stderr.
CMD ["/app/primes"]

D) Keep run.sh simple and aligned (optional)

#!/usr/bin/env bash
set -euo pipefail

g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
./primes

Validation checklist

  • Files:
    • README.md present with required sections.
    • Dockerfile builds and runs; hadolint passes: hadolint.sh CPP 5.
  • Behavior:
    • Default limit = 1,000,000.
    • Runs for ≥ 5 seconds by default (optional: configurable via --seconds).
    • Single standardized line to stdout: ;;<total_time>;<num_threads>;.
    • Tags include algorithm=base,faithful=yes,bits=1.
    • All other output goes to stderr.
  • Characteristics:
    • algorithm=base, faithful=yes, parallel=yes, bits=1.

Copy link
Contributor

@rbergen rbergen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@davepl davepl merged commit 4f191ff into drag-race Sep 4, 2025
341 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants