Mark multiples optimizations #1000

davepl · 2025-08-31T23:30:49Z

Optimizations, primarily focused on mark_multiples.

Solution_5 is a significantly optimized version of solution_2, specifically designed to improve performance on ARM processors while maintaining cross-platform compatibility. The changes focus on advanced bit manipulation, memory alignment, and cache-friendly algorithms.

Key Performance Improvements

Enhanced BitArray Class

Memory Alignment: Added 64-byte aligned memory allocation using posix_memalign() to optimize ARM SIMD operations
Dual Allocation Strategy: Graceful fallback to standard allocation if alignment fails
Compiler Hints: Added LIKELY/UNLIKELY branch prediction hints and attribute((always_inline)) for critical functions
Fast Prime Finding: New find_next_prime_bit() method uses 64-bit word scanning with __builtin_ctzll() for efficient prime discovery

mark_multiples() Method

Word-wise Processing: Complete rewrite to process 64-bit words instead of individual bits
Adaptive Algorithm: Switches between optimized scalar loop (for large steps) and word-wise processing (for small steps)
Unrolled Loops: 8-way unrolling in scalar path and 4-way unrolling in word processing for better instruction-level parallelism
Cache-friendly mask precomputation with step pattern optimization
Bulk Memory Operations: Process up to 4 words simultaneously for improved memory throughput

Optimized Sieve Algorithm
Efficient Prime Scanning: Uses bit-domain scanning
Reduced Function Call Overhead: Direct bit manipulation instead of repeated method calls
Configuration and Portability

Compile-time Tuning: BITSTEP_WORDWISE_THRESHOLD macro for algorithm selection tuning
Platform Detection: Compiler-specific optimizations with fallbacks for maximum portability
Updated Headers: Added for POSIX compliance
Technical Details Reviewers Should Focus On
Algorithm Correctness: Verify that the word-wise bit manipulation in mark_multiples() correctly handles:

Bit alignment across word boundaries
Tail byte processing for partial words
Step mask computation and application
Memory Safety: Check the aligned memory allocation and deallocation paths

Check that strip-only configurations aren't affected

I read the contribution guidelines in CONTRIBUTING.md.
I placed my solution in the correct solution folder.
I added a README.md with the right badge(s).
I added a Dockerfile that builds and runs my solution.
I selected drag-race as the target branch.
All code herein is licensed compatible with BSD-3.

Copilot

Pull Request Overview

Implements an optimized prime sieve algorithm (solution_5) focused on ARM processor performance with enhanced bit manipulation and cache-friendly operations. The changes introduce significant algorithmic improvements to the mark_multiples function while maintaining cross-platform compatibility.

Enhanced BitArray class with 64-byte aligned memory allocation and word-wise bit operations
Optimized mark_multiples method with adaptive algorithms and unrolled loops
Fast prime finding using 64-bit word scanning with compiler intrinsics

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
PrimeCPP/solution_5/run.sh	Build script with optimized compiler flags for ARM performance
PrimeCPP/solution_5/PrimeCPP_array.cpp	Complete optimized prime sieve implementation with advanced bit manipulation

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

PrimeCPP/solution_5/PrimeCPP_array.cpp

davepl · 2025-08-31T23:41:04Z

All ARM-dependent optimizations are removed, and this now works well on x86 and ARM.

PrimeCPP/solution_5/PrimeCPP_array.cpp

rbergen · 2025-09-01T06:27:13Z

Now, tell me. You did actually time this PR so you could claim number 1,000, didn't you?

davepl · 2025-09-03T00:56:43Z

No, I just noticed that, but I'll take it!

rbergen

So, I ran my own contributing requirements review using the same model, and this is what it came back with. Short story, some things do need to be changed.

Note that:

I have made some minor edits, to make the suggestions more useful, and indicate what suggestions are optional.
I did not review the code change suggestions that are included line-by-line, so these should be used with care.
I'm posting this, so I take ownership of the feedback.

Compliance review: PrimeCPP/solution_5

This is a compliance review of PrimeCPP/solution_5 against the repository’s CONTRIBUTING.md.

Summary

Present: PrimeCPP_array.cpp, run.sh
Missing: README.md (required), Dockerfile (required)
Classification:
- algorithm=base
- faithful=yes
- parallel=yes
- bits=1
Output: Not compliant; extra text is printed to stdout alongside the required result line; label should be made solution-unique; default limit differs from 1,000,000.

Details

Presence of necessary files

Required but missing:
- README.md with description, run instructions, and sample output (badges optional if tags are in program output).
- Dockerfile that builds and runs the solution, emitting only the result line to stdout (auxiliary text to stderr).
Placement is correct: PrimeCPP/solution_5.
Optional architecture flags (arch-amd64/arch-arm64): not needed for this C++ solution.

Characteristics classification

Algorithm: base
- Odd-only Sieve of Eratosthenes; factors start at 3 up to floor(sqrt(n)); marking starts at factor^2; no wheel factorization.
Faithfulness: yes
- prime_sieve encapsulates state; BitArray is dynamically allocated and constructed per pass; sieve is computed algorithmically; no precomputed primes; no external deps.
Parallelism: yes
- Spawns multiple threads; each performs independent passes and totals are aggregated; thread count is reported.
Bits per candidate: 1
- One bit per odd number; even numbers are implicit composites; flags are stored in a byte array.

Output format and behavior

Required single line to stdout:
- ;;<total_time>;<num_threads>;
- Tags must include at least: algorithm, faithful, bits (e.g., algorithm=base,faithful=yes,bits=1).
Current issues:
- Multiple human-readable lines are also printed to stdout before the required line; these must go to stderr.
- Label is “davepl_array” (not solution-unique for this submission); include a unique solution label.
- With -q, output diverges from the required format.
Duration:
- Default behavior runs timed loops; oneshot simulates a 5-second run. Ensure timed runs are at least 5 seconds by default and that any auxiliary mode still emits exactly one standardized line.

Other notes

Default limit:
- Rule says calculate primes up to 1,000,000. Default is currently 10,000,000. Recommend changing default to 1,000,000.
CLI seconds (optional, but recommended):
- Honor the requested seconds value in the threaded loop. Default remains 5 seconds.
Hadolint (optional):
- After adding Dockerfile, validate with hadolint.sh CPP 5

Recommended changes and patches

A) Make stdout/stderr compliant, fix label, honor default limit

Route all non-result output (stats, prime listings, help) to stderr.
Always print exactly one standardized result line to stdout.
Change default limit to 1,000,000.
Honor requested seconds in the threaded loop; keep oneshot consistent.

// ...existing code...
const uint64_t DEFAULT_UPPER_LIMIT = 1'000'000LLU; // was 10'000'000
// ...existing code...

      void printResults(bool showResults, double duration, size_t passes, size_t threads) const
      {
          if (showResults)
              cerr << "2, ";

          size_t count = (Bits.size() >= 2); // Count 2 as prime if in range
          for (uint64_t num = 3; num <= Bits.size(); num += 2)
          {
              if (Bits.get(num))
              {
                  if (showResults)
                      cerr << num << ", ";
                  count++;
              }
          }

          if (showResults)
              cerr << "\n";

          // Human-readable stats -> stderr
          cerr << "Passes: "  << passes << ", "
               << "Threads: " << threads << ", "
               << "Time: "    << duration << ", "
               << "Average: " << duration/passes << ", "
               << "Limit: "   << Bits.size() << ", "
               << "Counts: "  << count << "/" << countPrimes() << ", "
               << "Valid: "   << (validateResults() ? "Pass" : "FAIL!")
               << "\n";

          // Single required result line -> stdout (adjust label to your username)
          cout << "davepl_array_optimized;" << passes << ";" << duration << ";" << threads
               << ";algorithm=base,faithful=yes,bits=1\n";
      }
// ...existing code...

int main(int argc, char **argv)
{
    // ...existing code...

    for (auto i = args.begin(); i != args.end(); ++i)
    {
        if (*i == "-h" || *i == "--help") 
        {
            cerr << "Syntax: " << argv[0] << " [-t,--threads threads] [-s,--seconds seconds] [-l,--limit limit] [-1,--oneshot] [-q,--quiet] [-h]" << endl;
            return 0;
        }
        // ...existing code handling args...
        else
        {
            fprintf(stderr, "Unknown argument: %s\n", i->c_str());
            return 0;
        }
    }

    if (!bQuiet)
    {
        cerr << "Primes Benchmark (c) 2025 Dave's Garage - https://github.com/davepl/primes" << endl;
        cerr << "--------------------------------------------------------------------------" << endl;
    }

    if (bOneshot)
        cerr << "Oneshot is on. A single pass will be used to simulate a run of the requested duration." << endl;

    if (bOneshot && (cSecondsRequested > 0 || cThreadsRequested > 1))
    {
        cerr << "Oneshot option cannot be mixed with second count or thread count." << endl;
        return 0;
    }

    auto cPasses      = 0;
    auto cSeconds     = (cSecondsRequested ? cSecondsRequested : 5);
    auto cThreads     = (cThreadsRequested ? cThreadsRequested : thread::hardware_concurrency());
    auto llUpperLimit = (ullLimitRequested ? ullLimitRequested : DEFAULT_UPPER_LIMIT);

    if (!bQuiet)
    {
        fprintf(stderr, "Computing primes to %llu on %d thread%s for %d second%s.\n",
            (unsigned long long)llUpperLimit,
            cThreads,
            cThreads == 1 ? "" : "s",
            cSeconds,
            cSeconds == 1 ? "" : "s"
        );
    }
    double duration;

    if (bOneshot)
    {
        auto tStart = steady_clock::now();
        prime_sieve(llUpperLimit).runSieve();
        auto tEnd = steady_clock::now() - tStart;
        duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
    }
    else
    {
        auto tStart = steady_clock::now();
        std::vector<std::thread> threads(cThreads);
        std::vector<uint64_t> l_passes(cThreads);
        for (unsigned int i = 0; i < cThreads; i++)
        {
            threads[i] = std::thread([i, &l_passes, &tStart, cSeconds](uint64_t limit)
            {
                l_passes[i] = 0;
                while (duration_cast<seconds>(steady_clock::now() - tStart).count() < cSeconds) {
                    prime_sieve(limit).runSieve();
                    ++l_passes[i];
                }
            }, llUpperLimit);
        }
        for (auto i = 0; i < cThreads; i++) 
        {
            threads[i].join();
            cPasses += l_passes[i];
        }
        auto tEnd = steady_clock::now() - tStart;
        duration = duration_cast<microseconds>(tEnd).count()/1000000.0;
    }

    if (bOneshot)
    {
        cPasses = static_cast<size_t>(1.0 / duration * cSeconds);
        duration = static_cast<double>(cSeconds);
    }

    prime_sieve checkSieve(llUpperLimit);
    checkSieve.runSieve();
    auto result = checkSieve.validateResults() ? checkSieve.countPrimes() : 0;

    // Always emit the standardized line; all other text goes to stderr already
    checkSieve.printResults(bPrintPrimes, duration , cPasses, cThreads);

    return (int) result;
}

B) Add README.md

# C++ solution by <your-username>

![Algorithm](https://img.shields.io/badge/Algorithm-base-green)
![Faithfulness](https://img.shields.io/badge/Faithful-yes-green)
![Parallelism](https://img.shields.io/badge/Parallel-yes-green)
![Bit%20count](https://img.shields.io/badge/Bits-1-green)

Odd-only, bit-packed Sieve of Eratosthenes with word-wise multiple marking. Multi-threaded across independent passes.

## Run instructions
- Docker:
  - docker build -t primes-cpp-solution-5 .
  - docker run --rm primes-cpp-solution-5
- Local:
  - g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
  - ./primes

## Output
```
davepl_array_optimized;1234;5.00312;8;algorithm=base,faithful=yes,bits=1
```

C) Add Dockerfile

FROM ubuntu:22.04

RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential ca-certificates \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . /app

RUN g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes

# Program must print exactly one result line to stdout; all other output goes to stderr.
CMD ["/app/primes"]

D) Keep run.sh simple and aligned (optional)

#!/usr/bin/env bash
set -euo pipefail

g++ -O3 -std=c++17 -pthread PrimeCPP_array.cpp -o primes
./primes

Validation checklist

Files:
- README.md present with required sections.
- Dockerfile builds and runs; hadolint passes: hadolint.sh CPP 5.
Behavior:
- Default limit = 1,000,000.
- Runs for ≥ 5 seconds by default (optional: configurable via --seconds).
- Single standardized line to stdout: ;;<total_time>;<num_threads>;.
- Tags include algorithm=base,faithful=yes,bits=1.
- All other output goes to stderr.
Characteristics:
- algorithm=base, faithful=yes, parallel=yes, bits=1.

rbergen

LGTM!

davepl added 13 commits August 31, 2025 11:15

Add new ARM-specific version

a291f76

getOddBitByIndex

d23cb05

Modulo-free mask indexing and tail per-bit loop

d64533a

Fix format

103ec78

Add note

9897cea

Cleanup

4c1b556

Move things to solution_5

1d72f72

Mark as executable

a399fcd

Remove last ARM code

a51bc99

Aggressive loop unrolling for 31% increase

2739dcc

Turn on branch hints

927a87b

Make it compliant with the rules

bd524e1

Cleanup

37578d3

davepl requested review from Copilot and rbergen August 31, 2025 23:39

Copilot AI reviewed Aug 31, 2025

View reviewed changes

PrimeCPP/solution_5/PrimeCPP_array.cpp Outdated Show resolved Hide resolved

PrimeCPP/solution_5/PrimeCPP_array.cpp Outdated Show resolved Hide resolved

PrimeCPP/solution_5/PrimeCPP_array.cpp Show resolved Hide resolved

Cleanup

2c06bd7

davepl commented Sep 1, 2025

View reviewed changes

PrimeCPP/solution_5/PrimeCPP_array.cpp Show resolved Hide resolved

davepl added 5 commits August 31, 2025 18:43

Remove stray period

507362a

Clean up line breaks, whitespace

980dc73

More cleanup and reformatting

1bf91a3

Clean up braces

aebe85a

Remove bounds check

f8d2488

rbergen requested changes Sep 3, 2025

View reviewed changes

davepl added 3 commits September 3, 2025 11:46

Update PR

33c5ec9

Fix dockerfile

177719a

Fix native flag

f9ebbaa

Send non-benchmark output to stderr, make label unique

262fef2

rbergen approved these changes Sep 4, 2025

View reviewed changes

davepl merged commit 4f191ff into drag-race Sep 4, 2025
341 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mark multiples optimizations #1000

Mark multiples optimizations #1000

Uh oh!

davepl commented Aug 31, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepl commented Aug 31, 2025

Uh oh!

Uh oh!

rbergen commented Sep 1, 2025

Uh oh!

davepl commented Sep 3, 2025

Uh oh!

rbergen left a comment

Uh oh!

rbergen left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Mark multiples optimizations #1000

Mark multiples optimizations #1000

Uh oh!

Conversation

davepl commented Aug 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davepl commented Aug 31, 2025

Uh oh!

Uh oh!

rbergen commented Sep 1, 2025

Uh oh!

davepl commented Sep 3, 2025

Uh oh!

rbergen left a comment

Choose a reason for hiding this comment

Compliance review: PrimeCPP/solution_5

Uh oh!

rbergen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

davepl commented Aug 31, 2025 •

edited

Loading