@marisn marisn commented Oct 10, 2025

Because random number generation keeps its internal state in a static variable, it is prone to failures when multiple threads attempt to update that state simultaneously. This PR adds a mutex lock around the internal state update calls, preventing clashes when the functions are called from multiple threads.

Contains code to be merged as PR #6479
Also contains necessary changes in Make/CMake to compile gis lib with pthread support.

Should fix issues observed in r.mapcalc after thread support was merged #5742

@marisn marisn requested review from echoix and nilason October 10, 2025 09:19
@github-actions github-actions bot added the C, libraries, and CMake labels Oct 10, 2025
@wenzeslaus
Member

The code changes look straightforward. Do you have an idea about performance? I agree we need the change in any case, but at least knowing whether we need a better solution in the future would be good.

@wenzeslaus wenzeslaus changed the title from "threads: make random number generation thread safe" to "lib: Make random number generation thread safe" Oct 10, 2025
@wenzeslaus
Member

Good news. This makes the failing rand function tests in #6476 pass at least on Linux.

As random number generator tracks its internal state as a static
variable, it is necessary to guard its updates with mutex to
prevent misbehaviour when multiple threads change values simultaneously
@marisn
Contributor Author

marisn commented Oct 17, 2025

> The code changes look straightforward. Do you have an idea about performance? I agree we need the change in any case, but at least knowing whether we need a better solution in the future would be good.

I haven't tested but locked part is small and should be fast (just update, no IO), thus I would expect impact on speed to be negligible.

Member

@wenzeslaus wenzeslaus left a comment

I think the locking needs to extend to reading of the shared state; both writes and reads need to be isolated.

More generally, I'm afraid we didn't finalize r.mapcalc's rand behavior with multiple threads (and seeds). (Or did we?) If r.mapcalc is using these functions directly as before parallelization, the random numbers placed in the raster will depend on the order of threads, i.e., which rows come first, but I did not review the relevant code now. (An ultimate fix there would be different "context" objects for the random numbers, per row (for maximum reproducibility) or per thread (for at least seed+nprocs reproducibility).)

*
* \param[in] seedval 32-bit integer used to seed the PRNG
*
* This function is thread-safe.
Member

Something like the following should indicate that this applies only for certain builds. I don't know if you can actually have a combination where thread safety matters while not having pthreads.

Suggested change
- * This function is thread-safe.
+ * This function is thread-safe (if threading is enabled, i.e., `HAVE_PTHREAD` is defined).

or: ...if pthreads is available,...

Contributor Author

Disagree. This makes no sense. Of course there is no thread safety when code is compiled without threads! This is programmer documentation that is meant to signal to other developers that it is safe to call this function from threaded code. Attempts to compile core GRASS without pthreads and call it from a module with pthreads should not be considered in the documentation of each function.

Comment on lines 62 to 78
void G_srand48(long seedval)
{
    uint32 x = (uint32) * (unsigned long *)&seedval;

#ifdef HAVE_PTHREAD
    pthread_mutex_lock(&lrand48_mutex);
#endif

    x2 = (uint16)HI(x);
    x1 = (uint16)LO(x);
    x0 = (uint16)0x330E;
    seeded = 1;

#ifdef HAVE_PTHREAD
    pthread_mutex_unlock(&lrand48_mutex);
#endif
}
Member

Calling this from multiple threads will reset the global state. So, while the locking will prevent creation of a broken global state and will prevent reading of a half-baked state, it will not work as expected. If you call this from multiple threads, each call will replace the random state set by the previous thread. Only the main thread should call this function, and it would typically call it once unless resetting the seed; that should be documented.

Contributor Author

You are correct. I added a note to the documentation. Mutex guards should remain in place, but seed initialization should always be called only from the main process/a single thread. If called multiple times, it would result in a "last wins" state.
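
The "last wins" effect can be illustrated in a single thread with the POSIX `srand48`/`lrand48` pair, which the hunk above appears to mirror in its seeding scheme (that equivalence is an assumption here, not something verified against the GRASS sources). The helper names `first_after_one_seed` and `first_after_two_seeds` are hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600 /* expose srand48/lrand48 in <stdlib.h> */
#endif
#include <stdlib.h>

/* First value drawn when the generator is seeded exactly once. */
long first_after_one_seed(long seed)
{
    srand48(seed);
    return lrand48();
}

/* First value drawn when a second seeding call follows the first one:
 * the second call overwrites the state, so the first seed has no effect. */
long first_after_two_seeds(long first_seed, long second_seed)
{
    srand48(first_seed);
    srand48(second_seed);
    return lrand48();
}
```

With several threads each calling the seeding function, whichever call happens to run last decides the sequence, which is why seeding once from the main thread before any workers start is the safer pattern.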

Comment on lines 105 to 130
static void G__next(void)
{
#ifdef HAVE_PTHREAD
    pthread_mutex_lock(&lrand48_mutex);
#endif

Member

This approach only locks the advancing, so there will be no broken state. However, consider this scenario:

  1. in Thread 1, mrand48 calls next
  2. in Thread 1, next locks and does the advancing
  3. in Thread 1, next unlocks and returns
  4. in Thread 2, mrand48 calls next
  5. in Thread 2, next locks and does the advancing
  6. in Thread 2, mrand48 reads the static variable values
  7. in Thread 1, mrand48 reads the static variable values which are the same ones as in Thread 2, so the Thread 1 values were never used

Alternatively, Thread 1 can be reading the half-baked state which is being overwritten by Thread 2.

Either the locking needs to happen at the level of mrand48-type functions, i.e., include both reading and writing (or lock separately from writing), or G__next (or its wrapper) needs to return the values rather than rely on the global state.

I prefer the solution which limits the global state and uses explicit return values (or even passes a state around), so my choice would be to return (pass) values from G__next. (In a more substantial rewrite scenario, an explicit "context" object would make the sharing of state explicit.)
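
A minimal sketch of the fix described above, where the advancing step hands back its value and the lock spans both the state update and the read. This is an illustration under assumptions, not the code in this PR: `demo_state`, `demo_mutex`, `demo_srand48`, and `demo_lrand48` are hypothetical names, and the state is held in one 64-bit integer instead of the three 16-bit words seen in the hunk. The multiplier and increment are the standard `lrand48` constants.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-ins for the library's static state and mutex. */
static uint64_t demo_state = 0x330EULL;
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Seed: 32-bit seed in the high bits, 0x330E in the low 16 bits. */
void demo_srand48(uint32_t seedval)
{
    pthread_mutex_lock(&demo_mutex);
    demo_state = ((uint64_t)seedval << 16) | 0x330EULL;
    pthread_mutex_unlock(&demo_mutex);
}

/* Advance the state and read the result inside ONE critical section,
 * so another thread cannot advance the state between the two steps. */
long demo_lrand48(void)
{
    long value;

    pthread_mutex_lock(&demo_mutex);
    demo_state = (0x5DEECE66DULL * demo_state + 0xBULL) & ((1ULL << 48) - 1);
    value = (long)(demo_state >> 17); /* keep the top 31 of 48 bits */
    pthread_mutex_unlock(&demo_mutex);

    assert(value >= 0); /* a 31-bit result is never negative */
    return value;
}
```

Because the value is copied into a local variable before the unlock, the scenario in steps 1 to 7 cannot occur: each caller returns exactly the value produced by its own advance.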

Contributor Author

You are spot-on about the too-narrow scope of the mutex. I changed the code to also include consumption of the state inside the mutex.

The new implementation makes RNG serial – independently from number of threads calling those functions, number sequence always will be the same.

If G__next would return a state object, each thread would have its own state and, with an identical seed, result in an identical stream of random numbers – not good at all. If state is passed around between calls (e.g. G_lrand48(struct state)) we are even worse as then locking/sharing of state would have to be implemented by each caller itself and we still are facing thread synchronization problem. Did I miss something?

Member

@wenzeslaus wenzeslaus Oct 20, 2025

> The new implementation makes RNG serial – independently from number of threads calling those functions, number sequence always will be the same.

Awesome, thanks. That's what I had in mind. Least changes in terms of code structure, although most mutex additions.

> If G__next would return a state object, each thread would have its own state and, with an identical seed, result in an identical stream of random numbers – not good at all.

This requires starting the different states with different seeds. This works well for different stochastic replicates of the same simulation. It would require additional decisions on how to behave within one raster.

> If state is passed around between calls (e.g. G_lrand48(struct state)) we are even worse as then locking/sharing of state would have to be implemented by each caller itself and we still are facing thread synchronization problem.

The mutex would have to be part of that state and access would have to be only through functions, basically an OOP encapsulation. In other words, every variable from the library would have to go to the state.

Edit: That is if we want the locking for the states, but my original idea was that independent states don't need locking because there is nothing shared between them, and they are simply not suitable to be used across threads. In other words, a not thread-safe API meant to be used with different instances in different threads which would be then a thread-safe usage pattern.
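
The "independent states, no locking" idea could look roughly like the sketch below. Everything here is hypothetical and not part of GRASS: the `rng48_ctx` type, the `rng48_seed`/`rng48_next` functions, and the choice to seed worker `i` with `base_seed + i` are assumptions made only to illustrate the usage pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_WORKERS 16

/* Hypothetical context: all generator state lives here, nothing is global. */
struct rng48_ctx {
    uint64_t state; /* 48-bit linear congruential state */
};

void rng48_seed(struct rng48_ctx *ctx, uint32_t seedval)
{
    ctx->state = ((uint64_t)seedval << 16) | 0x330EULL;
}

/* Not thread-safe on purpose: a context must not be shared between threads. */
long rng48_next(struct rng48_ctx *ctx)
{
    ctx->state = (0x5DEECE66DULL * ctx->state + 0xBULL) & ((1ULL << 48) - 1);
    return (long)(ctx->state >> 17);
}

struct worker {
    struct rng48_ctx rng; /* owned by exactly one thread */
    long first;           /* first value the worker drew */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    w->first = rng48_next(&w->rng);
    return NULL;
}

/* Run n workers, worker i seeded with base_seed + i; store each first draw. */
void run_workers(uint32_t base_seed, int n, long *first)
{
    struct worker workers[MAX_WORKERS];
    pthread_t tid[MAX_WORKERS];
    int i, rc;

    assert(n > 0 && n <= MAX_WORKERS);
    for (i = 0; i < n; i++) {
        rng48_seed(&workers[i].rng, base_seed + (uint32_t)i);
        rc = pthread_create(&tid[i], NULL, worker_main, &workers[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        first[i] = workers[i].first;
    }
}
```

No mutex is needed because no state is shared, and the result for a given worker depends only on `base_seed` and its index, not on thread scheduling. How such per-thread (or per-row) seeds should be chosen within one raster is exactly the open design question raised in the comment.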

@wenzeslaus
Member

> I haven't tested but locked part is small and should be fast (just update, no IO), thus I would expect impact on speed to be negligible.

No IO, but it is called for every cell in r.mapcalc, no? Still better slow with overall faster code than a segfault, and rand() is already much slower than let's say a constant (I did not test recently). If it is slow, we could use a solution with an explicit state (context) which may allow us to avoid locking, while having a separate locking version of the API.

multiple threads

As random number generation is guarded by a mutex, it makes generation
a serial operation – always leading to the same sequence of numbers
@github-actions github-actions bot added the Python and tests labels Oct 20, 2025
@marisn
Contributor Author

marisn commented Oct 20, 2025

This is the worst case scenario – code calls random number generation and discards produced numbers (thread creation + locking overhead):

Total random calls per run: 100,000,000
Runs per thread count: 5

Threads  Avg. Time (s)  Overhead vs. 1 Thread
1        0.8202         +0.0000s (+0.00%)
2        3.3390         +2.5187s (+307.07%)
4        3.2877         +2.4674s (+300.81%)
6        3.5505         +2.7303s (+332.86%)
8        4.3752         +3.5549s (+433.40%)
10       4.9465         +4.1263s (+503.05%)
12       5.7690         +4.9488s (+603.32%)
16       6.2315         +5.4113s (+659.71%)

@marisn
Contributor Author

marisn commented Oct 20, 2025

> No IO, but it is called for every cell in r.mapcalc, no? Still better slow with overall faster code than a segfault, and rand() is already much slower than let's say a constant (I did not test recently). If it is slow, we could use a solution with an explicit state (context) which may allow us to avoid locking, while having a separate locking version of the API.

RN generation now is always serial – outcome order is guaranteed. Consumption is another topic. I lean towards just stating that for fully reproducible outcome one should set parameters for single thread execution.
Generally it is out of scope of this PR and should be discussed separately with more people involved.

@wenzeslaus
Member

A random sequence can contain duplicates, so your test is failing, but only because it imposes a requirement (no duplicates) that you don't need to have and can't have.

AssertionError: 10000 != 9996 : Duplicate values found in multi-threaded run, indicating a race condition.

I think only the comparison of sorted lists gives you a true idea of what happened.
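
The sorted-list comparison could be sketched in C roughly as follows. This is not the Python test from the PR: `demo_seed`, `demo_draw`, and `sorted_runs_match` are hypothetical names, and the generator is a simplified 64-bit stand-in for a mutex-guarded one. The idea is that a serialized generator hands out the same numbers in a multi-threaded run as in a single-threaded one, only to different callers, so the sorted outputs must be identical while duplicates within a run remain allowed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N_THREADS 4
#define PER_THREAD 2500
#define N_TOTAL (N_THREADS * PER_THREAD)

static uint64_t demo_state;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static void demo_seed(uint32_t seedval)
{
    pthread_mutex_lock(&demo_lock);
    demo_state = ((uint64_t)seedval << 16) | 0x330EULL;
    pthread_mutex_unlock(&demo_lock);
}

/* Advance and read under one lock, so draws are fully serialized. */
static long demo_draw(void)
{
    long v;

    pthread_mutex_lock(&demo_lock);
    demo_state = (0x5DEECE66DULL * demo_state + 0xBULL) & ((1ULL << 48) - 1);
    v = (long)(demo_state >> 17);
    pthread_mutex_unlock(&demo_lock);
    return v;
}

/* Each thread fills its own slice, so the output array needs no locking. */
static void *fill_slice(void *arg)
{
    long *slice = arg;

    for (int i = 0; i < PER_THREAD; i++)
        slice[i] = demo_draw();
    return NULL;
}

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;

    return (x > y) - (x < y);
}

/* Returns 1 if the sorted multi-threaded values equal the sorted
 * single-threaded values for the same seed, 0 otherwise. */
int sorted_runs_match(uint32_t seedval)
{
    static long single[N_TOTAL], multi[N_TOTAL];
    pthread_t tid[N_THREADS];
    int rc;

    demo_seed(seedval);
    for (int i = 0; i < N_TOTAL; i++)
        single[i] = demo_draw();

    demo_seed(seedval); /* reseed once, from the calling thread only */
    for (int t = 0; t < N_THREADS; t++) {
        rc = pthread_create(&tid[t], NULL, fill_slice, multi + t * PER_THREAD);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    qsort(single, N_TOTAL, sizeof(long), cmp_long);
    qsort(multi, N_TOTAL, sizeof(long), cmp_long);
    return memcmp(single, multi, sizeof(single)) == 0;
}
```

A lost update or a value read twice by two threads would make the two sorted arrays differ, which is the kind of race the duplicate check was trying to detect.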

@wenzeslaus
Member

> ...outcome order is guaranteed.

From library perspective, but never in the actual usage.

> Consumption is another topic.

So I would say just don't overstate the outcome order guarantees in the doc.

> I lean towards just stating that for fully reproducible outcome one should set parameters for single thread execution.

That would be my choice at this point as well.

> Generally it is out of scope of this PR and should be discussed separately with more people involved.

I agree.

"""Test of gis library lrand48 PRNG thread-safety
@author Maris Nartiss
@author Gemini
Member

No need. Human author is, or at least should be, the owner of the code in ways which matter to us. Approach to copyright differs in between countries and I would not even venture into it.

If there is important provenance info, it should go to the commit message (and/or PR description).

We will not be inviting Gemini to a code sprint or to join the development team, nor expecting more insights into code it generated compared to any other code.

Comment on lines +40 to +46
# --- Define ctypes function signatures ---
G_srand48.argtypes = [ctypes.c_long]
G_srand48.restype = None

G_lrand48.argtypes = []
G_lrand48.restype = ctypes.c_long

Member

Isn't this ctypesgen's job? Isn't that already done?

G_lrand48.argtypes = []
G_lrand48.restype = ctypes.c_long

# --- 1. Single-threaded execution ---
Member

While often seen in generated code, I think our experience is that this kind of ASCII comment graphics is hard to maintain and inherently inconsistent (thinking about old code we cleaned in GRASS).

Comment on lines +7 to +13
find_package(Threads)

set(grass_gis_DEFS "-DGRASS_VERSION_DATE=\"${GRASS_VERSION_DATE}\"")
if(Threads_FOUND)
  list(APPEND grass_gis_DEFS "HAVE_PTHREAD")
endif()

Contributor

All this should be replaced by the addition of check_target(Threads::Threads HAVE_PTHREAD) to this file:

check_target(PROJ::proj HAVE_PROJ_H)
