-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configure the wait time for checkpoint safety #18489
Conversation
runtime/criusupport/criusupport.cpp
Outdated
@@ -844,7 +851,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env, | |||
for (UDATA i = 0; (0 != notSafeToCheckpoint) && (i <= maxRetries); i++) { | |||
releaseSafeOrExcusiveVMAccess(currentThread, vmFuncs, safePoint); | |||
vmFuncs->internalExitVMToJNI(currentThread); | |||
omrthread_nanosleep(10000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observe that the thread is unable to sleep for less than 1000000 ns (1 ms) on a Linux x86_64 machine.
@tajila requesting your review. |
d833405
to
648fad4
Compare
@tajila requesting your review. |
runtime/criusupport/criusupport.cpp
Outdated
@@ -690,6 +691,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env, | |||
UDATA success = 0; | |||
bool safePoint = J9_ARE_ANY_BITS_SET(vm->extendedRuntimeFlags, J9_EXTENDED_RUNTIME_OSR_SAFE_POINT); | |||
UDATA maxRetries = vm->checkpointState.maxRetryForNotCheckpointSafe; | |||
U_64 sleepNanoseconds = vm->checkpointState.sleepNanosecondsForNotCheckpointSafe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think we need this complexity. By toggling max retry and nanosleep we can express the flexibility needed for this.
runtime/criusupport/criusupport.cpp
Outdated
@@ -844,7 +851,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env, | |||
for (UDATA i = 0; (0 != notSafeToCheckpoint) && (i <= maxRetries); i++) { | |||
releaseSafeOrExcusiveVMAccess(currentThread, vmFuncs, safePoint); | |||
vmFuncs->internalExitVMToJNI(currentThread); | |||
omrthread_nanosleep(10000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use omrthread_sleep
and set it to a parameter for which the default is 10ms
runtime/vm/jvminit.c
Outdated
@@ -3926,6 +3926,9 @@ processVMArgsFromFirstToLast(J9JavaVM * vm) | |||
vm->checkpointState.lastRestoreTimeMillis = -1; | |||
/* Its unclear if we need an option for this, so we can keep the init here for the time being */ | |||
vm->checkpointState.maxRetryForNotCheckpointSafe = 100; | |||
vm->checkpointState.sleepNanosecondsForNotCheckpointSafe = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you only need sleepNanosecondsForNotCheckpointSafe
. But this can be milliseconds instead.
648fad4
to
9da089f
Compare
I have simplified the code that increases the wait time for checkpoint safety from 1 ms to 2 s by sleeping for a fixed time per checkpoint safety check. |
You can add two options
And then we can increase those values (higher than default) for this test |
What would be the default, lower bound, and upper bound of the Would throwing an error be appropriate if the |
0 <= maxRetryForNotCheckpointSafe < ∞ 1 <= sleepMillisecondsForNotCheckpointSafe < ∞
Ya we can throw the standard errors. Note, we won't be documenting these options. Its not something we intend users will need. The default should be enough. |
A stacktrace of
|
9da089f
to
467be28
Compare
467be28
to
5b8e8e7
Compare
@tajila I added options to configure the checkpoint safety wait time with Re-requesting your review. |
Can you do some grinders on x, p and z to verify that the test will not fail. Do 50x for each and try to spread them out on different machines. |
The grinder on x passed, the cause of the failures for the grinders on all of the p machines and on some of the z machines seems to be due to the setup of CRIU on the machines, and the grinders do not indicate failures that occur due to a timeout while code is being executed in @NotCheckpointSafe frames or failures that occur due to the test timing out by the race condition. x: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/36552 |
b7e395e
to
157c145
Compare
okay, enable the test on X only for now. |
Increase the wait time for checkpoint safety from 1 ms to 2 s for threads initializing a class, and fix the failure in the re-enabled MethodTypeDeadlockTest that occurs due to an insufficient wait time for checkpoint safety for threads initializing a class. Issue: eclipse-openj9#15806 Signed-off-by: Amarpreet Singh <[email protected]>
Simplify the code that increases the wait time for checkpoint safety from 1 ms to 2 s by sleeping for a fixed time per checkpoint safety check. Issue: eclipse-openj9#15806 Signed-off-by: Amarpreet Singh <[email protected]>
Add -XX:maxRetryForNotCheckpointSafe= and -XX:sleepMillisecondsForNotCheckpointSafe= to control the checkpoint safety wait time for MethodTypeDeadlockTest. Issue: eclipse-openj9#15806 Signed-off-by: Amarpreet Singh <[email protected]>
Fix the race condition in MethodTypeDeadlockTest for the shared resource testResult.lockStatus by modifying its type from volatile int to AtomicInteger, and avoid the test from timing out when the main thread is blocked while waiting for testResult.lockStatus to reflect the count of the worker threads that were started by the main thread. Issue: eclipse-openj9#15806 Signed-off-by: Amarpreet Singh <[email protected]>
Disable criu_nonPortableRestore tests on s390x due to the limited number of s390x machines with proper CRIU setup. Issue: eclipse-openj9#15806 Signed-off-by: Amarpreet Singh <[email protected]>
157c145
to
daf7103
Compare
I have disabled the test on z, it is now enabled only for x. |
jenkins test sanity alinux64 jdk17 |
Configure the checkpoint safety wait time with
-XX:maxRetryForNotCheckpointSafe= and
-XX:sleepMillisecondsForNotCheckpointSafe=, and increase
the default checkpoint safety wait time from 1 ms to 1
s.
Configure the checkpoint safety wait time for the
re-enabled MethodTypeDeadlockTest to be 2 s to allow the
threads to initialize a class, and fix the failure in
the MethodTypeDeadlockTest that occurs due to an
insufficient checkpoint safety wait time for the threads
initializing a class.
Fix the race condition in MethodTypeDeadlockTest for the
shared resource testResult.lockStatus by modifying its
type from volatile int to AtomicInteger, and avoid the
test from timing out when the main thread is blocked
while waiting for testResult.lockStatus to reflect the
count of the worker threads that were started by the
main thread.
Issue: #15806
Signed-off-by: Amarpreet Singh [email protected]