SQSSessionCallbackScheduler race condition at close leaves threads in waiting state forever #105

Closed
wigtor opened this issue Dec 23, 2021 · 3 comments

Comments

@wigtor

wigtor commented Dec 23, 2021

I have found a race condition that occurs when you create a SQSSession and immediately close it (for example, when I only send a message and do not need a thread listening for messages).
In a production system, after 2 days and around 10000 sent messages, I had more than 100 threads in the waiting state at line 102 of the SQSSessionCallbackScheduler class (observed with the jstack command).
When a SQSSession is constructed, it creates a SQSSessionCallbackScheduler object and runs it in an ExecutorService, which starts a thread. So there are 2 threads, the main thread and the SQSSessionCallbackScheduler thread, and the execution order is:

  • The SQSSessionCallbackScheduler thread checks the "closed" variable and it is false (line 95 to line 97); a context switch occurs
    before the synchronized section at line 98.
  • The main thread calls the close() method, which changes the "closed" variable to true
  • The main thread enters the synchronized section
  • The main thread calls notify(), which is never received by the SQSSessionCallbackScheduler thread because it is not waiting yet
  • The main thread exits the synchronized section; a context switch occurs
  • The SQSSessionCallbackScheduler thread enters the synchronized section (line 98)
  • The SQSSessionCallbackScheduler thread checks callbackEntry == null (line 100)
  • The SQSSessionCallbackScheduler thread calls wait() (line 102) and is never woken up
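
The interleaving above can be forced deterministically in a small model. The following is only a sketch, not the library's code: the class `LostNotifyDemo`, the two latches that stand in for the context switches, and the `aboutToWait` flag are all hypothetical names introduced for illustration, and the loop is a simplified stand-in for lines 95 to 102.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Simplified, hypothetical model of the reported interleaving (not the library code). */
public class LostNotifyDemo {
    private final Object lock = new Object();
    private volatile boolean closed = false;
    private Object callbackEntry = null;          // stays null: nothing is ever scheduled
    private volatile boolean aboutToWait = false; // test hook only

    // Test hooks that force the "context switch" points described in the list.
    private final CountDownLatch checkedClosed = new CountDownLatch(1);
    private final CountDownLatch closeFinished = new CountDownLatch(1);

    /** Scheduler loop that checks "closed" only outside the synchronized block. */
    void runScheduler() throws InterruptedException {
        while (!closed) {                  // sees closed == false
            checkedClosed.countDown();
            closeFinished.await();         // simulated descheduling before the lock
            synchronized (lock) {
                if (callbackEntry == null) {
                    aboutToWait = true;
                    lock.wait();           // notify() already happened, nobody wakes us
                }
            }
        }
    }

    void close() {
        closed = true;
        synchronized (lock) {
            lock.notify();                 // no thread is waiting yet, so this is lost
        }
    }

    /** Forces the interleaving and returns the scheduler thread's observed state. */
    static Thread.State reproduce() throws InterruptedException {
        LostNotifyDemo demo = new LostNotifyDemo();
        Thread scheduler = new Thread(() -> {
            try {
                demo.runScheduler();
            } catch (InterruptedException ignored) {
                // interrupted by the cleanup below
            }
        });
        scheduler.setDaemon(true);
        scheduler.start();

        demo.checkedClosed.await();        // scheduler has passed the closed check
        demo.close();                      // close() runs completely in between
        demo.closeFinished.countDown();    // let the scheduler continue

        // Poll (up to 5 seconds) until the scheduler is parked in lock.wait().
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (!(demo.aboutToWait && scheduler.getState() == Thread.State.WAITING)
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State state = scheduler.getState();
        scheduler.interrupt();             // cleanup so the demo thread does not linger
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Scheduler state after close(): " + reproduce());
    }
}
```

In this model the scheduler thread is expected to end up in the WAITING state even though close() has already completed, which matches what jstack showed in production.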

I propose also checking the "closed" variable inside the synchronized section (after line 98).
I have attached an image of the code that explains it.
[Image: Screenshot_20211222_232601]
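
A minimal sketch of the proposed change, under the assumption that the scheduler loop has roughly the shape described in this comment. The class `ClosedRecheckDemo` and its latches are hypothetical and exist only to force the problematic ordering; this is not necessarily how the library itself was fixed. The wait condition is written as a loop, which the `Object.wait()` documentation recommends to guard against spurious wakeups.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the proposal: re-check "closed" while holding the lock. */
public class ClosedRecheckDemo {
    private final Object lock = new Object();
    private volatile boolean closed = false;
    private Object callbackEntry = null;   // stays null: nothing is ever scheduled

    // Test hooks that force close() to run between the outer check and the lock.
    private final CountDownLatch checkedClosed = new CountDownLatch(1);
    private final CountDownLatch closeFinished = new CountDownLatch(1);

    void runScheduler() throws InterruptedException {
        while (!closed) {                  // outer check, may see a stale "false"
            checkedClosed.countDown();
            closeFinished.await();         // simulated descheduling before the lock
            synchronized (lock) {
                // Re-check under the lock: if close() already ran, skip the wait.
                while (callbackEntry == null && !closed) {
                    lock.wait();
                }
            }
        }
    }

    void close() {
        closed = true;
        synchronized (lock) {
            lock.notify();                 // wakes the scheduler if it is already waiting
        }
    }

    /** Forces the same ordering as in the report and returns the final thread state. */
    static Thread.State reproduce() throws InterruptedException {
        ClosedRecheckDemo demo = new ClosedRecheckDemo();
        Thread scheduler = new Thread(() -> {
            try {
                demo.runScheduler();
            } catch (InterruptedException ignored) {
                // not expected in this demo
            }
        });
        scheduler.setDaemon(true);
        scheduler.start();

        demo.checkedClosed.await();        // scheduler has passed the outer check
        demo.close();                      // close() runs completely in between
        demo.closeFinished.countDown();    // let the scheduler continue

        scheduler.join(TimeUnit.SECONDS.toMillis(5));
        return scheduler.getState();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Scheduler state after close(): " + reproduce());
    }
}
```

With the extra check, the scheduler thread in this model sees closed == true once it holds the lock and exits instead of waiting. If it is already waiting when close() runs, the notify() wakes it and the loop condition ends the wait.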

@vghero

vghero commented Mar 11, 2022

That looks similar to what I saw during an AWS SQS outage a couple of days ago, where the DNS lookup of the SQS endpoint was failing with UnknownHostException. The number of threads rose very quickly, reaching 30k of those threads, until OOM errors occurred after 15 minutes. Since the DNS lookup fails quickly, I guess it is the same issue as the one explained here. I will take a deeper look.

This also looks similar: #47

@tomhunte

An official update that fixes this issue has been released today. Please let us know if the problem continues.

@ziyanli-amazon
Contributor

Please see the latest release: https://github.com/awslabs/amazon-sqs-java-messaging-lib/releases/tag/1.0.9
