Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job.status and job.log returns error for certain domain name pattern #150

Open
tedhtchang opened this issue Jun 8, 2023 · 4 comments
Open

Comments

@tedhtchang
Copy link
Member

tedhtchang commented Jun 8, 2023

The job.status() and job.logs() Are failing, they're showing

gaierror                                  Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:
....
ConnectionError: Failed to connect to Ray at address: http://ray/.

Cause:
In torchx The app_handle is a string. On my cluster it looks like
ray://torchx/ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud-mnisttest-trwmgjb2s041jc

This method uses a regx that not parse the app_id(mnisttest-trwmgjb2s041jc) and ray dashboard(ray-dashboard-mnisttest-default.tedchang-codeflare-test-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.us-south.containers.appdomain.cloud) property
":\d+-|.com-|.org-|.net-|.gov-|.io-"

/opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py

330         def _parse_app_id(self, app_id: str) -> Tuple[str, str]:
331             # find index of '-' in the first :\d+-
332             m = re.search(r":\d+-|.com-|.org-|.net-|.gov-|.io-" , app_id)
333             if m:
334                 sep = m.span()[1]
335                 addr = app_id[: sep-1]
336                 app_id = app_id[sep:]
337                 return addr, app_id
338 ~~~~~~~~~~~~~~~~
339             addr, _, app_id = app_id.partition("-")
340             return addr, app_id

something like this worked for my cluster but may not be robust
":\d+-|.com-|.org-|.net-|.gov-|.io-|.cloud-"

@tedhtchang tedhtchang assigned KPostOffice and unassigned KPostOffice Jun 8, 2023
@tedhtchang
Copy link
Member Author

tedhtchang commented Jun 8, 2023

@KPostOffice @jbusche @MichaelClifford Our _parse_app_id looks different from the one in the upstream. Is it a fork which we can make it work for now ?

@KPostOffice
Copy link
Collaborator

The one upstream was more broken than the one in our custom version. We can definitely make changes, I'm not sure how to make it more generic exactly. Maybe something including an reverse parse for . since I think the name cannot have periods in it, so the from last period to the - must indicate the top level domain.

@tedhtchang
Copy link
Member Author

@KPostOffice Reverse parsing sounds like a great idea!. Where is the custom torchx repo located ? I will see if can create a PR there.

@MichaelClifford
Copy link
Collaborator

Thanks for pointing this out @tedhtchang. I recall we ran into this issue before, and we could not determine a regex that provide stable results, so we added a list of reasonable suffixes.

Likely worth revisiting to see if we can implement a general solution.

In the mean time a quick fix would be just to add .cloud- to our fork of torchx (https://github.com/project-codeflare/torchx/tree/OCP)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants