We should identify additional metrics that can be used to reason about server "capacity" #2807

@patrickbrophy and @brianaydemir, feel free to chime in if you think I'm misrepresenting any of our conversation.

Right now, the Director's only real notion of "is server X at/nearing its capacity to serve clients" comes from checking the server's "status" field, which can take values like "degraded" or "critical".

At the moment, a server's "degraded" status is triggered only when the current value of the `xrootd_server_active_io` metric reaches some threshold relative to the limit configured via `{Cache,Origin}.Concurrency`. But we've started to identify other ways in which servers can reach capacity. For example, a Cache/Origin is at/nearing capacity when:
- XRootD is unable to spawn additional threads because it has hit the limit configured with `Xrootd.MaxThreads`
- the server's total available bandwidth is in use, so each additional client reduces the bandwidth available per client
- the server is running out of memory (not sure we've observed this, but it's a useful example)
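
To make the current single-axis behavior concrete, here is a minimal sketch of a threshold check like the one described above. Everything in it is hypothetical: the `Status` type, the `ioStatus` function, and the `fraction` parameter are illustrative names, not Pelican's actual code, and the real threshold value isn't stated here.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the server's "status" field.
type Status string

const (
	StatusOK       Status = "ok"
	StatusDegraded Status = "degraded"
)

// ioStatus reports "degraded" once activeIO (the current value of
// xrootd_server_active_io) reaches fraction*limit, where limit is the
// value configured via {Cache,Origin}.Concurrency.
// Assumption: limit <= 0 means "no limit configured", so never degraded.
func ioStatus(activeIO, limit int, fraction float64) Status {
	if limit <= 0 {
		return StatusOK
	}
	if float64(activeIO) >= fraction*float64(limit) {
		return StatusDegraded
	}
	return StatusOK
}

func main() {
	fmt.Println(ioStatus(50, 100, 0.9)) // well below the threshold
	fmt.Println(ioStatus(95, 100, 0.9)) // past the threshold
}
```

The point of the sketch is that only one resource feeds the decision today; thread, bandwidth, and memory pressure never show up in it.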

A key observation is that there are multiple "axes" of capacity, and that they can be observed independently even if some of them are related. Moving forward, we should try to identify as many of these as possible to start using them as triggers for degraded or critical statuses.

I'm spitballing here, but I think any metric is a good candidate when it meets these criteria:
- it correlates directly with the usage of a limited resource that, when exhausted, causes "poor performance" (@brianaydemir, I think explaining what "poor performance" encompasses means describing the failure mode. For example, you pointed out that hitting a thread limit causes instant failures, whereas bandwidth issues get gradually worse. We can probably use this to decide whether something constitutes a critical or a degraded status.)
- we can determine a rough upper bound on the amount of that resource that's available to the server, either programmatically or via admin-supplied configuration (and we can provide the admin guidance for how to set a limit if it's their responsibility).
- we can determine when the server reaches some usage threshold for the resource relative to the total available
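
As a rough sketch of how the three criteria above could combine, each capacity axis might carry its usage, its upper bound, a threshold, and a note about its failure mode. All names here (`Axis`, `HardFail`, `Overall`) are hypothetical, and mapping "instant failure" axes to critical and "gradual" axes to degraded is only one possible reading of the failure-mode idea, not a settled design.

```go
package main

import "fmt"

// Status is ordered so that a larger value means a worse state.
type Status int

const (
	OK Status = iota
	Degraded
	Critical
)

func (s Status) String() string {
	return [...]string{"ok", "degraded", "critical"}[s]
}

// Axis is one independently observed capacity dimension (hypothetical).
type Axis struct {
	Name      string
	Used      float64 // current usage of the limited resource
	Limit     float64 // rough upper bound, detected or admin-supplied
	Threshold float64 // fraction of Limit at which the axis trips
	HardFail  bool    // true if exhaustion causes instant failures
}

// Status evaluates a single axis. Assumption: an axis with no known
// limit (Limit <= 0) cannot be judged and is reported as OK.
func (a Axis) Status() Status {
	if a.Limit <= 0 || a.Used < a.Threshold*a.Limit {
		return OK
	}
	if a.HardFail {
		return Critical
	}
	return Degraded
}

// Overall returns the worst status across all axes.
func Overall(axes []Axis) Status {
	worst := OK
	for _, a := range axes {
		if s := a.Status(); s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	axes := []Axis{
		{Name: "threads", Used: 400, Limit: 1000, Threshold: 0.95, HardFail: true},
		{Name: "bandwidth", Used: 9.5, Limit: 10, Threshold: 0.8, HardFail: false},
	}
	fmt.Println(Overall(axes))
}
```

Taking the worst status across axes keeps each axis independently observable, which matches the observation that the axes can be related without being the same thing.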

Is there anything I'm missing here?

Finally, @patrickbrophy pointed out we should revisit the corpus of XRootD monitoring statistics available to us that we don't currently plumb into Pelican to see whether any of them might be useful for this exercise.
