We should identify additional metrics that can be used to reason about server "capacity" #2807

@jhiemstrawisc

@patrickbrophy and @brianaydemir, feel free to chime in if you think I'm misrepresenting any of our conversation.

Right now, the Director's only real notion of "is server X at/nearing its capacity to serve clients" comes from checking the server's "status" field, which can take values like "degraded" or "critical".

At the moment, a server's "degraded" status is only triggered when the current value of the xrootd_server_active_io metric reaches some threshold relative to the limit configured via {Cache,Origin}.Concurrency. But we've started to identify other ways in which servers can reach capacity. For example, a Cache/Origin is at/nearing capacity when:

  • XRootD is unable to spawn additional threads because it has hit the limit configured with Xrootd.MaxThreads
  • the server's total available bandwidth is in use, such that additional clients result in reduced bandwidth per client
  • the server is running out of memory (not sure we've observed this, but it's a useful example)
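
For reference, the existing active-I/O check boils down to comparing a gauge against a fraction of a configured limit. Here's a minimal Go sketch of that shape. The function name `ioStatus`, the 0.9 ratio, and the status strings are assumptions for illustration, not Pelican's actual code; only the xrootd_server_active_io metric and the {Cache,Origin}.Concurrency limit come from the discussion above.

```go
package main

import "fmt"

// ioStatus is a hypothetical helper: it reports "degraded" once the active
// I/O gauge reaches thresholdRatio of the configured concurrency limit.
// A non-positive limit is treated as "no limit configured".
func ioStatus(activeIO, concurrencyLimit int, thresholdRatio float64) string {
	if concurrencyLimit <= 0 {
		return "ok"
	}
	if float64(activeIO) >= thresholdRatio*float64(concurrencyLimit) {
		return "degraded"
	}
	return "ok"
}

func main() {
	// 0.9 is an assumed ratio, chosen only for this example.
	fmt.Println(ioStatus(50, 100, 0.9))
	fmt.Println(ioStatus(95, 100, 0.9))
}
```

Each of the other examples in the list would need its own usage value and limit, but the comparison itself could plausibly stay the same.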

A key observation is that there are multiple "axes" of capacity, and that they can be observed independently even if some of them are related. Moving forward, we should try to identify as many of these as possible so we can start using them as triggers for degraded or critical statuses.

I'm spitballing here, but I think any metric is a good candidate when it meets these criteria:

  • it directly correlates with the usage of a limited resource that, when exhausted, causes "poor performance" (@brianaydemir, I think explaining what "poor performance" encompasses means describing the failure mode. For example, you pointed out that hitting a thread limit causes instant failures, whereas bandwidth issues get gradually worse. We can probably use this to decide whether something constitutes a critical or a degraded status.)
  • we can determine a rough upper bound on the amount of that resource that's available to the server, either programmatically or via admin-supplied configuration (and we can give the admin guidance on how to set a limit if it's their responsibility)
  • we can determine when the server reaches some usage threshold for the resource relative to the total available
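
To make the three criteria concrete, here's a rough Go sketch of how one capacity "axis" might be modeled and how several axes could be combined into a single status. Every name here (`Severity`, `CapacityAxis`, `Evaluate`, `WorstOf`) and all of the numbers are hypothetical and only meant to illustrate the idea. In particular, "take the worst status across axes" is an assumed aggregation rule, not something we've agreed on.

```go
package main

import "fmt"

// Severity is ordered so that a larger value is a worse status.
type Severity int

const (
	OK Severity = iota
	Degraded
	Critical
)

func (s Severity) String() string {
	return [...]string{"ok", "degraded", "critical"}[s]
}

// CapacityAxis is a hypothetical description of one limited resource.
type CapacityAxis struct {
	Name      string
	Usage     float64  // current usage, in the resource's own units
	Limit     float64  // rough upper bound (detected or admin-supplied)
	Threshold float64  // fraction of Limit at which the axis trips
	OnTrip    Severity // Critical for instant failures, Degraded for gradual ones
}

// Evaluate returns OnTrip once Usage reaches Threshold*Limit, else OK.
// An axis with no known limit cannot be evaluated and reports OK.
func (a CapacityAxis) Evaluate() Severity {
	if a.Limit <= 0 {
		return OK
	}
	if a.Usage/a.Limit >= a.Threshold {
		return a.OnTrip
	}
	return OK
}

// WorstOf evaluates every axis independently and returns the worst result.
func WorstOf(axes []CapacityAxis) Severity {
	worst := OK
	for _, a := range axes {
		if s := a.Evaluate(); s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	// Illustrative values only: a thread pool at its limit and a link at 60%.
	axes := []CapacityAxis{
		{Name: "threads", Usage: 2048, Limit: 2048, Threshold: 1.0, OnTrip: Critical},
		{Name: "bandwidth", Usage: 6.0, Limit: 10.0, Threshold: 0.8, OnTrip: Degraded},
	}
	fmt.Println(WorstOf(axes))
}
```

The `OnTrip` field is where the failure-mode distinction would live: an axis that fails instantly when exhausted maps to critical, while one that gets gradually worse maps to degraded.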

Is there anything I'm missing here?

Finally, @patrickbrophy pointed out we should revisit the corpus of XRootD monitoring statistics available to us that we don't currently plumb into Pelican to see whether any of them might be useful for this exercise.

Labels: cache, director, enhancement, monitoring, origin
