
Models with the same inference_pool_gid still create a new InferencePool and spawn N parallel workers #2274

@dinispeixoto

Description


PR #2040 introduced a great feature allowing users to define custom inference pools per model, instead of sharing a single pool across models with different loads.

However, there's a small bug in mlserver/parallel/registry.py when the environment is not provided (i.e., no environment tarball is available):

if not env_tarball:
    return (
        self._pools.setdefault(
            inference_pool_gid,
            InferencePool(self._settings, on_worker_stop=self._on_worker_stop),
        )
        if inference_pool_gid
        else self._default_pool
    )

If inference_pool_gid already exists in self._pools, a new InferencePool instance is still created (and thus spawns N new worker processes) before setdefault checks for an existing key.

From the InferencePool constructor:

def __init__(
    self,
    settings: Settings,
    env: Optional[Environment] = None,
    on_worker_stop: List[InferencePoolHook] = [],
):
    configure_inference_pool(settings)

    ...
    for _ in range(self._settings.parallel_workers): # spawning Python processes
        worker = _spawn_worker(self._settings, self._responses, self._env)
        self._workers[worker.pid] = worker  # type: ignore

This leads to redundant process creation even when the pool already exists.
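
One possible fix, sketched minimally using the names from the snippets above, is to look the gid up before constructing the pool, so that InferencePool is only instantiated (and workers only spawned) when the gid is new:

if not env_tarball:
    if not inference_pool_gid:
        return self._default_pool
    # Construct the pool, and spawn its workers, only when the gid
    # has not been registered yet.
    if inference_pool_gid not in self._pools:
        self._pools[inference_pool_gid] = InferencePool(
            self._settings, on_worker_stop=self._on_worker_stop
        )
    return self._pools[inference_pool_gid]

This returns the same values as the original expression but never evaluates InferencePool(...) for a gid that already has a pool.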

Steps to reproduce

  1. Create 2 ML models with the same inference_pool_gid, e.g. in model-settings.json
{
    "name": "foo",
    "implementation": "...",
    "parameters": {
        "inference_pool_gid": "bar"
    }
}
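A second model sharing the same pool (the name baz here is just illustrative) only needs to repeat the parameter:
{
    "name": "baz",
    "implementation": "...",
    "parameters": {
        "inference_pool_gid": "bar"
    }
}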
  2. Set parallel_workers in settings.json to 2
{
    "debug": true,
    "use_structured_logging": true,
    "parallel_workers": 2
}
  3. Start MLServer and check the number of worker processes:
ps -ef | grep spawn_main | grep python | wc -l

Expected: 4 processes (2 for the default pool + 2 for the custom bar pool)
Observed: 6 processes, because InferencePool is instantiated twice for the bar gid: setdefault discards the second instance, but the 2 workers it spawned stay alive.

This can also be demonstrated in a Python shell:

>>> class Foo:
...     def __init__(self):
...             print("hello")
... 
>>> bar = {}
>>> bar.setdefault("1", Foo())
hello
<__main__.Foo object at 0x104916da0>
>>> bar.setdefault("1", Foo())
hello
<__main__.Foo object at 0x104916da0>
>>> 

setdefault() still calls Foo() twice because the argument is evaluated before checking if the key exists.
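
For comparison, a plain membership check never evaluates the constructor when the key already exists:

>>> if "1" not in bar:
...     bar["1"] = Foo()
... 
>>> 

Nothing is printed here because Foo() is only called for a missing key.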

Impact

  • Orphaned worker processes are spawned unnecessarily.
  • These processes are never used for inference but remain alive.
  • This can lead to high memory usage and degraded performance in production environments with multiple models or high worker counts.
