It would be helpful to have more documentation on recommended deployment practices for production settings.
For example:
- Should the parameter server / lighthouse server be colocated for performance? Is a high-speed interconnect between the lighthouse server and the worker nodes necessary?
- Can a single lighthouse server be shared amongst multiple training jobs? If so, how are the instances/jobs distinguished from each other?
- What minimum specs are recommended for the lighthouse / parameter servers? How do these requirements scale with model size?