
Commit f0927d8

Clean up YAML after job completion, add example configs for dry_run and cifar benchmarks
1 parent 9833fb4 commit f0927d8

File tree: 7 files changed (+118, -15 lines)

benchmark/configs/cifar_cpu/cifar_cpu_ctnr.yml renamed to benchmark/configs/cifar_cpu/cifar_cpu_docker.yml
4 additions, 4 deletions

@@ -1,8 +1,8 @@
-# Configuration file of FAR training experiment using Aggregator & Executor containers
+# Configuration file of FAR training experiment using Aggregator & Executor containers and docker for container deployment
 
 # ========== Container configuration ==========
 # whether to use container deployment
-use_container: True
+use_container: docker
 
 # containers need port-mapping to communicate with host machine
 # E.g., 1 aggregator and 2 executor, ports: [Aggr, Exec1, Exec2]
@@ -47,15 +47,15 @@ setup_commands:
 
 # We use fixed paths in job_conf as they will be accessed inside containers
 job_conf:
-    - job_name: cifar_ctnr                  # Generate logs under this folder: log_path/job_name/time_stamp
+    - job_name: cifar_docker                # Generate logs under this folder: log_path/job_name/time_stamp
     - log_path: /FedScale/benchmark         # Path of log files
     - num_participants: 4                   # Number of participants per round, we use K=100 in our paper, large K will be much slower
     - data_set: cifar10                     # Dataset: openImg, google_speech, stackoverflow
     - data_dir: /FedScale/benchmark/dataset/data/    # Path of the dataset
     - model: shufflenet_v2_x2_0             # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
 #    - model_zoo: fedscale-zoo              # Default zoo (torchcv) uses the pytorchvision zoo, which can not support small images well
     - eval_interval: 10                     # How many rounds to run a testing on the testing set
-    - rounds: 20                            # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
+    - rounds: 21                            # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
     - filter_less: 0                        # Remove clients w/ less than 21 samples
     - num_loaders: 2
     - local_steps: 20
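This rename changes `use_container` from a boolean to a string naming the deployment backend. A minimal sketch of how a driver might dispatch on the new value (the function and the legacy-boolean handling are hypothetical illustrations, not FedScale's actual API):

```python
def select_backend(use_container):
    """Map the `use_container` config value to a deployment backend name.

    Hypothetical dispatch illustrating the config change from the old
    boolean form (`use_container: True`) to the new string values.
    """
    if use_container == "docker":
        return "docker"
    elif use_container == "k8s":
        return "k8s"
    elif use_container in (True, "True"):
        # Legacy boolean form: assumed here to mean docker deployment.
        return "docker"
    return "none"  # run natively without containers
```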
New file: 49 additions, 0 deletions

@@ -0,0 +1,49 @@
+# Configuration file of FAR training experiment using Aggregator & Executor containers and k8s for container deployment
+
+# ========== Container configuration ==========
+# whether to use container deployment
+use_container: k8s
+
+# containers need a data-path mount to facilitate dataset reuse
+# We assume the same data-path is used on all host machines
+data_path: $FEDSCALE_HOME/benchmark
+
+# ========== Cluster configuration ==========
+# k8s-specific
+# number of aggregators, right now we only support a single aggregator
+# placeholder for supporting hierarchical aggregator in the future
+num_aggregators: 1
+
+# k8s-specific
+# number of executors
+num_executors: 2
+
+auth:
+    ssh_user: ""
+    ssh_private_key: ~/.ssh/id_rsa
+
+# cmd to run before we can indeed run FAR (in order)
+setup_commands:
+
+
+# ========== Additional job configuration ==========
+# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found
+
+# We use fixed paths in job_conf as they will be accessed inside containers
+job_conf:
+    - job_name: cifar_k8s                   # Generate logs under this folder: log_path/job_name/time_stamp
+    - log_path: /FedScale/benchmark         # Path of log files
+    - num_participants: 4                   # Number of participants per round, we use K=100 in our paper, large K will be much slower
+    - data_set: cifar10                     # Dataset: openImg, google_speech, stackoverflow
+    - data_dir: /FedScale/benchmark/dataset/data/    # Path of the dataset
+    - model: shufflenet_v2_x2_0             # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
+#    - model_zoo: fedscale-zoo              # Default zoo (torchcv) uses the pytorchvision zoo, which can not support small images well
+    - eval_interval: 10                     # How many rounds to run a testing on the testing set
+    - rounds: 21                            # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
+    - filter_less: 0                        # Remove clients w/ less than 21 samples
+    - num_loaders: 2
+    - local_steps: 20
+    - learning_rate: 0.05
+    - batch_size: 32
+    - test_bsz: 32
+    - use_cuda: False
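Note that `job_conf` is a YAML list of single-key mappings, so after parsing, a consumer has to merge the entries into one flat dict before looking up parameters. A small sketch of that flattening step, using a hand-written dict standing in for the parsed config above (this is an illustration, not the project's actual `config_parser.py` logic):

```python
# A k8s config like the one above, after YAML parsing, looks like this:
conf = {
    "use_container": "k8s",
    "num_aggregators": 1,
    "num_executors": 2,
    "job_conf": [
        {"job_name": "cifar_k8s"},
        {"num_participants": 4},
        {"rounds": 21},
    ],
}

def flatten_job_conf(conf):
    """Merge the list of single-key mappings under job_conf into one dict."""
    merged = {}
    for entry in conf["job_conf"]:
        merged.update(entry)
    return merged

job = flatten_job_conf(conf)
# job["job_name"] == "cifar_k8s", job["rounds"] == 21
```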

benchmark/configs/dry_run/dry_run_ctnr.yml renamed to benchmark/configs/dry_run/dry_run_docker.yml
3 additions, 3 deletions

@@ -1,8 +1,8 @@
-# Configuration file of dry run experiment using Aggregator & Executor containers
+# Configuration file of dry run experiment using Aggregator & Executor containers and docker for container deployment
 
 # ========== Container configuration ==========
 # whether to use container deployment
-use_container: True
+use_container: docker
 
 # containers need port-mapping to communicate with host machine
 # E.g., 1 aggregator and 2 executor, ports: [Aggr, Exec1, Exec2]
@@ -48,7 +48,7 @@ setup_commands:
 
 # We use fixed paths in job_conf as they will be accessed inside containers
 job_conf:
-    - job_name: dryrun_ctnr                 # Generate logs under this folder: log_path/job_name/time_stamp
+    - job_name: dryrun_docker               # Generate logs under this folder: log_path/job_name/time_stamp
     - log_path: /FedScale/benchmark         # Path of log files
     - num_participants: 4                   # Number of participants per round, we use K=100 in our paper, large K will be much slower
     - data_set: cifar10                     # Dataset: openImg, google_speech, stackoverflow
New file: 48 additions, 0 deletions

@@ -0,0 +1,48 @@
+# Configuration file of dry run experiment using Aggregator & Executor containers and k8s for container deployment
+
+# ========== Container configuration ==========
+# whether to use container deployment
+use_container: k8s
+
+# containers need a data-path mount to facilitate dataset reuse
+# We assume the same data-path is used on all host machines
+data_path: $FEDSCALE_HOME/benchmark
+
+# ========== Cluster configuration ==========
+# k8s-specific
+# number of aggregators, right now we only support a single aggregator
+# placeholder for supporting hierarchical aggregator in the future
+num_aggregators: 1
+
+# k8s-specific
+# number of executors
+num_executors: 2
+
+auth:
+    ssh_user: ""
+    ssh_private_key: ~/.ssh/id_rsa
+
+# cmd to run before we can indeed run FAR (in order)
+setup_commands:
+
+
+# ========== Additional job configuration ==========
+# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found
+
+# We use fixed paths in job_conf as they will be accessed inside containers
+job_conf:
+    - job_name: dryrun_k8s                  # Generate logs under this folder: log_path/job_name/time_stamp
+    - log_path: /FedScale/benchmark         # Path of log files
+    - num_participants: 4                   # Number of participants per round, we use K=100 in our paper, large K will be much slower
+    - data_set: cifar10                     # Dataset: openImg, google_speech, stackoverflow
+    - data_dir: /FedScale/benchmark/dataset/data/    # Path of the dataset
+    - model: resnet18                       # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
+#    - gradient_policy: yogi                # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
+    - eval_interval: 10                     # How many rounds to run a testing on the testing set
+    - rounds: 21                            # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
+    - filter_less: 0                        # Remove clients w/ less than 21 samples
+    - num_loaders: 2
+    - local_steps: 20
+    - learning_rate: 0.001
+    - batch_size: 32
+    - test_bsz: 32
+    - use_cuda: False

benchmark/configs/femnist/conf_docker.yml
1 addition, 1 deletion

@@ -48,7 +48,7 @@ setup_commands:
 
 # We use fixed paths in job_conf as they will be accessed inside containers
 job_conf:
-    - job_name: femnist_ctnr                # Generate logs under this folder: log_path/job_name/time_stamp
+    - job_name: femnist_docker              # Generate logs under this folder: log_path/job_name/time_stamp
     - log_path: /FedScale/benchmark         # Path of log files
     - num_participants: 50                  # Number of participants per round, we use K=100 in our paper, large K will be much slower
     - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow

benchmark/configs/femnist/conf_k8s.yml
3 additions, 3 deletions

@@ -6,7 +6,7 @@ use_container: k8s
 
 # containers need a data-path mount to facilitate dataset reuse
 # We assume the same data-path is used on all host machines
-data_path: /users/yilegu/benchmark
+data_path: $FEDSCALE_HOME/benchmark
 
 # ========== Cluster configuration ==========
 # k8s-specific
@@ -20,7 +20,7 @@ num_executors: 2
 
 auth:
-    ssh_user: "yilegu"
+    ssh_user: ""
     ssh_private_key: ~/.ssh/id_rsa
 
 # cmd to run before we can indeed run FAR (in order)
@@ -32,7 +32,7 @@ setup_commands:
 
 # We use fixed paths in job_conf as they will be accessed inside containers
 job_conf:
-    - job_name: femnist_ctnr                # Generate logs under this folder: log_path/job_name/time_stamp
+    - job_name: femnist_k8s                 # Generate logs under this folder: log_path/job_name/time_stamp
     - log_path: /FedScale/benchmark         # Path of log files
     - num_participants: 5                   # Number of participants per round, we use K=100 in our paper, large K will be much slower
     - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow

docker/driver.py
10 additions, 4 deletions

@@ -263,6 +263,9 @@ def terminate(job_name):
     config.load_kube_config()
     core_api = client.CoreV1Api()
     for name, meta_dict in job_meta['k8s_dict'].items():
+        if os.path.exists(meta_dict["yaml_path"]):
+            os.remove(meta_dict["yaml_path"])
+
         print(f"Shutting down container {name}...")
         core_api.delete_namespaced_pod(name, namespace="default")
@@ -328,12 +331,14 @@ def submit_to_k8s(yaml_conf):
             "data_path": yaml_conf["data_path"],
             "pod_name": exec_name
         }
+
+        exec_yaml_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'{exec_name}.yaml')
+        generate_exec_template(exec_config, exec_yaml_path)
         k8s_dict[exec_name] = {
             "type": "executor",
-            "rank_id": rank_id
+            "rank_id": rank_id,
+            "yaml_path": exec_yaml_path
         }
-        exec_yaml_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'{exec_name}.yaml')
-        generate_exec_template(exec_config, exec_yaml_path)
         print(f'Submitting executor container {exec_name} to k8s...')
         # TODO: logging?
         utils.create_from_yaml(k8s_client, exec_yaml_path, namespace="default")
@@ -355,7 +360,8 @@ def submit_to_k8s(yaml_conf):
     k8s_dict[aggr_name] = {
         "type": "aggregator",
         "ip": aggr_ip,
-        "rank_id": 0
+        "rank_id": 0,
+        "yaml_path": aggr_yaml_path
     }
 
     # TODO: refactor the code so that docker/k8s version invoke the same init function
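The driver change records the path of each generated pod YAML in `k8s_dict` at submit time so that `terminate` can delete the files once the job completes. The bookkeeping pattern can be sketched in isolation (a minimal sketch with illustrative pod names; the real driver fills the YAMLs from templates and talks to the k8s API):

```python
import os
import tempfile

def submit(pod_names, workdir):
    """Generate a YAML per pod and remember its path for later cleanup."""
    k8s_dict = {}
    for rank_id, name in enumerate(pod_names, start=1):
        yaml_path = os.path.join(workdir, f"{name}.yaml")
        with open(yaml_path, "w") as f:
            f.write(f"# pod spec for {name}\n")  # stand-in for the real template
        k8s_dict[name] = {
            "type": "executor",
            "rank_id": rank_id,
            "yaml_path": yaml_path,  # recorded so terminate() can clean up
        }
    return k8s_dict

def terminate(k8s_dict):
    """Remove the generated YAML files, tolerating already-deleted ones."""
    for name, meta in k8s_dict.items():
        if os.path.exists(meta["yaml_path"]):
            os.remove(meta["yaml_path"])

with tempfile.TemporaryDirectory() as d:
    meta = submit(["fedscale-exec-1", "fedscale-exec-2"], d)
    terminate(meta)  # leaves no generated YAML files behind
```

The `os.path.exists` guard mirrors the committed code: cleanup stays idempotent even if a file was removed by an earlier, interrupted shutdown.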
