
Commit e16c6fb

Reformat docs/design/archived/*.md (#2064)
1 parent 9b2fe4a commit e16c6fb

5 files changed (+315 −182 lines)

+56 −39

# Model Evaluation Design

This document describes the design of the model evaluation task for ElasticDL.

## Minimal Viable Product

### Definitions

- `Model evaluation`: Computing metrics to judge the performance of the trained
  model.
- `Evaluation worker`: The worker responsible for performing model evaluation
  tasks.
- `Multiprocessing`: Executing tasks in multiple threads in parallel on the
  same pod.

### Requirements

- There is only one evaluation worker, and it does not use multiprocessing.
- The master pod is responsible for creating the evaluation worker.
- The evaluation worker is created by the master pod together with the workers
  for training.
- Evaluation starts after a specified warm-up period and then runs on a given
  time interval. For example, we need to expose the following parameters to
  users (see the timing sketch after this list):
  - `start_delay_secs`: Start evaluating after waiting for this many seconds.
  - `throttle_secs`: Do not re-evaluate unless the last evaluation was started
    at least this many seconds ago.
- The evaluation worker fetches the latest model from the master pod.
- The model can be evaluated on a specified number of steps or batches of
  evaluation samples. If `None`, evaluation will continue until reaching the
  end of the input.
- Model evaluation metrics can be defined by users together with the model
  definition.
- The computed model evaluation metrics can be reported back to the master
  through an RPC call.
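
The `start_delay_secs` and `throttle_secs` parameters only control when an
evaluation round is allowed to start. The following is a minimal sketch of that
gating logic; the `EvaluationTrigger` class and its method names are
hypothetical and not part of the ElasticDL code base:

```python
import time


class EvaluationTrigger(object):
    """Decides whether a new evaluation round may start (sketch only)."""

    def __init__(self, start_delay_secs, throttle_secs):
        self._start_delay_secs = start_delay_secs
        self._throttle_secs = throttle_secs
        self._job_start_time = time.time()
        self._last_eval_start = None

    def should_evaluate(self, now=None):
        now = time.time() if now is None else now
        # Honor the warm-up period before the first evaluation.
        if now - self._job_start_time < self._start_delay_secs:
            return False
        # Do not re-evaluate unless the previous evaluation started at
        # least `throttle_secs` seconds ago.
        if (self._last_eval_start is not None
                and now - self._last_eval_start < self._throttle_secs):
            return False
        self._last_eval_start = now
        return True
```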

### Implementation Plan

- Implement `MasterServicer.ReportEvaluationMetrics()` and additional proto
  definitions such as `ReportEvaluationMetricsReply` and
  `ReportEvaluationMetricsRequest`.
- Extend `Worker` to support the following (a sketch of this flow follows the
  list):
  - `distributed_evaluate()`, which contains the main logic for model
    evaluation.
  - `report_task_result()`, which reports the evaluation task result (e.g. task
    id and error message) back to the master through an RPC call.
  - `report_evaluation_metrics()`, which reports the computed evaluation
    metrics (e.g. accuracy, precision, recall, etc.) back to the master through
    an RPC call.
- Add a main CLI entry point to `Worker.distributed_evaluate()` that will be
  used in `WorkerManager`.
- Extend `WorkerManager` to support the following:
  - Instantiate a separate evaluation task queue from the evaluation data
    directory.
  - Start an evaluation worker from the evaluation task queue.
- Update `master.main()` to support the model evaluation task if the user
  requests it.
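
The sketch below shows how these pieces could fit together on the worker side.
Only the names taken from the plan above (`distributed_evaluate()`,
`report_evaluation_metrics()`, `report_task_result()`,
`MasterServicer.ReportEvaluationMetrics()`) come from this design; the stub
calls, the data function, and the metric choices are assumptions made for
illustration:

```python
import tensorflow as tf


def distributed_evaluate(stub, model_fn, eval_data_fn):
    """Sketch of a worker-side evaluation loop (not the actual ElasticDL code).

    `stub` stands for a gRPC stub to the master, `model_fn` builds the
    user-defined Keras model, and `eval_data_fn` yields (features, labels)
    batches for one evaluation task; all three are assumed names.
    """
    # Fetch the latest model parameters from the master.
    model = model_fn()
    latest_variables = stub.GetModel()  # hypothetical RPC
    for var, value in zip(model.trainable_variables, latest_variables):
        var.assign(value)

    # User-defined evaluation metrics, e.g. accuracy.
    metrics = {"accuracy": tf.keras.metrics.Accuracy()}

    task = stub.GetTask()  # hypothetical RPC returning one evaluation task
    try:
        for features, labels in eval_data_fn(task):
            predictions = model(features, training=False)
            for metric in metrics.values():
                metric.update_state(labels, tf.argmax(predictions, axis=-1))
        # report_evaluation_metrics(): send results to the master via
        # MasterServicer.ReportEvaluationMetrics().
        stub.ReportEvaluationMetrics(
            {name: m.result().numpy() for name, m in metrics.items()}
        )
        # report_task_result(): tell the master the task finished cleanly.
        stub.ReportTaskResult(task_id=task.task_id, err_message="")
    except Exception as exc:
        # On failure, report the task id and the error message instead.
        stub.ReportTaskResult(task_id=task.task_id, err_message=str(exc))
```
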
## Future Development

A list of potential features we may want for model evaluation in the future:

- `num_parallel_processes`: The number of child processes used to run
  evaluation on each individual evaluation worker.
- `sample_weights`: An optional NumPy array of weights for the test samples,
  used for weighting the loss function.

## References

Some of the ideas are borrowed from the existing solutions listed below:

- [`tf.keras.models.Model.evaluate()`](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model#evaluate)
- [`tf.keras.metrics`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
- [`tf.estimator.EvalSpec`](https://www.tensorflow.org/api_docs/python/tf/estimator/EvalSpec)
- [`tf.estimator.Estimator.evaluate()`](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#evaluate)
- [`tf.estimator.train_and_evaluate()`](https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate)

docs/designs/archived/odps_support.md (+31 −17)

# Design for ODPS Data Source Support

This document describes the design for supporting the ODPS data source in
ElasticDL.

## Existing `ODPSReader` Class

The interface to read data from ODPS with the existing `ODPSReader` is defined
as follows:

````python
class ODPSReader(object):
    # ... (class body elided in this diff) ...
        pass
````

For example, if we have 5 workers in total, we can run the following in the
first worker to load the ODPS table into a Python iterator where each batch
contains 100 rows:

```python
reader = ODPSReader(...)
# ... (iterator creation elided in this diff) ...
for batch in data_iterator:
    print("Batch size %d\n. Data: %s" % (len(batch), batch))
```

## Support ODPS Data Source in ElasticDL

The current `Worker` relies heavily on `TaskDispatcher` and RecordIO, which
overlaps with the existing `ODPSReader`, so we could not use the existing
`ODPSReader` directly. For example, `Worker` does the following related steps:

1. `recordio.Scanner(task.shard_file_name, task.start, task.end - task.start)`

[...] fetch the features and labels from this batch.

Here's a list of things we need to do in order to support the ODPS data source:

1. Create an `ODPSReader` based on user-provided ODPS information such as the
   ODPS project name and credentials.
2. Implement a method to create training and evaluation shards based on the
   table name and column names instead of RecordIO data directories, and then
   pass the shards to `TaskDispatcher` (see the sketch after this list).
3. Modify `Worker` to support instantiating an `ODPSReader` in addition to the
   RecordIO reader.
4. Implement `ODPSReader.record()` for `Worker._get_batch()` to use.
   Alternatively, we can re-implement `Worker._get_batch()` so that it can get
   a whole batch of data rows when the data source is ODPS, because the current
   implementation of `Worker._get_batch()` contains a for loop that reads one
   record at a time, which is inefficient.
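
Item 2 above only needs the table size to produce shards for `TaskDispatcher`.
A minimal sketch of such shard creation, assuming a hypothetical
`create_odps_shards` helper and that the total number of rows in the table is
already known, could look like this:

```python
def create_odps_shards(table_name, num_records, records_per_task):
    """Split an ODPS table into row ranges keyed like RecordIO shards.

    This is a sketch: a real implementation would query ODPS for
    `num_records` and carry the column names along with each shard.
    """
    shards = {}
    start = 0
    while start < num_records:
        end = min(start + records_per_task, num_records)
        # Key each shard by table name and row range, mirroring how RecordIO
        # shards are keyed by file name today; the value is (start, count).
        shards["%s:%d-%d" % (table_name, start, end)] = (start, end - start)
        start = end
    return shards


# A 250-row table with 100 rows per task yields three shards.
print(create_odps_shards("my_project.my_table", 250, 100))
```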

Once the above work is done and we have a clearer picture, we could then think
about how to allow users to plug in their own custom data readers like
`ODPSReader`, so that they don't have to convert data to the RecordIO format
and can avoid the I/O overhead. This should be discussed further in the
high-level API designs.

docs/designs/archived/overview.md (+18 −6)

# Overview

ElasticDL is a framework that implements the swamp optimization meta-algorithm,
just as Apache Hadoop is a framework that implements the MapReduce parallel
programming paradigm.

To program the ElasticDL framework, programmers need to provide at least one
`nn.Module`-derived class that describes the specification of a model, just as
Hadoop programmers need to provide a class that implements the Map and Reduce
methods.

To train a model, ElasticDL needs (1) hyperparameter values and (2) the data.
Each ElasticDL job uses the same data to train one or more models, where each
model needs a set of hyperparameter values. A model specification could have
more than one set of hyperparameter values; in such a case, they are considered
multiple models.

- A job is associated with a dataset.
- A job is associated with one or more model specifications; each model
  specification is a Python class derived from torch.nn.Module.
- A model specification is associated with one or more sets of hyperparameter
  values.
- The pair of a model specification and one set of its hyperparameter values is
  a model (see the sketch after this list).
- A job includes a coordinator process and one or more bee processes.
- A bee process trains one or more models.
- The coordinator dispatches models to bees.
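
The relationships above can be summarized in a small data-structure sketch; the
`Model` and `Job` classes here are illustrative only and not part of ElasticDL:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Type

import torch.nn as nn


@dataclass
class Model:
    """A model is the pair of a model specification and one set of
    hyperparameter values."""

    spec: Type[nn.Module]  # a Python class derived from torch.nn.Module
    hyperparameters: Dict[str, float]


@dataclass
class Job:
    """A job trains one or more models on a single dataset."""

    dataset: str  # the dataset the job is associated with
    models: List[Model] = field(default_factory=list)

    def add_spec(self, spec, hyperparameter_sets):
        # A spec with N sets of hyperparameter values counts as N models.
        for hparams in hyperparameter_sets:
            self.models.append(Model(spec, hparams))
```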
