# Model Evaluation Design

This document describes the design of the model evaluation task for ElasticDL.

## Minimum Viable Product

### Definitions

- `Model evaluation`: Computing metrics to judge the performance of the trained
model.
- `Evaluation worker`: The worker responsible for performing the model
evaluation task.
- `Multiprocessing`: Executing tasks in multiple processes in parallel on the
same pod.

### Requirements

- There's only one evaluation worker, and it does not use multiprocessing.
- The master pod is responsible for creating the evaluation worker.
- The evaluation worker is created by the master pod together with the workers
for training.
- Evaluation starts after a specified warm-up period and then repeats on a
given time interval (see the scheduling sketch after this list). We need to
expose the following parameters to users:
  - `start_delay_secs`: Start evaluating after waiting for this many seconds.
  - `throttle_secs`: Do not re-evaluate unless the last evaluation was started
at least this many seconds ago.
- The evaluation worker fetches the latest model from the master pod.
- The model can be evaluated for a specified number of steps (batches of
evaluation samples). If `None`, evaluation continues until it reaches the end
of the input.
- Model evaluation metrics can be defined by users together with the model
definition.
- The computed evaluation metrics can be reported back to the master through
an RPC call.

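To make the timing behavior concrete, here is a minimal sketch of how
`start_delay_secs` and `throttle_secs` could drive the evaluation schedule.
The `evaluate_once` callable is a hypothetical stand-in for one full
evaluation pass, not part of the actual API:

```python
import time


def evaluation_loop(start_delay_secs, throttle_secs, evaluate_once):
    """Run evaluations after a warm-up delay, at most once per throttle window.

    `evaluate_once` is a hypothetical callable that fetches the latest model
    and computes metrics over the evaluation data.
    """
    # Warm-up period before the first evaluation.
    time.sleep(start_delay_secs)
    while True:
        started_at = time.time()
        evaluate_once()
        # Do not start the next evaluation unless the last one was *started*
        # at least `throttle_secs` ago.
        elapsed = time.time() - started_at
        if elapsed < throttle_secs:
            time.sleep(throttle_secs - elapsed)
```
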
### Implementation Plan

- Implement `MasterServicer.ReportEvaluationMetrics()` and additional proto
definitions such as `ReportEvaluationMetricsReply` and
`ReportEvaluationMetricsRequest` (a hypothetical sketch of these messages
follows this list).
- Extend `Worker` to support the following:
  - `distributed_evaluate()` that contains the main logic for model evaluation
(see the sketch after this list).
  - `report_task_result()` that reports the evaluation task result (e.g. task
id and error message) back to the master through an RPC call.
  - `report_evaluation_metrics()` that reports the computed evaluation metrics
(e.g. accuracy, precision, recall) back to the master through an RPC call.
- Add a main CLI entry point to `Worker.distributed_evaluate()` that will be
used in `WorkerManager`.
- Extend `WorkerManager` to support the following:
  - Instantiate a separate evaluation task queue from the evaluation data
directory.
  - Start an evaluation worker from the evaluation task queue.
  - Update `master.main()` to support the model evaluation task when the user
requests it.

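For illustration only, the new RPC and its messages might look like the
following sketch; the field names and types are assumptions, not the final
proto definitions:

```proto
// A hypothetical sketch of the evaluation-metrics RPC, not the final proto.
syntax = "proto3";

message ReportEvaluationMetricsRequest {
  // The version of the model the metrics were computed against.
  int32 model_version = 1;
  // Maps a metric name (e.g. "accuracy") to its computed value.
  map<string, float> evaluation_metrics = 2;
}

message ReportEvaluationMetricsReply {
  // Whether the master accepted the metrics; it may, for example, reject
  // metrics computed against a stale model version.
  bool accepted = 1;
}

service Master {
  rpc ReportEvaluationMetrics(ReportEvaluationMetricsRequest)
      returns (ReportEvaluationMetricsReply);
}
```
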
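The worker-side evaluation could then follow a loop along these lines. This is
a minimal sketch under assumed names (`model`, `eval_metrics`, `batches`,
`report_fn`), not the actual ElasticDL API:

```python
def distributed_evaluate(model, eval_metrics, batches, report_fn, steps=None):
    """Hypothetical core of the worker's model evaluation logic.

    model: a callable (Keras-style) model already synced with the master.
    eval_metrics: dict mapping a metric name to a `tf.keras.metrics.Metric`.
    batches: iterable of (features, labels) evaluation batches.
    report_fn: callable that sends the metrics dict to the master via RPC.
    steps: number of batches to evaluate, or None to consume all input.
    """
    for num_batches, (features, labels) in enumerate(batches, start=1):
        predictions = model(features, training=False)
        for metric in eval_metrics.values():
            metric.update_state(labels, predictions)
        if steps is not None and num_batches >= steps:
            break
    # Report the computed metrics (e.g. accuracy, precision) back to the
    # master, e.g. through `report_evaluation_metrics()`.
    report_fn({name: m.result().numpy() for name, m in eval_metrics.items()})
```
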
## Future Development

A list of potential features we may want for model evaluation in the future:

- `num_parallel_processes`: The number of child processes used to run
evaluation on each individual evaluation worker.
- `sample_weights`: Optional NumPy array of weights for the test samples, used
for weighting the loss function.

## References

Some of the ideas are borrowed from existing solutions listed below:

- [`tf.keras.models.Model.evaluate()`](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model#evaluate)
- [`tf.keras.metrics`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)
- [`tf.estimator.EvalSpec`](https://www.tensorflow.org/api_docs/python/tf/estimator/EvalSpec)
- [`tf.estimator.Estimator.evaluate()`](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#evaluate)
- [`tf.estimator.train_and_evaluate()`](https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate)