Commit dbf22de

QA Process report for v0.37.x (and baseline for v0.34.x) (backport tendermint#9499) (tendermint#9578)

Squashed commits:

* 1st version. 200 nodes. Missing rotating node
* Small fixes
* Addressed @jmalicevic's comment
* Explain in method how to set the tmint version to test. Improve result section
* 1st version of how to run the 'rotating node' testnet
* Apply suggestions from @williambanfield
* Addressed @williambanfield's comments
* Added reference to Unix load metric
* Added total TXs
* Fixed some 'png's that got swapped. Excluded '.*-node-exporter' processes from memory plots
* Report for rotating node
* Addressed remaining comments from @williambanfield
* Cosmetic
* Addressed some of @thanethomson's comments
* Re-executed the 200 node tests and updated the corresponding sections of the report
* Ignore Python virtualenv directories
* Add latency vs throughput script
* Add README for latency vs throughput script
* Fix local links to folders
* v034: only have one level-1 heading
* Adjust headings
* v0.37.x: add links to issues/PRs
* v0.37.x: add note about bug being present in v0.34
* method: adjust heading depths
* Show data points on latency vs throughput plot
* Add latency vs throughput plots
* Correct mentioning of v0.34.21 and add heading
* Refactor latency vs throughput script: generate plots from the "raw" CSV output of the loadtime reporting tool rather than the separated CSV files from the experimental method; update the relevant documentation and regenerate the images from the raw CSV data (resulting in essentially the same plots as before)
* Remove unused default duration const
* Adjust experiment start time to be more accurate and re-plot latency vs throughput
* Apply suggestions from code review
* scripts: Update latency vs throughput readme for clarity
* Remove v037 dir
* Removed reference to v0.37 testnets

Signed-off-by: Thane Thomson <[email protected]>
Co-authored-by: William Banfield <[email protected]>
Co-authored-by: Thane Thomson <[email protected]>
Co-authored-by: Sergio Mena <[email protected]>
(cherry picked from commit b06e1ce)

1 parent 3ec6e42 commit dbf22de

29 files changed: +797 −0 lines

.gitignore (+2 lines)

```diff
@@ -65,3 +65,5 @@ test/fuzz/**/*.zip
 *.pdf
 *.gz
 *.dvi
+# Python virtual environments
+.venv
```
docs/qa/README.md (+22 lines)
---
order: 1
parent:
  title: Tendermint Quality Assurance
  description: This is a report on the process followed and results obtained when running v0.34.x on testnets
  order: 2
---

# Tendermint Quality Assurance

This directory keeps track of the process followed by the Tendermint Core team
for Quality Assurance before cutting a release.
This directory lives in multiple branches. On each release branch,
the contents of this directory reflect the status of the process
at the time the Quality Assurance process was applied for that release.

File [method](./method.md) describes the process followed to obtain the results
used to decide whether a release passes the Quality Assurance process.
The results obtained for each release are stored in their own directory.
The following releases have undergone the Quality Assurance process:

* [v0.34.x](./v034/), which was tested just before releasing v0.34.22

docs/qa/method.md (+214 lines)
---
order: 1
title: Method
---

# Method

This document provides a detailed description of the QA process.
It is intended to be used by engineers reproducing the experimental setup for future tests of Tendermint.

The (first iteration of the) QA process as described [in the RELEASES.md document][releases]
was applied to version v0.34.x in order to have a set of results acting as a benchmarking baseline.
This baseline is then compared with results obtained in later versions.

Out of the testnet-based test cases described in [the releases document][releases], we focused on two:
the _200 Node Test_ and the _Rotating Nodes Test_.

[releases]: https://github.com/tendermint/tendermint/blob/v0.37.x/RELEASES.md#large-scale-testnets
## Software Dependencies

### Infrastructure Requirements to Run the Tests

* An account at Digital Ocean (DO), with a high droplet limit (>202)
* The machine to orchestrate the tests should have the following installed:
    * A clone of the [testnet repository][testnet-repo]
        * This repository contains all the scripts mentioned in the remainder of this section
    * [Digital Ocean CLI][doctl]
    * [Terraform CLI][Terraform]
    * [Ansible CLI][Ansible]

[testnet-repo]: https://github.com/interchainio/tendermint-testnet
[Ansible]: https://docs.ansible.com/ansible/latest/index.html
[Terraform]: https://www.terraform.io/docs
[doctl]: https://docs.digitalocean.com/reference/doctl/how-to/install/
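Before provisioning any droplets, it can save time to confirm that the orchestration CLIs are actually reachable on the orchestrating machine's `PATH`. A minimal sketch (the binary names `doctl`, `terraform`, and `ansible` are assumed; adjust if your installation differs):

```bash
# Check that each orchestration tool is installed and reachable via PATH.
for tool in doctl terraform ansible; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```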
### Requirements for Result Extraction

* MATLAB or Octave
* A [Prometheus][prometheus] server installed
* The blockstore DB of one of the full nodes in the testnet
* The Prometheus DB

[prometheus]: https://prometheus.io/
## 200 Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnets/testnet200.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` in the `Makefile` to the git hash that is to be tested.
4. Follow steps 5-10 of the `README.md` to configure and start the 200 node testnet.
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests (see step 9).
5. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric.
   All nodes should be increasing their heights.
6. `ssh` into the `testnet-load-runner` node, then copy the script `script/200-node-loadscript.sh` to it and run it from there.
    * Before running it, you need to edit the script to provide the IP address of a full node.
      This node will receive all transactions from the load runner node.
    * The script takes about 40 minutes to run.
    * It runs 90-second-long experiments in a loop with different loads.
7. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
8. Verify that the data was collected without errors:
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
9. **Run `make terraform-destroy`**
    * Don't forget to type `yes`! Otherwise you're in trouble.
### Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage.
The Core team should improve it at every iteration to increase the amount of automation.

#### Steps

1. Unzip the blockstore into a directory.
2. Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore:
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ > results/report.txt`
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ --csv results/raw.csv`
3. File `report.txt` contains an unordered list of experiments with varying numbers of concurrent connections and transaction rates.
    * Create files `report01.txt`, `report02.txt`, and `report04.txt` and, for each experiment in file `report.txt`,
      copy its related lines to the file whose name matches the experiment's number of connections.
    * Sort the experiments in `report01.txt` in ascending tx rate order. Likewise for `report02.txt` and `report04.txt`.
4. Generate file `report_tabbed.txt` by showing the contents of `report01.txt`, `report02.txt`, and `report04.txt` side by side.
    * This effectively creates a table where rows are a particular tx rate and columns are a particular number of websocket connections.
5. Extract the raw latencies from file `raw.csv` using the following bash loop. This creates a `.csv` file and a `.dat` file per experiment.
   The format of the `.dat` files is amenable to loading them as matrices in Octave.

    ```bash
    uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
    c=0  # bash arrays are zero-indexed
    for i in 01 02 04; do
        for j in 0025 0050 0100 0200; do
            echo $i $j $c "${uuids[$c]}"
            filename=c${i}_r${j}
            grep ${uuids[$c]} raw.csv > ${filename}.csv
            cat ${filename}.csv | tr , ' ' | awk '{ print $2, $3 }' > ${filename}.dat
            c=$(expr $c + 1)
        done
    done
    ```
6. Enter Octave.
7. Load all `.dat` files generated in step 5 into matrices using this Octave code snippet:

    ```octave
    conns = { "01"; "02"; "04" };
    rates = { "0025"; "0050"; "0100"; "0200" };
    for i = 1:length(conns)
        for j = 1:length(rates)
            filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
            load("-ascii", filename);
        endfor
    endfor
    ```
8. Set the variable `release` to the current release undergoing QA:

    ```octave
    release = "v0.34.x";
    ```
9. Generate a plot with all (or some) experiments, where the X axis is the experiment time
   and the Y axis is the latency of transactions.
   The following snippet plots all experiments:

    ```octave
    legends = {};
    hold off;
    for i = 1:length(conns)
        for j = 1:length(rates)
            data_name = strcat("c", conns{i}, "_r", rates{j});
            l = strcat("c=", conns{i}, " r=", rates{j});
            m = eval(data_name);
            plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
            hold on;
            legends(1, end+1) = l;
        endfor
    endfor
    legend(legends, "location", "northeastoutside");
    xlabel("experiment time (s)");
    ylabel("latency (s)");
    t = sprintf("200-node testnet - %s", release);
    title(t);
    ```
10. Consider adjusting the axes, for instance if you want to compare your results to the baseline:

    ```octave
    axis([0, 100, 0, 30], "tic");
    ```
11. Use Octave's GUI menu to save the plot (e.g., as `.png`).
12. Repeat steps 9 and 10 to obtain as many plots as deemed necessary.
13. To generate a latency vs throughput plot using the raw CSV file generated
    in step 2, follow the instructions for the [`latency_throughput.py`] script.

[`latency_throughput.py`]: ../../scripts/qa/reporting/README.md
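Steps 3 and 4 above are manual; the sorting and side-by-side assembly could be scripted. The sketch below demonstrates the mechanics on synthetic one-line-per-experiment stand-ins for the per-connection files (the real `report01.txt`, `report02.txt`, and `report04.txt` are assembled by hand from `report.txt`, and the field holding the tx rate may differ, so adjust the `sort` key accordingly):

```bash
# Synthetic stand-ins for the per-connection report files built in step 3;
# each line stands for one experiment, with the tx rate in field 2.
printf 'rate 200\nrate 25\nrate 100\nrate 50\n' > report01.txt
printf 'rate 100\nrate 50\nrate 25\n' > report02.txt
printf 'rate 50\nrate 25\n' > report04.txt

# Sort each file in place by ascending tx rate (numeric sort on field 2).
for f in report01.txt report02.txt report04.txt; do
    sort -n -k2 -o "$f" "$f"
done

# Step 4: merge the three files side by side into report_tabbed.txt
# (rows: a particular tx rate; columns: a particular connection count).
pr -m -t report01.txt report02.txt report04.txt > report_tabbed.txt
cat report_tabbed.txt
```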
#### Extracting Prometheus Metrics

1. Stop the Prometheus server if it is running as a service (e.g., a `systemd` unit).
2. Unzip the Prometheus database retrieved from the testnet, and move it to replace the
   local Prometheus database.
3. Start the Prometheus server and make sure no error logs appear at startup.
4. Introduce the metrics you want to gather or plot.
## Rotating Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnet_rotating.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` to the git hash that is to be tested.
4. Run `make terraform-apply EPHEMERAL_SIZE=25`.
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests.
5. Follow steps 6-10 of the `README.md` to configure and start the "stable" part of the rotating node testnet.
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric.
   All nodes should be increasing their heights.
7. On a different shell,
    * run `make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y`
    * `X` and `Y` should reflect a load below the saturation point (see, e.g.,
      [this paragraph](./v034/README.md#finding-the-saturation-point) for further info)
8. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
    * WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length
      of the experiment.
9. When the height of the chain reaches 3000, stop the `make runload` script.
10. When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice)
    after height 3000 was reached, stop `make rotate`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
12. Verify that the data was collected without errors:
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
13. **Run `make terraform-destroy`**

Steps 8 to 10 are highly manual at the moment and will be improved in future iterations.
### Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but note that:

* The `report.txt` file contains only one experiment.
* Therefore, there is no need for any `for` loops.

As for Prometheus, the same method as for the 200 node experiment can be applied.
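Concretely, the per-experiment extraction loop from the 200-node section reduces to a single `grep`. The sketch below runs on synthetic stand-ins for `report.txt` and `raw.csv` (the real files come from the loadtime report tool; as in the 200-node loop, columns 2 and 3 of the CSV are assumed to hold the nanosecond timestamp and latency):

```bash
# Synthetic stand-ins for the real report.txt and raw.csv.
cat > report.txt <<'EOF'
Experiment ID: 11111111-2222-3333-4444-555555555555
EOF
cat > raw.csv <<'EOF'
11111111-2222-3333-4444-555555555555,1000000000,2000000000
11111111-2222-3333-4444-555555555555,1500000000,2500000000
EOF

# Single experiment: grab its UUID and extract timestamp/latency pairs,
# producing a .dat file loadable as a matrix in Octave.
uuid=$(grep '^Experiment ID: ' report.txt | awk '{ print $3 }')
grep "$uuid" raw.csv | tr , ' ' | awk '{ print $2, $3 }' > rotating.dat
cat rotating.dat
```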
