Commit 2969dbb

Doc: Explain server monitoring
1 parent e4312c6 commit 2969dbb

9 files changed: +226 -0 lines changed

linode/server_monitoring.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
1+
# Monitor Your Server on Linode
2+
3+
Now that your Linode is up and running, it’s time to think about monitoring and maintaining your server. This tutorial discusses some essential tools and skills we can use to check our server's resources. In the process, we will learn how to monitor the availability and performance of our system, manage our logs and update the server's software.
4+
5+
6+
Multiple things go into monitoring a server. For example, we might be interested in monitoring the following aspects of our server:
7+
8+
- [Availability of the server](#availability-of-the-server)
9+
- [Performance of the server](#performance-of-the-server)
10+
11+
It is therefore important to first assess what needs we have before embarking on a server monitoring mission.
12+
13+
## Availability of the Server
14+
15+
Not everyone needs to monitor the availability of their server. If you are running a very basic application, such as a morning quote website, you may not need to worry about service interruptions. The occasional inconvenience of the website going offline for a few minutes may not justify the time it takes to set up and configure an availability monitoring tool.
16+
17+
However, if you depend on your website, say for your livelihood, then monitoring your server is a necessity. Once set up, an availability monitoring tool actively watches the server and its services and alerts us when they are unavailable, so we can troubleshoot the problem and restore the service as soon as possible.
18+
19+
There are a handful of tools that we can use to monitor the availability of a server.
20+
21+
- If we are running multiple servers, we can use [Elastic Stack](https://www.elastic.co/elastic-stack/). It is a troika of tools (Elasticsearch, Logstash, and Kibana) that provides a free and open-source solution for collecting, searching and analyzing data from any source and in any format, and for visualizing it in real time. I will not go into the details of how to configure the server to use Elastic Stack for now.
22+
- If we are running only a single server, we may consider using a third-party service to monitor our linode.
23+
- Linode offers [Linode Managed](https://www.linode.com/managed), an expert 24/7 monitoring service. It carries no obligation or contract and costs $100 per month, per Linode on your account.
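
For a single small site, even a scheduled probe gives a rough availability signal before committing to a full monitoring tool. The following is a minimal sketch, assuming `curl` is installed on the machine doing the checking. The function names `check_url` and `report`, the 5-second default timeout and any URL passed in are illustrative choices, not part of any Linode tooling.

```shell
#!/bin/sh
# check_url URL [TIMEOUT_SECONDS]
# Hypothetical helper: succeeds (exit 0) only when the URL answers with a
# non-error HTTP status within the timeout. Assumes curl is installed.
check_url() {
    curl --fail --silent --output /dev/null \
         --max-time "${2:-5}" "$1" 2>/dev/null
}

# report URL
# Prints a one-line, timestamped UP/DOWN status suitable for a log file.
report() {
    if check_url "$1"; then
        echo "$(date -u +%FT%TZ) UP $1"
    else
        echo "$(date -u +%FT%TZ) DOWN $1"
    fi
}
```

Calling `report` with your site's address from a scheduled job and appending the output to a file gives a simple uptime history. It only tells you the site was unreachable from wherever the probe runs, so it is a rough signal rather than a replacement for the tools listed above.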
24+
25+
26+
### Configure Shutdown Watchdog
27+
28+
Occasionally, a linode may power off unexpectedly, making the server unavailable. For such cases, Linode offers a shutdown watchdog called Lassie that automatically reboots the linode. It is not an availability monitoring tool, but it is useful for getting a linode back online.
29+
30+
Log in to your Linode account to see your available linodes:
31+
![Available linodes](/images/linode/server_monitoring/dashboard.png)
32+
33+
Notice that I have 2 linodes on my account, _official_personal_website_ and _tinkereducationnewsletter_. I will show you how to configure Lassie using the _official_personal_website_ linode.
34+
35+
I will click on it to access more data on the linode. What I am interested in is the "Settings" tab.
36+
![Settings Tab](/images/linode/server_monitoring/settings.png)
37+
38+
Scroll to the bottom of the "Settings" tab to see the "Shutdown Watchdog" section. Toggle the switch to enable this feature.
39+
![Lassie](/images/linode/server_monitoring/lassie.png)
40+
41+
42+
## Performance of the Server
43+
44+
Performance monitoring tools report vital server and service performance metrics. They can be compared to a car's dashboard, which shows performance details such as speed and fuel consumption. We will begin with the default tools that monitor the performance of a server, then gradually check out a few more technical tools.
45+
46+
47+
### Linode Cloud Manager
48+
49+
Once our linode is up and running, Linode offers us the Cloud Manager, whose dashboard shows some performance data. This data can be accessed by clicking on a linode, in my case the _official_personal_website_ linode.
50+
51+
![Linode analytics](/images/linode/server_monitoring/analytics.png)
52+
53+
The graphs contain the following sections:
54+
55+
- CPU %: It shows how my linode's CPU is being utilized.
56+
- IPv4 network traffic: It keeps tabs on how much incoming and outgoing bandwidth the server is using.
57+
- IPv6 network traffic: It shows how much bandwidth has been transferred over IPv6.
58+
- Disk I/O: It shows the rate of read and write operations on my Linode's disk, not how full the disk is.
59+
60+
Chances are that you may not understand the graphs at first. It can be difficult to tell which numbers are normal and which are abnormal.
61+
62+
63+
### Email Alerts
64+
65+
Linode Cloud Manager allows us to configure email alerts that notify us when certain performance thresholds are reached.
66+
67+
![Email Alerts](/images/linode/server_monitoring/email_alerts.png)
68+
69+
In the illustration above, I have configured email notifications when the CPU Usage is 90% and above. To enable a particular threshold, toggle the appropriate switch, set a value and click the "Save" button to save the email alert threshold.
70+
71+
When we receive such an alert, it does not necessarily mean there is something wrong with the Linode. It simply means that the server is operating above a set threshold.
72+
73+
74+
## Linux System Monitoring Fundamentals
75+
76+
Monitoring tools reassure us when things are working right, and they help us recognize odd behaviour, performance anomalies and their sources when the server misbehaves. All server monitoring tools have a few things in common:
77+
78+
- They set a goal that ensures a server is performing optimally
79+
- They provide administrative data
80+
- They sometimes automate responses to anomalies
81+
82+
Data on each key performance indicator (KPI), network connectivity and application availability is collected and used for analysis. For example, the data can confirm that the hardware is working, the server is available, server resources are sufficient and no bottlenecks are slowing things down. The same data can also be visualized.
83+
84+
Thankfully, dozens of server monitoring tools are available for Linux, some of them built in. I will show you how to use the `top` command to see running Linux processes in CPU activity order. There are many more, such as [System Activity Report (sar)](https://linux.die.net/man/1/sar), [vmstat](https://linux.die.net/man/8/vmstat), [Monitorix](https://www.monitorix.org/), [Nethogs](https://github.com/raboof/nethogs), [Glances](https://nicolargo.github.io/glances/), [htop](https://htop.dev/) and [Netdata](https://www.netdata.cloud/).
85+
86+
87+
### Monitor Server Performance using `top`
88+
89+
If we can see a server's processor activity in real time, we are more likely to discover and diagnose any CPU and memory usage problems. The `top` command can assist with this monitoring.
90+
91+
In your server's terminal, run the command below:
92+
93+
```shell
94+
$ top
95+
```
96+
![Top running](/images/linode/server_monitoring/top_running.png)
97+
98+
This screen contains a variety of information regarding the server.
99+
100+
```text
101+
top - 14:56:17 up 127 days, 22:19, 2 users, load average: 0.01, 0.01, 0.00
102+
```
103+
104+
- The first line contains the **time, the uptime and load averages of the server**. The load average is displayed over 1, 5, and 15 minutes to provide a better overall look at the load my server has undertaken.
105+
- To properly read the load average, we need to know how many CPUs our Linode has. If there is 1 CPU, then a load average of 1.00 means that the server is operating at capacity. The equivalent figure is 2.00 if the number of CPUs is 2, and so on.
106+
- A load average of 0.70 for a Linode with 1 core is generally considered a warning threshold. Anything consistently higher calls for investigating the cause, reconfiguring resources or upgrading.
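
The per-CPU reading described above comes down to dividing the load average by the number of cores and comparing the result with the threshold. Here is a minimal sketch of that arithmetic; the function names `load_per_core` and `is_overloaded` are hypothetical, and the 0.70 default is simply the rule of thumb from the list above.

```shell
#!/bin/sh
# load_per_core LOAD CORES
# Divides a load average by the CPU count and prints it to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# is_overloaded LOAD CORES [THRESHOLD]
# Exits 0 when the per-core load is strictly above the threshold
# (0.70 by default), and 1 otherwise.
is_overloaded() {
    awk -v load="$1" -v cores="$2" -v limit="${3:-0.70}" \
        'BEGIN { exit !((load / cores) > limit) }'
}

# Example on a live Linux system (assumes /proc/loadavg and nproc exist):
#   read -r one five fifteen rest < /proc/loadavg
#   load_per_core "$five" "$(nproc)"
```

With these definitions, a load of 1.00 on 2 CPUs works out to 0.50 per core, which is below the threshold, while 1.50 on 2 CPUs works out to 0.75 and is above it.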
107+
108+
109+
```text
110+
Tasks: 118 total, 1 running, 117 sleeping, 0 stopped, 0 zombie
111+
```
112+
113+
- The second line is a **list of tasks and their various states**.
114+
115+
```text
116+
%Cpu(s): 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
117+
```
118+
119+
- The third line is the **CPU percentages**:
120+
- User CPU time (`us`)
121+
- System CPU time (`sy`)
122+
- Nice time (`ni`) - time spent on low-priority (niced) processes
123+
- Idle time (`id`)
124+
- Time spent waiting for I/O to complete (`wa`)
125+
- Time handling hardware interrupts (`hi`)
126+
- Time handling software interrupts (`si`)
127+
- Steal time (`st`) - time stolen from this virtual machine by the hypervisor
128+
129+
- The fourth line is the **server's memory usage**, shown here in mebibytes (MiB)
130+
131+
```text
132+
MiB Mem : 976.8 total, 97.0 free, 321.5 used, 558.3 buff/cache
133+
```
134+
135+
- The fifth line is the **server's swap usage**, shown here in mebibytes (MiB)
136+
137+
```text
138+
MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.1 avail Mem
139+
```
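
The summary lines above are plain text, so they can be turned into single numbers for a quick check. The sketch below assumes the line layouts shown in this tutorial (a `%Cpu(s):` line with an `id` field, and a `MiB Mem :` line with `total` and `used` fields); other `top` versions or locales may format them differently. The function names `cpu_busy` and `mem_used_pct` are hypothetical.

```shell
#!/bin/sh
# cpu_busy
# Reads a "%Cpu(s):" line on stdin and prints 100 minus the idle (id) value.
cpu_busy() {
    grep -oE '[0-9.]+ id' | awk '{ printf "%.1f\n", 100 - $1 }'
}

# mem_used_pct
# Reads a "MiB Mem :" line on stdin and prints used / total as a percentage.
mem_used_pct() {
    awk '{
        for (i = 2; i <= NF; i++) {
            word = $i
            sub(/,$/, "", word)                 # drop the trailing comma
            if (word == "total") total = $(i - 1)
            if (word == "used")  used  = $(i - 1)
        }
        if (total > 0) printf "%.1f\n", used / total * 100
    }'
}
```

Note that the `used` figure reported by `top` excludes `buff/cache`, so the percentage reflects memory held by processes rather than everything that is not free.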
140+
141+
Thereafter, we have a heading with a list of processes and related data:
142+
143+
![Top heading](/images/linode/server_monitoring/top_heading.png)
144+
145+
- **PID**: Process ID
146+
- **USER**: The username of the task owner
147+
- **PR**: The task's scheduling priority. For normal tasks it is 20 plus the nice value, so a lower number means a higher priority; `rt` indicates a real-time task
148+
- **NI**: The _nice value_, from -20 to 19, which adjusts the priority of a task. Negative values increase a task's priority while positive values decrease it.
149+
- **VIRT**: The virtual memory (both RAM and swap combined) used
150+
- **RES**: The resident non-swapped, physical memory in kilobytes (usually)
151+
- **SHR**: The shared memory size, or the part of resident memory that could be shared with other processes
152+
- **S**: The process status. `R` for running, `D` for sleeping and unable to be interrupted, `S` for sleeping and able to be interrupted, `T` for traced/stopped and `Z` for zombie
153+
- **%CPU**: CPU percentage since the last `top` update
154+
- **%MEM**: The task's current share of physical memory (RAM)
155+
- **TIME+**: Total CPU time the task has used since it started, shown to hundredths of a second
156+
- **COMMAND**: Name of process
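
To make the columns concrete, the sketch below picks a few of them out of process rows produced by `top` in batch mode and converts **RES** from kilobytes to MiB. It assumes the default twelve-column layout listed above and plain kilobyte values in the RES column (`top` may add a unit suffix to very large values, which this sketch does not handle). The function name `top_rows_summary` is hypothetical.

```shell
#!/bin/sh
# top_rows_summary
# Reads `top -b` output on stdin and prints, for each process row:
#   PID COMMAND RES-in-MiB %MEM
# Process rows are recognised by a numeric first field, so the summary
# lines and the column heading are skipped.
top_rows_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 12 {
        printf "%s %s %.1f MiB %s%%\n", $1, $12, $6 / 1024, $10
    }'
}
```

Fed a row where RES is `59052`, this prints the resident memory as `57.7 MiB`, which is easier to compare with the `MiB Mem` summary line than the raw kilobyte figure.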
157+
158+
159+
### `top` Commands
160+
161+
The `top` command accepts options on the command line and commands typed interactively. Important command-line options include:
162+
163+
- `-d [interval]`: Sets the delay, in seconds, that `top` waits between refreshes
164+
- `-i`: Toggles whether or not the idle processes are shown
165+
- `-p [PID]`: Allows the user to filter `top` so only defined processes are shown
166+
- `-u [username]`: Filters by user
167+
- `-n [limit]`: Sets `top` to run for a set number of iterations before exiting
168+
- `-b`: Runs `top` in batch mode, which is ideal for log files and in conjunction with other programs
169+
170+
```shell
171+
$ top -b -p3304014 -d10 -n2
172+
173+
# Output
174+
175+
top - 15:53:58 up 127 days, 23:16, 2 users, load average: 0.09, 0.02, 0.01
176+
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
177+
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
178+
MiB Mem : 976.8 total, 93.7 free, 321.3 used, 561.8 buff/cache
179+
MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.0 avail Mem
180+
181+
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
182+
3304014 gitauha+ 20 0 90192 59052 7200 S 0.0 5.9 0:08.62 gunicorn
183+
184+
top - 15:54:03 up 127 days, 23:16, 2 users, load average: 0.08, 0.02, 0.00
185+
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
186+
%Cpu(s): 0.2 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
187+
MiB Mem : 976.8 total, 93.7 free, 321.3 used, 561.8 buff/cache
188+
MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.0 avail Mem
189+
190+
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
191+
3304014 gitauha+ 20 0 90192 59052 7200 S 0.0 5.9 0:08.62 gunicorn
192+
193+
```
194+
195+
The above `top` command runs in batch mode, reports only the process with PID `3304014`, waits 10 seconds between updates and exits after 2 iterations.
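
Because batch mode writes plain text, its output can be saved and post-processed. The following is a minimal sketch that pulls the `%CPU` column for one PID out of saved batch output, one value per iteration. The function name `cpu_samples` and the file name `top.log` are hypothetical, and the column position assumes the default layout shown in the output above.

```shell
#!/bin/sh
# cpu_samples PID
# Reads saved `top -b` output on stdin and prints the %CPU value
# (column 9 in the default layout) from every row belonging to PID.
cpu_samples() {
    awk -v pid="$1" '$1 == pid { print $9 }'
}

# Example (assumes the batch output was saved to a hypothetical top.log):
#   top -b -p 3304014 -d 10 -n 2 > top.log
#   cpu_samples 3304014 < top.log
```

A longer run with a larger `-n` value gives a simple time series for the process that can be inspected or plotted later.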
196+
197+
Interactively, we can issue the following commands in an active `top` session:
198+
199+
- `return` or `space`: Instantly update the screen
200+
- `d` or `s`: Alter the delay time
201+
- `H`: Show individual threads for all processes
202+
- `i`: Toggles whether idle processes will be displayed
203+
- `U` or `u`: Filter the process by the owner's username
204+
- `k`: Kill a process. You will be prompted to enter the PID
205+
- `q`: Quit
206+
207+
208+
### Commands Similar to `top`
209+
210+
There is [htop](http://hisham.hm/htop/), which is similar to `top`, but offers an easier interface with color, mouse operations, and horizontal and vertical scrolling, making it more intuitive.
211+
212+
To use it, we first need to install it. On a Debian-based distribution such as Ubuntu, run the command:
213+
214+
```shell
215+
$ sudo apt install htop
216+
```
217+
218+
Running it is similar to running `top`:
219+
220+
```shell
221+
$ htop
222+
```
223+
224+
![htop](/images/linode/server_monitoring/htop.png)
225+
226+
You can use your mouse to scroll the interactive process viewer. You can click on a process using your mouse to highlight it, then press `k`, for example, to kill it. At the bottom, you will notice a few buttons that you can click on.
