|
| 1 | +# Monitor Your Server on Linode |
| 2 | + |
| 3 | +Now that your Linode is up and running, it’s time to think about monitoring and maintaining your server. This tutorial discusses some essential tools and skills we can use to check our server's resources. In the process, we will learn how to monitor the availability and performance of our system, manage our logs and update the server's software. |
| 4 | + |
| 5 | + |
| 6 | +Multiple things go into monitoring a server. For example, we might be interested in monitoring the following aspects of our server: |
| 7 | + |
| 8 | +- [Availability of the server](#availability-of-the-server) |
| 9 | +- [Performance of the server](#performance-of-the-server) |
| 10 | + |
| 11 | +It is therefore important to first assess what needs we have before embarking on a server monitoring mission. |
| 12 | + |
| 13 | +## Availability of the Server |
| 14 | + |
| 15 | +Not everyone needs to monitor the availability of their server. If you are running a very basic application such as a morning quote website, you may not necessarily need to worry about service interruptions. The occasional inconvenience of the website going offline for a few minutes may not justify the time it takes to set up and configure an availability monitoring tool. |
| 16 | + |
| 17 | +However, if you depend on your website, say, for your livelihood, then monitoring your server is a necessity. Once set up, an availability monitoring tool actively watches the server and its services and alerts us when they are unavailable, so we can troubleshoot the problem and restore the service as soon as possible. |
| 18 | + |
| 19 | +There are a handful of tools that we can use to monitor the availability of a server. |
| 20 | + |
| 21 | +- If we are running multiple servers, we can use [Elastic Stack](https://www.elastic.co/elastic-stack/). Elastic Stack, which includes Elasticsearch, Logstash, and Kibana, is a trio of tools that provides a free and open-source solution for searching, collecting, and analyzing data from any source and in any format, and for visualizing it in real time. I will not go into the details of how to configure the server to use Elastic Stack for now. |
| 22 | +- If we are running only a single server, we may consider using a third-party service to monitor our Linode. |
| 23 | +- Linode offers [Linode Managed](https://www.linode.com/managed), an expert 24/7 monitoring service. It carries no obligation or contract and costs $100 per month, per Linode on your account. |
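To make the idea of "actively watching" a service concrete, here is a minimal sketch of the kind of check an availability monitor performs: it tries to open a TCP connection and reports whether the attempt succeeded. The function name `is_port_open` and the host and port in the usage section are assumptions made for illustration; none of the tools listed above is implemented this way as far as this tutorial knows.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    # Hypothetical target: replace with your own server's address and port
    host, port = "localhost", 22
    status = "available" if is_port_open(host, port) else "unavailable"
    print(f"{host}:{port} is {status}")
```

A real monitoring tool would run a check like this on a schedule and send an alert after repeated failures, rather than printing a single result.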
| 24 | + |
| 25 | + |
| 26 | +### Configure Shutdown Watchdog |
| 27 | + |
| 28 | +Occasionally, a Linode may power off unexpectedly, making the server unavailable. For this case, Linode offers a shutdown watchdog called Lassie that automatically reboots a Linode when this happens. It is not an availability monitoring tool, but it is useful for getting a Linode back online. |
| 29 | + |
| 30 | +Log in to your Linode account to see your available Linodes: |
| 31 | + |
| 32 | + |
| 33 | +Notice that I have two Linodes on my account, _official_personal_website_ and _tinkereducationnewsletter_. I will show you how to configure Lassie using the _official_personal_website_ Linode. |
| 34 | + |
| 35 | +I will click on it to see more details about the Linode. What I am interested in is the "Settings" tab. |
| 36 | + |
| 37 | + |
| 38 | +Scroll to the bottom of the "Settings" tab to see the "Shutdown Watchdog" section. Toggle the switch to enable this feature. |
| 39 | + |
| 40 | + |
| 41 | + |
| 42 | +## Performance of the Server |
| 43 | + |
| 44 | +Performance monitoring tools report vital server and service performance metrics. They can be compared to a car's dashboard, which shows performance details such as speed and fuel consumption. We will begin with the default tools that monitor the performance of a server, then gradually move on to a few more technical tools. |
| 45 | + |
| 46 | + |
| 47 | +### Linode Cloud Manager |
| 48 | + |
| 49 | +Once our Linode is up and running, the Linode Cloud Manager shows some performance data in our dashboard. This data can be accessed by clicking on a Linode, in my case the _official_personal_website_ Linode. |
| 50 | + |
| 51 | + |
| 52 | + |
| 53 | +The graphs cover the following metrics: |
| 54 | + |
| 55 | +- CPU %: It shows how my Linode's CPU is being utilized. |
| 56 | +- IPv4 network traffic: It keeps tabs on how much incoming and outgoing bandwidth the server is using over IPv4. |
| 57 | +- IPv6 network traffic: It shows how much incoming and outgoing bandwidth the server is using over IPv6. |
| 58 | +- Disk I/O: It shows how much read and write activity is taking place on my Linode's disks. It does not show how full the disks are. |
| 59 | + |
| 60 | +Chances are that the graphs will not make much sense at first. It can be difficult to tell which numbers are normal and which are abnormal. |
| 61 | + |
| 62 | + |
| 63 | +### Email Alerts |
| 64 | + |
| 65 | +Linode Cloud Manager allows us to configure email alerts that notify us when certain performance thresholds are reached. |
| 66 | + |
| 67 | + |
| 68 | + |
| 69 | +In the illustration above, I have configured email notifications when the CPU Usage is 90% and above. To enable a particular threshold, toggle the appropriate switch, set a value and click the "Save" button to save the email alert threshold. |
| 70 | + |
| 71 | +When we receive such an alert, it does not necessarily mean there is something wrong with the Linode. It simply means that the server is operating above a set threshold. |
| 72 | + |
| 73 | + |
| 74 | +## Linux System Monitoring Fundamentals |
| 75 | + |
| 76 | +Monitoring tools reassure us when things are working right. When the server misbehaves, they help us recognize odd behaviour and performance anomalies and trace their sources. All server monitoring tools have a few things in common: |
| 77 | + |
| 78 | +- They aim to ensure that a server is performing optimally |
| 79 | +- They provide administrative data |
| 80 | +- They sometimes automate responses to anomalies |
| 81 | + |
| 82 | +Data on each key performance indicator (KPI), network connectivity and application availability is collected and used for analysis. The analysis helps confirm, for example, that the hardware is working, the server is available, server resources are sufficient, and no bottlenecks are slowing things down. Many tools also visualize this data. |
| 83 | + |
| 84 | +Thankfully, dozens of system monitoring tools are available on Linux. I will show you how to use the `top` command to see running Linux processes ordered by CPU activity. There are many more, such as [System Activity Report (sar)](https://linux.die.net/man/1/sar), [vmstat](https://linux.die.net/man/8/vmstat), [Monitorix](https://www.monitorix.org/), [Nethogs](https://github.com/raboof/nethogs), [Glances](https://nicolargo.github.io/glances/), [htop](https://htop.dev/) and [Netdata](https://www.netdata.cloud/). |
| 85 | + |
| 86 | + |
| 87 | +### Monitor Server Performance using `top` |
| 88 | + |
| 89 | +If we can see a server's processor activity in real time, we are more likely to discover and diagnose any CPU and memory usage problems. The `top` command can assist with this monitoring. |
| 90 | + |
| 91 | +In your server's terminal, run the command below: |
| 92 | + |
| 93 | +```bash |
| 94 | +$ top |
| 95 | +``` |
| 96 | + |
| 97 | + |
| 98 | +This screen contains a variety of information regarding the server. |
| 99 | + |
| 100 | +```text |
| 101 | +top - 14:56:17 up 127 days, 22:19, 2 users, load average: 0.01, 0.01, 0.00 |
| 102 | +``` |
| 103 | + |
| 104 | +- The first line contains the **time, the uptime and load averages of the server**. The load average is displayed over 1, 5, and 15 minutes to provide a better overall look at the load my server has undertaken. |
| 105 | +- To properly read the load average, we need to know how many CPUs our Linode has. If there is 1 CPU, then a load average of 1.00 means that the server is operating at its capacity. With 2 CPUs, full capacity corresponds to a load average of 2.00, and so on. |
| 106 | +- A load average of 0.70 for a Linode with 1 core is generally considered a threshold. If the load stays above it, consider reconfiguring resources or upgrading the Linode. |
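The rule of thumb above can be sketched in a few lines of Python: divide the load average by the number of CPUs and compare the result with the 0.70 threshold. The function names `load_per_cpu` and `describe_load` and the wording of the returned labels are assumptions made for this illustration, not part of `top` or of any Linode tool.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Return the load average normalized by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def describe_load(load_average, cpu_count, threshold=0.70):
    """Classify a load average using the per-CPU rule of thumb."""
    ratio = load_per_cpu(load_average, cpu_count)
    if ratio >= 1.0:
        return "at or over capacity"
    if ratio > threshold:
        return "above threshold"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages on Unix-like systems
    cpus = os.cpu_count() or 1
    for minutes, value in zip((1, 5, 15), os.getloadavg()):
        print(f"{minutes:>2} min: {value:.2f} on {cpus} CPU(s) -> {describe_load(value, cpus)}")
```

With the sample output above (`0.01, 0.01, 0.00` on a single CPU), all three averages fall well below the threshold.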
| 107 | + |
| 108 | + |
| 109 | +```text |
| 110 | +Tasks: 118 total, 1 running, 117 sleeping, 0 stopped, 0 zombie |
| 111 | +``` |
| 112 | + |
| 113 | +- The second line is a **list of tasks and their various states**. |
| 114 | + |
| 115 | +```text |
| 116 | +%Cpu(s): 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st |
| 117 | +``` |
| 118 | + |
| 119 | +- The third line is the **CPU percentages**: |
| 120 | +    - User CPU time (`us`) |
| 121 | +    - System CPU time (`sy`) |
| 122 | +    - Nice time (`ni`) - time spent on low-priority processes |
| 123 | +    - Idle time (`id`) |
| 124 | +    - Time spent waiting for I/O to complete (`wa`) |
| 125 | +    - Time spent handling hardware interrupts (`hi`) |
| 126 | +    - Time spent handling software interrupts (`si`) |
| 127 | +    - Steal time (`st`) - time stolen from this virtual machine by the hypervisor |
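If we want to use these percentages in a script, the CPU line can be split into labelled values. The snippet below is a minimal sketch that assumes the line has the same layout as the sample output shown above; the helper name `parse_cpu_line` is hypothetical, and other versions of `top` may format this line differently.

```python
import re

# The two-letter labels that appear on the %Cpu(s) line of the sample output
CPU_FIELDS = ("us", "sy", "ni", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return a dict mapping each CPU label to its percentage as a float."""
    # Each value is a number followed by whitespace and a two-letter label
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {label: float(value) for value, label in pairs if label in CPU_FIELDS}


if __name__ == "__main__":
    sample = "%Cpu(s):  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
    cpu = parse_cpu_line(sample)
    # Everything that is not idle time counts as the CPU being busy
    print(f"busy: {100.0 - cpu['id']:.1f}%")
```

Subtracting the idle value from 100 gives a single "busy" figure that is easier to compare against an alert threshold than the eight separate values.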
| 128 | + |
| 129 | +- The fourth line is the **server's memory usage**, shown here in mebibytes (MiB) |
| 130 | + |
| 131 | +```text |
| 132 | +MiB Mem : 976.8 total, 97.0 free, 321.5 used, 558.3 buff/cache |
| 133 | +``` |
| 134 | + |
| 135 | +- The fifth line is the **server's swap usage**, shown here in mebibytes (MiB) |
| 136 | + |
| 137 | +```text |
| 138 | +MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.1 avail Mem |
| 139 | +``` |
| 140 | + |
| 141 | +Thereafter, we have a heading followed by a list of processes and related data: |
| 142 | + |
| 143 | + |
| 144 | + |
| 145 | +- **PID**: Process ID |
| 146 | +- **USER**: The username of the task owner |
| 147 | +- **PR**: The task's scheduling priority as seen by the kernel. Lower values mean higher priority |
| 148 | +- **NI**: The _nice value_, from -20 to 19, which adjusts the priority of a task. Negative values increase a task's priority while positive values decrease it |
| 149 | +- **VIRT**: The total virtual memory used by the task, including memory that has been swapped out |
| 150 | +- **RES**: The resident non-swapped, physical memory in kilobytes (usually) |
| 151 | +- **SHR**: The shared memory size, or memory that could be shared with other processes |
| 152 | +- **S**: The process status. `R` for running, `D` for sleeping and unable to be interrupted, `S` for sleeping and able to be interrupted, `T` for traced/stopped and `Z` for zombie |
| 153 | +- **%CPU**: The task's share of CPU time since the last `top` update |
| 154 | +- **%MEM**: The task's current share of physical memory (RAM) |
| 155 | +- **TIME+**: Total CPU time the task has used since it started, shown to hundredths of a second |
| 156 | +- **COMMAND**: Name of process |
| 157 | + |
| 158 | + |
| 159 | +### `top` Commands |
| 160 | + |
| 161 | +The `top` command accepts options on the command line and commands entered interactively. Important command-line options include: |
| 162 | + |
| 163 | +- `-d [interval]`: Sets the delay, in seconds, that `top` waits between refreshes |
| 164 | +- `-i`: Toggles whether or not the idle processes are shown |
| 165 | +- `-p [PID]`: Allows the user to filter `top` so only defined processes are shown |
| 166 | +- `-u [username]`: Filters by user |
| 167 | +- `-n [limit]`: Sets `top` to run for a set number of iterations before exiting |
| 168 | +- `-b`: Runs `top` in batch mode, which is ideal for log files and in conjunction with other programs |
| 169 | + |
| 170 | +```bash |
| 171 | +$ top -b -p3304014 -d10 -n2 |
| 172 | + |
| 173 | +# Output |
| 174 | + |
| 175 | +top - 15:53:58 up 127 days, 23:16, 2 users, load average: 0.09, 0.02, 0.01 |
| 176 | +Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie |
| 177 | +%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st |
| 178 | +MiB Mem : 976.8 total, 93.7 free, 321.3 used, 561.8 buff/cache |
| 179 | +MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.0 avail Mem |
| 180 | + |
| 181 | + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND |
| 182 | +3304014 gitauha+ 20 0 90192 59052 7200 S 0.0 5.9 0:08.62 gunicorn |
| 183 | + |
| 184 | +top - 15:54:03 up 127 days, 23:16, 2 users, load average: 0.08, 0.02, 0.00 |
| 185 | +Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie |
| 186 | +%Cpu(s): 0.2 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st |
| 187 | +MiB Mem : 976.8 total, 93.7 free, 321.3 used, 561.8 buff/cache |
| 188 | +MiB Swap: 512.0 total, 423.0 free, 89.0 used. 497.0 avail Mem |
| 189 | + |
| 190 | + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND |
| 191 | +3304014 gitauha+ 20 0 90192 59052 7200 S 0.0 5.9 0:08.62 gunicorn |
| 192 | + |
| 193 | +``` |
| 194 | + |
| 195 | +The above `top` command runs in batch mode and reports on the process with PID `3304014` for 2 iterations, with the delay between updates set to 10 seconds. |
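Because batch mode is meant to be consumed by other programs, here is a minimal sketch of how a script might build the same command and read the process rows from its output. The helper names `build_top_command` and `parse_process_rows` are hypothetical, and the parser assumes the column layout shown in the output above.

```python
import shutil
import subprocess


def build_top_command(pid, delay, iterations):
    """Build the argument list for a batch-mode top run on one process."""
    return ["top", "-b", "-p", str(pid), "-d", str(delay), "-n", str(iterations)]


def parse_process_rows(batch_output):
    """Return one dict per process row found in top's batch output."""
    rows = []
    header = None
    for raw_line in batch_output.splitlines():
        line = raw_line.strip()
        if not line:
            header = None  # a blank line ends the current process table
            continue
        if line.split()[0] == "PID":
            header = line.split()
            continue
        if header is not None:
            # COMMAND is the last column, so limit the number of splits
            values = line.split(None, len(header) - 1)
            rows.append(dict(zip(header, values)))
    return rows


if __name__ == "__main__":
    # Hypothetical PID: replace it with a process ID from your own server
    command = build_top_command(pid=1, delay=1, iterations=2)
    if shutil.which("top"):
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        for row in parse_process_rows(result.stdout):
            print(row["PID"], row["%CPU"], row["%MEM"], row["COMMAND"])
```

All values are kept as strings. A script that needs numbers would convert fields such as `%CPU` with `float()`.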
| 196 | + |
| 197 | +Interactively, we can issue the following commands in an active `top` session: |
| 198 | + |
| 199 | +- `Enter` or `Space`: Instantly update the screen |
| 200 | +- `d` or `s`: Alter the delay time |
| 201 | +- `H`: Show individual threads for all processes |
| 202 | +- `i`: Toggle whether idle processes are displayed |
| 203 | +- `U` or `u`: Filter the process by the owner's username |
| 204 | +- `k`: Kill a process. You will be prompted to enter the PID |
| 205 | +- `q`: Quit |
| 206 | + |
| 207 | + |
| 208 | +### Commands Similar to `top` |
| 209 | + |
| 210 | +There is [htop](http://hisham.hm/htop/), which is similar to `top`, but offers an easier interface with color, mouse operations, and horizontal and vertical scrolling, making it more intuitive. |
| 211 | + |
| 212 | +To use it, we first need to install it. On Debian or Ubuntu, run the command: |
| 213 | + |
| 214 | +```bash |
| 215 | +$ sudo apt install htop |
| 216 | +``` |
| 217 | + |
| 218 | +Running it is similar to running `top`: |
| 219 | + |
| 220 | +```bash |
| 221 | +$ htop |
| 222 | +``` |
| 223 | + |
| 224 | + |
| 225 | + |
| 226 | +You can use your mouse to scroll the interactive process viewer. You can click on a process to highlight it, then press `k`, for example, to kill it. At the bottom, you will notice a few buttons that you can click on. |