Commit 12c3ba3

Update debug, perf, trace and page_cache
Fix issue: freelancer-leon#2 Signed-off-by: Leon <[email protected]>
1 parent d07a462 commit 12c3ba3

4 files changed

+280
-22
lines changed

kernel/debug/debug.md

+131-7
@@ -1,5 +1,56 @@
11
# Debugging
22

3+
# Common debugging options
4+
## lib/Kconfig.debug
5+
```c
6+
config DEBUG_INFO
7+
bool "Compile the kernel with debug info"
8+
depends on DEBUG_KERNEL && !COMPILE_TEST
9+
help
10+
If you say Y here the resulting kernel image will include
11+
debugging info resulting in a larger kernel image.
12+
This adds debug symbols to the kernel and modules (gcc -g), and
13+
is needed if you intend to use kernel crashdump or binary object
14+
tools like crash, kgdb, LKCD, gdb, etc on the kernel.
15+
Say Y here only if you plan to debug the kernel.
16+
17+
If unsure, say N.
18+
config DEBUG_KERNEL
19+
bool "Kernel debugging"
20+
help
21+
Say Y here if you are developing drivers or trying to debug and
22+
identify kernel problems.
23+
24+
```
25+
* Without `CONFIG_DEBUG_INFO` enabled, C source code cannot be interleaved with the disassembly of `vmlinux`
26+
## init/Kconfig
27+
```c
28+
config KALLSYMS
29+
bool "Load all symbols for debugging/ksymoops" if EXPERT
30+
default y
31+
help
32+
Say Y here to let the kernel print out symbolic crash information and
33+
symbolic stack backtraces. This increases the size of the kernel
34+
somewhat, as all symbols have to be loaded into the kernel image.
35+
36+
config KALLSYMS_ALL
37+
bool "Include all symbols in kallsyms"
38+
depends on DEBUG_KERNEL && KALLSYMS
39+
help
40+
Normally kallsyms only contains the symbols of functions for nicer
41+
OOPS messages and backtraces (i.e., symbols from the text and inittext
42+
sections). This is sufficient for most cases. And only in very rare
43+
cases (e.g., when a debugger is used) all symbols are required (e.g.,
44+
names of variables from the data sections, etc).
45+
46+
This option makes sure that all symbols are loaded into the kernel
47+
image (i.e., symbols from all sections) in cost of increased kernel
48+
size (depending on the kernel configuration, it may be 300KiB or
49+
something like this).
50+
51+
Say N unless you really need all symbols.
52+
```
53+
354
# Printing (print)
455

556
## printk
@@ -13,7 +64,7 @@
1364
Append ``",keep"`` to not disable it when the real console takes over.
1465
```
1566

16-
```
67+
```c
1768
keep_bootcon [KNL]
1869
Do not unregister boot console at start. This is only
1970
useful for debugging when something happens in the window
@@ -28,7 +79,7 @@
2879
printk(KERN_DEBUG "error=%d\n", error);
2980
```
3081
* Tuning
31-
```
82+
```sh
3283
echo 5 > /proc/sys/kernel/printk_ratelimit # seconds to wait before printk messages are re-enabled
3384
echo 10 > /proc/sys/kernel/printk_ratelimit_burst # number of messages allowed before rate limiting starts
3485
```
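The interaction of the two knobs above can be sketched in userspace C. This is only a simplified model under stated assumptions: `struct ratelimit` and `ratelimit_allow()` are hypothetical names, time is passed in as whole seconds, and the kernel's own logic (in lib/ratelimit.c) differs in detail.

```c
#include <stdbool.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
    long interval;  /* window length in seconds, like printk_ratelimit */
    int burst;      /* messages allowed per window, like printk_ratelimit_burst */
    long begin;     /* start time of the current window */
    int printed;    /* messages let through in the current window */
    int missed;     /* messages suppressed so far */
};

/* Returns true if a message arriving at time `now` may be printed. */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    if (now - rl->begin >= rl->interval) {
        /* The window has expired: open a new one. */
        rl->begin = now;
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}
```

With `interval = 5` and `burst = 10`, a flood of messages within one 5-second window lets the first 10 through and suppresses the rest until the window expires.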
@@ -37,7 +88,7 @@
3788

3889
### loglevel
3990

40-
```
91+
```c
4192
loglevel= All Kernel Messages with a loglevel smaller than the
4293
console loglevel will be printed to the console. It can
4394
also be changed with klogd or other programs. The
@@ -141,7 +192,7 @@ static int __init loglevel(char *str)
141192
early_param("loglevel", loglevel);
142193
...*```
143194
```
144-
### Adjusting
195+
### Adjusting the log level
145196
146197
* If a printk call does not specify a log level, it defaults to `DEFAULT_MESSAGE_LOGLEVEL` (usually "4" = KERN_WARNING)
147198
* This default can be changed through the `CONFIG_DEFAULT_MESSAGE_LOGLEVEL` kernel config option (make menuconfig -> Kernel Hacking -> Default message log level).
@@ -158,7 +209,7 @@ early_param("loglevel", loglevel);
158209
# echo 8 > /proc/sys/kernel/printk
159210
```
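The filtering rule described in this section (a message reaches the console only if its level is numerically smaller than the console loglevel, and an unspecified level falls back to the default) can be sketched as follows. `shown_on_console()` and `LEVEL_UNSPECIFIED` are hypothetical names for illustration, and the default of 4 is the usual value mentioned above, not a guarantee for every kernel configuration.

```c
#include <stdbool.h>

#define DEFAULT_MESSAGE_LOGLEVEL 4  /* KERN_WARNING, the usual default */
#define LEVEL_UNSPECIFIED (-1)      /* hypothetical marker: caller gave no level */

/* Returns true if a message with `msg_level` would appear on the console. */
static bool shown_on_console(int msg_level, int console_loglevel)
{
    if (msg_level == LEVEL_UNSPECIFIED)
        msg_level = DEFAULT_MESSAGE_LOGLEVEL;
    /* Smaller numbers are more severe; only levels below the threshold print. */
    return msg_level < console_loglevel;
}
```

Under this model, writing 8 as the console loglevel lets even `KERN_DEBUG` (7) messages through, while a console loglevel of 4 hides messages printed at the default level.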
160211
### logbuffer
161-
```
212+
```c
162213
log_buf_len=n[KMG] Sets the size of the printk ring buffer,
163214
in bytes. n must be a power of two and greater
164215
than the minimal size. The minimal size is defined
@@ -194,7 +245,7 @@ log_buf_len=n[KMG] Sets the size of the printk ring buffer,
194245
* For the sysrq key mapping table, see the `struct sysrq_key_op *sysrq_key_table[]` array in drivers/tty/sysrq.c
195246

196247
## Finding regressions with git bisect
197-
```
248+
```sh
198249
git bisect start
199250
git bisect bad [revision with the bug]
200251
git bisect good [revision without the bug]
@@ -209,7 +260,80 @@ git bisect start - arch/x86
209260
```
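The idea behind bisection is a binary search between a known-good and a known-bad revision. A minimal sketch, assuming a linear history indexed by integers (real git history is a DAG, which git handles itself); `bisect_first_bad()`, `is_bad_fn` and `bad_since()` are hypothetical names, not part of git:

```c
#include <stdbool.h>
#include <stddef.h>

/* Callback that builds/tests revision `rev` and reports whether the bug is present. */
typedef bool (*is_bad_fn)(size_t rev, void *ctx);

/*
 * `good` is a revision known to work, `bad` one known to fail, good < bad.
 * Returns the first bad revision; `*steps` counts how many revisions were tested.
 */
static size_t bisect_first_bad(size_t good, size_t bad,
                               is_bad_fn is_bad, void *ctx, unsigned *steps)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;   /* bug already present: look earlier */
        else
            good = mid;  /* still fine: look later */
        if (steps)
            (*steps)++;
    }
    return bad;
}

/* Demo predicate: every revision at or after *ctx contains the bug. */
static bool bad_since(size_t rev, void *ctx)
{
    return rev >= *(size_t *)ctx;
}
```

For 100 candidate revisions this needs at most 7 tests, which is why bisecting is practical even when each test means building and booting a kernel.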
210261

211262
# Dynamic Debug
212-
* CONFIG_DYNAMIC_DEBUG
263+
* `CONFIG_DYNAMIC_DEBUG` (lib/Kconfig.debug)
264+
```c
265+
config DYNAMIC_DEBUG
266+
bool "Enable dynamic printk() support"
267+
default n
268+
depends on PRINTK
269+
depends on (DEBUG_FS || PROC_FS)
270+
select DYNAMIC_DEBUG_CORE
271+
help
272+
273+
Compiles debug level messages into the kernel, which would not
274+
otherwise be available at runtime. These messages can then be
275+
enabled/disabled based on various levels of scope - per source file,
276+
function, module, format string, and line number. This mechanism
277+
implicitly compiles in all pr_debug() and dev_dbg() calls, which
278+
enlarges the kernel text size by about 2%.
279+
280+
If a source file is compiled with DEBUG flag set, any
281+
pr_debug() calls in it are enabled by default, but can be
282+
disabled at runtime as below. Note that DEBUG flag is
283+
turned on by many CONFIG_*DEBUG* options.
284+
285+
Usage:
286+
287+
Dynamic debugging is controlled via the 'dynamic_debug/control' file,
288+
which is contained in the 'debugfs' filesystem or procfs.
289+
Thus, the debugfs or procfs filesystem must first be mounted before
290+
making use of this feature.
291+
We refer the control file as: <debugfs>/dynamic_debug/control. This
292+
file contains a list of the debug statements that can be enabled. The
293+
format for each line of the file is:
294+
295+
filename:lineno [module]function flags format
296+
297+
filename : source file of the debug statement
298+
lineno : line number of the debug statement
299+
module : module that contains the debug statement
300+
function : function that contains the debug statement
301+
flags : '=p' means the line is turned 'on' for printing
302+
format : the format used for the debug statement
303+
304+
From a live system:
305+
306+
nullarbor:~ # cat <debugfs>/dynamic_debug/control
307+
# filename:lineno [module]function flags format
308+
fs/aio.c:222 [aio]__put_ioctx =_ "__put_ioctx:\040freeing\040%p\012"
309+
fs/aio.c:248 [aio]ioctx_alloc =_ "ENOMEM:\040nr_events\040too\040high\012"
310+
fs/aio.c:1770 [aio]sys_io_cancel =_ "calling\040cancel\012"
311+
312+
Example usage:
313+
314+
// enable the message at line 1603 of file svcsock.c
315+
nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
316+
<debugfs>/dynamic_debug/control
317+
318+
// enable all the messages in file svcsock.c
319+
nullarbor:~ # echo -n 'file svcsock.c +p' >
320+
<debugfs>/dynamic_debug/control
321+
322+
// enable all the messages in the NFS server module
323+
nullarbor:~ # echo -n 'module nfsd +p' >
324+
<debugfs>/dynamic_debug/control
325+
326+
// enable all 12 messages in the function svc_process()
327+
nullarbor:~ # echo -n 'func svc_process +p' >
328+
<debugfs>/dynamic_debug/control
329+
330+
// disable all 12 messages in the function svc_process()
331+
nullarbor:~ # echo -n 'func svc_process -p' >
332+
<debugfs>/dynamic_debug/control
333+
334+
See Documentation/admin-guide/dynamic-debug-howto.rst for additional
335+
information.
336+
```
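The control-file line format quoted in the help text above (`filename:lineno [module]function flags format`) can be parsed with a few lines of userspace C. This is a sketch for illustration only: `struct ddebug_entry`, `parse_control_line()` and `ddebug_enabled()` are hypothetical names, the buffer sizes are arbitrary, and the "enabled" check simply looks for the `p` flag as described in the help text.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of <debugfs>/dynamic_debug/control (field sizes are arbitrary). */
struct ddebug_entry {
    char filename[128];
    unsigned int lineno;
    char module[64];
    char function[128];
    char flags[16];
    char format[256];
};

/* Parses "filename:lineno [module]function flags format". Returns 0 on success. */
static int parse_control_line(const char *line, struct ddebug_entry *e)
{
    int consumed = 0;

    if (line[0] == '#')  /* header or comment line */
        return -1;
    if (sscanf(line, "%127[^:]:%u [%63[^]]]%127s %15s %n",
               e->filename, &e->lineno, e->module,
               e->function, e->flags, &consumed) != 5)
        return -1;
    /* Everything after the flags is the (quoted, escaped) format string. */
    snprintf(e->format, sizeof(e->format), "%s", line + consumed);
    return 0;
}

/* A statement prints when its flags contain 'p' (e.g. "=p"); "=_" means off. */
static bool ddebug_enabled(const struct ddebug_entry *e)
{
    return strchr(e->flags, 'p') != NULL;
}
```

Fed the `fs/aio.c:222` line from the listing above, the parser yields module `aio`, function `__put_ioctx`, and flags `=_`, i.e. a statement that is currently disabled.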
213337
214338
# References
215339
* [Debugging by printing](https://elinux.org/Debugging_by_printing)

kernel/page_cache.md

+88-14
@@ -61,22 +61,22 @@
6161
* include/linux/fs.h
6262
```c
6363
struct address_space {
64-
struct inode *host; /* owner: inode, block_device */
64+
struct inode *host; /* owner: inode, block_device */
6565
struct radix_tree_root page_tree; /* radix tree of all pages */
66-
spinlock_t tree_lock; /* and lock protecting it */
67-
atomic_t i_mmap_writable;/* count VM_SHARED mappings */
68-
struct rb_root i_mmap; /* tree of private and shared mappings */
66+
spinlock_t tree_lock; /* and lock protecting it */
67+
atomic_t i_mmap_writable;/* count VM_SHARED mappings */
68+
struct rb_root i_mmap; /* tree of private and shared mappings */
6969
struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */
7070
/* Protected by tree_lock together with the radix tree */
71-
unsigned long nrpages; /* number of total pages */
71+
unsigned long nrpages; /* number of total pages */
7272
/* number of shadow or DAX exceptional entries */
7373
unsigned long nrexceptional;
74-
pgoff_t writeback_index;/* writeback starts here */
74+
pgoff_t writeback_index;/* writeback starts here */
7575
const struct address_space_operations *a_ops; /* methods */
76-
unsigned long flags; /* error bits/gfp mask */
77-
spinlock_t private_lock; /* for use by the address_space */
76+
unsigned long flags; /* error bits/gfp mask */
77+
spinlock_t private_lock; /* for use by the address_space */
7878
struct list_head private_list; /* ditto */
79-
void *private_data; /* ditto */
79+
void *private_data; /* ditto */
8080
} __attribute__((aligned(sizeof(long))));
8181
```
8282
* `i_mmap` is a priority search tree whose search range covers all shared and *private* mapped pages in the `address_space`.
@@ -150,11 +150,83 @@ struct address_space_operations {
150150

151151
![page_cache_file_mm_mapping.png](pic/page_cache_file_mm_mapping.png)
152152

153-
## Radix Tree
154-
155-
* Before any page I/O the kernel must check whether the page is already in the page cache, so it uses a **radix tree** to look up the wanted page quickly.
156-
* Each `struct address_space` object has its own radix tree.
157-
* The radix tree is the `struct radix_tree_root page_tree` field.
153+
## Radix Tree / Xarray
154+
* Before any page I/O the kernel must check whether the page is already in the page cache, so it uses a **radix tree** to look up the wanted page quickly.
155+
* Each `struct address_space` object has its own radix tree.
156+
* The radix tree is the `struct radix_tree_root page_tree` field.
157+
* As of v4.20, the radix tree has been replaced by the XArray.
158+
* include/linux/xarray.h
159+
```c
160+
/*
161+
* @count is the count of every non-NULL element in the ->slots array
162+
* whether that is a value entry, a retry entry, a user pointer,
163+
* a sibling entry or a pointer to the next level of the tree.
164+
* @nr_values is the count of every element in ->slots which is
165+
* either a value entry or a sibling of a value entry.
166+
*/
167+
struct xa_node {
168+
unsigned char shift; /* Bits remaining in each slot */
169+
unsigned char offset; /* Slot offset in parent */
170+
unsigned char count; /* Total entry count */
171+
unsigned char nr_values; /* Value entry count */
172+
struct xa_node __rcu *parent; /* NULL at top of tree */
173+
struct xarray *array; /* The array we belong to */
174+
union {
175+
struct list_head private_list; /* For tree user */
176+
struct rcu_head rcu_head; /* Used when freeing node */
177+
};
178+
void __rcu *slots[XA_CHUNK_SIZE];
179+
union {
180+
unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS];
181+
unsigned long marks[XA_MAX_MARKS][XA_MARK_LONGS];
182+
};
183+
};
184+
```
185+
* `shift`: the unit of the current node, i.e. the number of index bits remaining below each of its slots
186+
* The lowest two bits of the pointer stored in a slot indicate what kind of entry it is
187+
* `00`: the entry is a data pointer (leaf)
188+
* `10`: the entry points to the next-level node (`01` in older versions)
189+
* `10` in older versions: the entry is an exceptional entry, used to store swap entries for shmem/tmpfs
190+
* include/linux/radix-tree.h
191+
```c
192+
/*
193+
* The bottom two bits of the slot determine how the remaining bits in the
194+
* slot are interpreted:
195+
*
196+
* 00 - data pointer
197+
* 10 - internal entry
198+
* x1 - value entry
199+
*
200+
* The internal entry may be a pointer to the next level in the tree, a
201+
* sibling entry, or an indicator that the entry in this slot has been moved
202+
* to another location in the tree and the lookup should be restarted. While
203+
* NULL fits the 'data pointer' pattern, it means that there is no entry in
204+
* the tree for this index (no matter what level of the tree it is found at).
205+
* This means that storing a NULL entry in the tree is the same as deleting
206+
* the entry from the tree.
207+
*/
208+
#define RADIX_TREE_ENTRY_MASK 3UL
209+
#define RADIX_TREE_INTERNAL_NODE 2UL
210+
211+
static inline bool radix_tree_is_internal_node(void *ptr)
212+
{
213+
return ((unsigned long)ptr & RADIX_TREE_ENTRY_MASK) ==
214+
RADIX_TREE_INTERNAL_NODE;
215+
}
216+
```
217+
* So a node pointer must be converted with `entry_to_node()` before it is used, and the reverse operation is needed before an entry is stored
218+
* lib/radix-tree.c
219+
```c
220+
static inline struct radix_tree_node *entry_to_node(void *ptr)
221+
{
222+
return (void *)((unsigned long)ptr & ~RADIX_TREE_INTERNAL_NODE);
223+
}
224+
225+
static inline void *node_to_entry(void *ptr)
226+
{
227+
return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE);
228+
}
229+
```
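The tagging scheme and the role of `shift` can be tried out in userspace. The sketch below mirrors the kernel helpers quoted above but is not kernel code: `struct node`, `slot_offset()` and the `CHUNK_SHIFT` value of 6 (64 slots per node) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_MASK    3UL  /* bottom two bits, like RADIX_TREE_ENTRY_MASK */
#define INTERNAL_NODE 2UL  /* "10", like RADIX_TREE_INTERNAL_NODE */
#define CHUNK_SHIFT   6    /* assumption: 64 slots per node */
#define CHUNK_MASK    ((1UL << CHUNK_SHIFT) - 1)

/* Simplified stand-in for struct xa_node / struct radix_tree_node. */
struct node {
    unsigned char shift;            /* index bits remaining below each slot */
    void *slots[1 << CHUNK_SHIFT];
};

/* "10" in the bottom two bits marks a pointer to the next tree level. */
static inline bool is_internal_node(void *ptr)
{
    return ((uintptr_t)ptr & ENTRY_MASK) == INTERNAL_NODE;
}

/* Tag a node pointer before storing it in a parent's slot. */
static inline void *node_to_entry(struct node *n)
{
    assert(((uintptr_t)n & ENTRY_MASK) == 0); /* alignment leaves the low bits free */
    return (void *)((uintptr_t)n | INTERNAL_NODE);
}

/* Strip the tag again before dereferencing. */
static inline struct node *entry_to_node(void *entry)
{
    return (struct node *)((uintptr_t)entry & ~(uintptr_t)INTERNAL_NODE);
}

/* Which slot of a node with the given shift covers `index`. */
static inline unsigned int slot_offset(unsigned long index, unsigned int shift)
{
    return (unsigned int)((index >> shift) & CHUNK_MASK);
}
```

The trick relies on node allocations being at least 4-byte aligned, so the two low bits of a real pointer are always zero and can carry the type tag. With 64 slots per node, index `0x1234` selects slot 8 in a node with `shift == 6` and slot 52 (`0x34`) in the leaf node below it.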
158230
159231
# The Buffer Cache
160232
* Individual disk blocks are also stored in the page cache, by way of block I/O buffers.
@@ -223,3 +295,5 @@ laptop_mode | A Boolean value controlling laptop mode. See the following section
223295
* [Linux Storage Cache](https://msreekan.com/2015/04/24/linux-storage-cache/)
224296
* [How Linux Kernel Manages Application Memory](http://techblog.cloudperf.net/2016/07/how-linux-kernel-manages-application_18.html)
225297
* [FREE命令显示的BUFFERS与CACHED的区别 (The difference between BUFFERS and CACHED as shown by the free command)](http://linuxperf.com/?p=32)
298+
* [The XArray data structure - LWN.net](https://lwn.net/Articles/745073/)
299+
* [详解Linux内核Radix树算法的实现 (A detailed look at the Linux kernel's radix tree implementation)](http://sourcelink.top/2019/09/26/linux-kernel-radix-tree-analysis/)

kernel/profiling/perf.md

+1-1
@@ -455,7 +455,7 @@ SYSCALL_DEFINE5(perf_event_open,
455455
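* The **longer** a function runs, the **more likely** it is to be hit by the sampling interrupt, from which we infer that the function (or `pid`, etc.) has a **higher** CPU usage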
456456
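* If a process or function is lucky enough to dodge every point at which you sample, the statistics may be completely wrong. Every sampling-based measurement can run into this problem.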
457457
* How to obtain more accurate sampling results
458-
* Extend the sampling event to collect more samples
458+
* Extend the sampling time to collect more samples
459459
* Raise the sampling frequency, although this adds extra overhead to the system
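
Both points above (hit counts track run time, and an unlucky sampling pattern can miss a function entirely) can be reproduced with a toy model. Everything here is a made-up workload for illustration, not perf code: `running_func()` pretends that in every 10-tick period function 0 runs for 7 ticks and function 1 for 3, and `sample()` is a hypothetical periodic sampler.

```c
/* Hypothetical workload: ticks 0..6 of every 10 run function 0, ticks 7..9 run function 1. */
static int running_func(unsigned long tick)
{
    return (tick % 10) < 7 ? 0 : 1;
}

/*
 * Take one sample every `period` ticks over `total` ticks.
 * hits[i] counts how many samples landed in function i.
 */
static void sample(unsigned long total, unsigned long period,
                   unsigned long hits[2])
{
    hits[0] = 0;
    hits[1] = 0;
    for (unsigned long t = 0; t < total; t += period)
        hits[running_func(t)]++;
}
```

Sampling 1000 ticks with a period of 3 gives 234 hits for function 0 and 100 for function 1, close to the true 70/30 split. With a period of 10 the sampler is in lockstep with the workload: function 0 collects all 100 samples and function 1, despite using 30% of the CPU, is never seen.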
460460
461461
### Time-based profiling is limited by CPU idle

kernel/trace/ftrace-design.md

+60
@@ -1244,6 +1244,66 @@ static __init int init_function_trace(void)
12441244
core_initcall(init_function_trace);
12451245
```
12461246

1247+
## Stack tracing
1248+
* The core function is `__ftrace_trace_stack()`; `__trace_stack()` is a thin wrapper around it
1249+
1250+
### func_stack_trace (function tracer)
1251+
```c
1252+
kernel/trace/trace_functions.c
1253+
function_stack_trace_call()
1254+
kernel/trace/trace.c
1255+
-> trace_function()
1256+
-> __trace_stack()
1257+
-> __ftrace_trace_stack()
1258+
kernel/stacktrace.c
1259+
-> stack_trace_save()
1260+
stack_trace_consume_fn consume_entry = stack_trace_consume_entry;
1261+
arch/x86/kernel/stacktrace.c
1262+
-> arch_stack_walk(consume_entry, &c, current, NULL)
1263+
-> unwind_start()
1264+
-> unwind_done()
1265+
-> unwind_next_frame()
1266+
-> unwind_get_return_address()
1267+
-> consume_entry() => stack_trace_consume_entry()
1268+
```
1269+
### stacktrace (tracepoint)
1270+
```c
1271+
kernel/trace/trace_events_trigger.c
1272+
stacktrace_trigger()
1273+
kernel/trace/trace.c
1274+
-> trace_dump_stack(STACK_SKIP)
1275+
-> __ftrace_trace_stack()
1276+
```
1277+
### function-trace (latency tracer)
1278+
```c
1279+
kernel/trace/trace_irqsoff.c
1280+
tracer_hardirqs_on()/tracer_preempt_on()
1281+
-> stop_critical_timings()
1282+
-> check_critical_timing()
1283+
-> __trace_function()
1284+
-> __trace_stack()
1285+
```
1286+
1287+
### The ftrace_stacks[4] and ftrace_stack[1024] arrays
1288+
```c
1289+
/* Allow 4 levels of nesting: normal, softirq, irq, NMI */
1290+
#define FTRACE_KSTACK_NESTING 4
1291+
1292+
#define FTRACE_KSTACK_ENTRIES (PAGE_SIZE / FTRACE_KSTACK_NESTING)
1293+
/* 1024 entries per level; on a 64-bit system that is 8192 bytes (2 pages) */
1294+
struct ftrace_stack {
1295+
unsigned long calls[FTRACE_KSTACK_ENTRIES];
1296+
};
1297+
/* Each entry stores the address of one calling function, hence the type unsigned long */
1298+
/* 4 levels of nesting are allowed: normal, softirq, irq, NMI; 2 * 4 = 8 pages */
1299+
struct ftrace_stacks {
1300+
struct ftrace_stack stacks[FTRACE_KSTACK_NESTING];
1301+
};
1302+
/* The per-CPU variable ftrace_stacks takes (number of CPUs) x 8 pages */
1303+
static DEFINE_PER_CPU(struct ftrace_stacks, ftrace_stacks);
1304+
static DEFINE_PER_CPU(int, ftrace_stack_reserve);
1305+
```
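The size arithmetic in the comments above can be checked with a small userspace program. The sketch assumes 4 KiB pages and uses `uint64_t` in place of a 64-bit `unsigned long`; `SKETCH_PAGE_SIZE`, `pages_for()` and `total_pages()` are hypothetical helpers, and the struct definitions are copies for sizing purposes only.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE      4096UL  /* assumption: 4 KiB pages */
#define FTRACE_KSTACK_NESTING 4       /* normal, softirq, irq, NMI */
#define FTRACE_KSTACK_ENTRIES (SKETCH_PAGE_SIZE / FTRACE_KSTACK_NESTING)

/* uint64_t stands in for unsigned long on a 64-bit kernel. */
struct ftrace_stack {
    uint64_t calls[FTRACE_KSTACK_ENTRIES];
};

struct ftrace_stacks {
    struct ftrace_stack stacks[FTRACE_KSTACK_NESTING];
};

/* Number of pages needed to hold `bytes`, rounded up. */
static size_t pages_for(size_t bytes)
{
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Pages consumed by the per-CPU ftrace_stacks variable across `ncpus` CPUs. */
static size_t total_pages(unsigned int ncpus)
{
    return (size_t)ncpus * pages_for(sizeof(struct ftrace_stacks));
}
```

Under these assumptions one `struct ftrace_stack` is 1024 entries of 8 bytes, i.e. 8192 bytes or 2 pages, one `struct ftrace_stacks` is 8 pages, and a 4-CPU machine reserves 32 pages in total.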
1306+
12471307
# Reference
12481308
12491309
* [ftrace 简介 (Introduction to ftrace)](https://www.ibm.com/developerworks/cn/linux/l-cn-ftrace/)
