Commit 99ea3bf

docs/debug: add paper about ahci issue in pfSense
1 parent 62e15d9 commit 99ea3bf

1 file changed: +159 -0 lines changed

Diff for: docs/debug/pfsense-ahci-issue.md

@@ -0,0 +1,159 @@
pfSense installation tests
==========================

## Problem description

APU boards running coreboot 4.6.x have problems installing pfSense on hard
disks, and the platform sometimes hangs while running this system. During
installation the kernel log shows AHCI command timeouts:

```
ahcich1: Timeout on slot 4 port 0
ahcich1: is 00000008 cs 00000000 ss 00000000 rs ffffffff tfd 40 serr 00000000 cmd 00406417
(ada0:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 c0 d6 40 26 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Command timeout
(ada0:ahcich1:0:0:0): Retrying command
ahcich1: Timeout on slot 5 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000020 tfd 50 serr 00000000 cmd 00406517
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command
ahcich1: Timeout on slot 6 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00406617
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 7 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000080 tfd 50 serr 00000000 cmd 00406717
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
ada0 at ahcich1 bus 0 scbus1 target 0 lun 0
ada0: <ST1000LM014-SSHD-8GB LVD3> s/n W380YWQN detached
```

After the command timeout, the disk is detached and the installation stops.
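
When many such logs have to be compared across firmware versions, it can help to extract the timeout events mechanically. The following is a minimal Python sketch, assuming the log lines have exactly the shape shown above; the function names `parse_dump` and `parse_timeouts` are hypothetical helpers, not part of any FreeBSD tool. It pairs each `Timeout on slot` line with the register dump that follows it and checks whether the timed-out slot is set in the `rs` mask, which is the pattern visible for slots 5, 6 and 7 in the log.

```python
import re

# Register names exactly as printed in the log; their interpretation is not
# stated by the log itself.
FIELDS = ("is", "cs", "ss", "rs", "tfd", "serr", "cmd")
TIMEOUT_RE = re.compile(r"^(ahcich\d+): Timeout on slot (\d+) port (\d+)$")
PREFIX_RE = re.compile(r"^(ahcich\d+): (.*)$")


def parse_dump(rest):
    """Turn 'is 00000008 cs ...' into {'is': 8, 'cs': 0, ...}, or None if it does not fit."""
    tokens = rest.split()
    if tuple(tokens[0::2]) != FIELDS or len(tokens) != 2 * len(FIELDS):
        return None
    try:
        return {name: int(value, 16) for name, value in zip(tokens[0::2], tokens[1::2])}
    except ValueError:
        return None


def parse_timeouts(log_text):
    """Pair each 'Timeout on slot' line with the register dump that follows it."""
    events, pending = [], None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = TIMEOUT_RE.match(line)
        if m:
            pending = {"chan": m.group(1), "slot": int(m.group(2)), "port": int(m.group(3))}
            continue
        m = PREFIX_RE.match(line)
        regs = parse_dump(m.group(2)) if m else None
        if regs is not None and pending and m.group(1) == pending["chan"]:
            pending["regs"] = regs
            # True when the bit for the timed-out slot is set in the rs mask.
            pending["slot_in_rs"] = bool((regs["rs"] >> pending["slot"]) & 1)
            events.append(pending)
            pending = None
    return events


if __name__ == "__main__":
    sample = """\
ahcich1: Timeout on slot 5 port 0
ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000020 tfd 50 serr 00000000 cmd 00406517
"""
    for event in parse_timeouts(sample):
        print(event["chan"], event["slot"], hex(event["regs"]["tfd"]), event["slot_in_rs"])
```

Lines that do not match either pattern, such as the CAM status messages, are ignored, so a full console capture can be passed in unchanged.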

## Possible reasons

Community reports and our tests indicate that the problem exists only in
coreboot 4.6.x; the legacy version (4.0.x) seems to be unaffected. Dumping the
SATA controller registers at the end of ramstage for both coreboot 4.6.x and
4.0.x reveals slight differences in the register contents. The differences
worth attention are:

1. Watch Dog Control And Status Register (PCI dev 11 fun 0 offset 0x44):

   - Watchdog disabled in 4.6.x
   - Watchdog counter not set properly (reset state) in 4.6.x

2. PHY Core Control 2 Register (PCI dev 11 fun 0 offset 0x84):

   - PHY PLL Dynamic Shutdown enabled (reset state) in 4.6.x (disabled in legacy)

3. HBA Capabilities Register (SATA Memory Mapped AHCI Registers offset 0x0):

   - Command Completion Coalescing Supported bit set (reset state) in 4.6.x
     (cleared in legacy); this bit is read-only, so its state depends on AGESA
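
A comparison like the one above can be reproduced by XOR-ing the two dumps register by register. The sketch below is only an illustration in Python: `differing_bits` and `describe` are hypothetical helper names, and the example values are made up (all other bits zeroed), not taken from the real dumps. Bit 7 is used because, to my understanding of the AHCI specification, that is where the CCC Supported flag sits in the HBA Capabilities register.

```python
def differing_bits(legacy, mainline, width=32):
    """Return the bit positions where two register values differ, lowest first."""
    delta = (legacy ^ mainline) & ((1 << width) - 1)
    return [bit for bit in range(width) if (delta >> bit) & 1]


def describe(name, legacy, mainline, width=32):
    """Produce one human-readable line per differing bit."""
    return [
        f"{name} bit {bit}: legacy={(legacy >> bit) & 1} 4.6.x={(mainline >> bit) & 1}"
        for bit in differing_bits(legacy, mainline, width)
    ]


if __name__ == "__main__":
    # Illustrative values only: NOT the real register contents of either firmware.
    for line in describe("CAP", legacy=0x00000000, mainline=0x00000080):
        print(line)
```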

Even when these differences are eliminated, the problem still occurs. This
suggests that the AGESA code that was ported from `3rdparty/blobs` to
`src/vendorcode` does not behave exactly as it does in legacy.

Checking the disk with the `smartctl` command does not give any clue either:

```
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 3920 hours (163 days + 8 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 00 03 00 00 40 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 00 00 00 00 00 00 ff 01:56:32.276 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 01:56:26.955 NOP [Abort queued commands]
ea 00 00 00 00 00 a0 00 01:56:22.973 FLUSH CACHE EXT
61 00 08 ff ff ff 4f 00 01:56:22.973 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 01:56:22.962 FLUSH CACHE EXT
```
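
The `ER` and `ST` bytes in the log can be decoded by hand. The Python sketch below shows one way to do it, assuming the classic ATA bit assignments as I understand them (only a subset of bits is listed, and the tables and the `decode` helper are illustrative, not taken from `smartctl`). For `ST = 71` and `ER = 04` it yields the device-fault and abort flags that `smartctl` prints as `Device Fault; Error: ABRT`.

```python
# Subset of ATA status register bits (assumed classic layout).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
# Subset of ATA error register bits (assumed classic layout).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode(value, names):
    """Return the labels of all known bits set in value, highest bit first."""
    return [label for mask, label in sorted(names.items(), reverse=True) if value & mask]


if __name__ == "__main__":
    print("ST 0x71:", decode(0x71, STATUS_BITS))
    print("ER 0x04:", decode(0x04, ERROR_BITS))
```

Bits that are not in the tables are silently ignored, so the output is a lower bound on what the drive reported rather than a full decode.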

Digging through the FreeBSD forums gave me a hint that the migration from kernel
10.x to 11.x, which takes place between pfSense versions 2.3.x and 2.4.x, caused
many problems with hard disks. There were major changes to AHCI, and many users
complained about the same issue described in this paper. I have read that
customizing the installation may solve this issue.

## Solution and tests

I have found several possible solutions on the FreeBSD forums:

- change the power saving policy for AHCI: `hint.ahcich.x.pm_level="y"`
  (x - channel, y - level [0-5])
- disable ATA DMA: `hint.ata.0.mode=PIO4`
- disable Message Signaled Interrupts (MSI) for the SATA controller:
  `hint.ahci.x.msi="0"` (x - SATA controller)

I have tested several BIOS versions: 4.0.11, 4.0.14, 4.6.1 and 4.6.4. I used
the SATA port available on the board and a Seagate HDD:

```
Model Family: Seagate Laptop SSHD
Device Model: ST1000LM014-SSHD-8GB
Serial Number: W380YWQN
LU WWN Device Id: 5 000c50 06e82fb73
Firmware Version: LVD3
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Feb 7 11:06:32 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
```

The 4.0.x versions did not need any modifications. After performing over 15
installations, no error occurred.

Problems appeared only in the 4.6.x versions.

| BIOS version | clean | PM level 0 | DMA disabled | MSI disabled |
|--------------|-------|------------|--------------|--------------|
| v4.6.1       | FAIL  | FAIL       | FAIL         | PASS         |
| v4.6.4       | FAIL  | FAIL       | FAIL         | PASS         |

`PASS` - over 15 installations finished without errors

As the name suggests, each modification is a hint telling the installer not to
use the given feature. Tests show that when the installer does not use MSI, the
installation completes without errors. In the other cases the installation
fails after 0-5 successful installations in a row.

According to answers I found on the FreeBSD forums, signal races occur and lead
to timeouts on disk operations. Disabling MSI seems to solve this problem.

The same solution can be applied to the installed system. Appending
`hint.ahci.0.msi="0"` to `/boot/loader.conf.local` should prevent the system
hang.
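
If the hint has to be applied on several installed systems, the append can be scripted so that running it twice does not duplicate the line. The following is a minimal Python sketch under stated assumptions: `ensure_loader_hint` is a hypothetical helper, it only checks for an identical line, and it does not detect an existing entry for the same key with a different value.

```python
from pathlib import Path


def ensure_loader_hint(path, key="hint.ahci.0.msi", value="0"):
    """Append key="value" to a loader config file unless that exact line exists.

    Returns True if the file was modified, False if the line was already there.
    Note: an existing line with the same key but another value is left untouched.
    """
    wanted = f'{key}="{value}"'
    conf = Path(path)
    existing = conf.read_text().splitlines() if conf.exists() else []
    if any(line.strip() == wanted for line in existing):
        return False
    conf.write_text("\n".join(existing + [wanted]) + "\n")
    return True


# Example call on an installed system (requires root, so it is left commented out):
# ensure_loader_hint("/boot/loader.conf.local")
```

The path is passed as a parameter so the function can be tried against a scratch file before touching `/boot/loader.conf.local`.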
