|
| 1 | +pfSense installation tests |
| 2 | +========================== |
| 3 | + |
| 4 | +## Problem description |
| 5 | + |
| 6 | +APU boards running coreboot 4.6.x have problems installing pfSense on hard |
| 7 | +disks, and the platform sometimes hangs while running the installed system. |
| 8 | + |
| 9 | +``` |
| 10 | +ahcich1: Timeout on slot 4 port 0 |
| 11 | +ahcich1: is 00000008 cs 00000000 ss 00000000 rs ffffffff tfd 40 serr 00000000 cmd 00406417 |
| 12 | +(ada0:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 c0 d6 40 26 00 00 00 00 00 |
| 13 | +(ada0:ahcich1:0:0:0): CAM status: Command timeout |
| 14 | +(ada0:ahcich1:0:0:0): Retrying command |
| 15 | +ahcich1: Timeout on slot 5 port 0 |
| 16 | +ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000020 tfd 50 serr 00000000 cmd 00406517 |
| 17 | +(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 |
| 18 | +(aprobe0:ahcich1:0:0:0): CAM status: Command timeout |
| 19 | +(aprobe0:ahcich1:0:0:0): Retrying command |
| 20 | +ahcich1: Timeout on slot 6 port 0 |
| 21 | +ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000040 tfd 50 serr 00000000 cmd 00406617 |
| 22 | +(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 |
| 23 | +(aprobe0:ahcich1:0:0:0): CAM status: Command timeout |
| 24 | +(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted |
| 25 | +ahcich1: Timeout on slot 7 port 0 |
| 26 | +ahcich1: is 00000002 cs 00000000 ss 00000000 rs 00000080 tfd 50 serr 00000000 cmd 00406717 |
| 27 | +(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 |
| 28 | +(aprobe0:ahcich1:0:0:0): CAM status: Command timeout |
| 29 | +(aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked |
| 30 | +ada0 at ahcich1 bus 0 scbus1 target 0 lun 0 |
| 31 | +ada0: <ST1000LM014-SSHD-8GB LVD3> s/n W380YWQN detached |
| 32 | +``` |
| 33 | + |
| 34 | +After the command timeouts, the disk is detached and the installation stops. |
| 35 | + |
| 36 | + |
| 37 | +## Possible reasons |
| 38 | + |
| 39 | +Community reports and tests indicate that the problem exists only in coreboot |
| 40 | +4.6.x. The legacy version (4.0.x) seems to be unaffected. Dumping the SATA |
| 41 | +controller registers at the end of ramstage for both coreboot 4.6.x and 4.0.x |
| 42 | +shows slight differences in register contents. The differences worth attention |
| 43 | +are: |
| 44 | + |
| 45 | +1. Watch Dog Control And Status Register (PCI dev 11 fun 0 offset 0x44): |
| 46 | + |
| 47 | + - Watchdog disabled in 4.6.x |
| 48 | + - Watchdog counter not set properly (reset state) in 4.6.x |
| 49 | + |
| 50 | +2. PHY Core Control 2 Register (PCI dev 11 fun 0 offset 0x84): |
| 51 | + |
| 52 | + - PHY PLL Dynamic Shutdown enabled (reset state) in 4.6.x (disabled in legacy) |
| 53 | + |
| 54 | +3. HBA Capabilities Register (SATA Memory Mapped AHCI Registers offset 0x0): |
| 55 | + |
| 56 | + - Command Completion Coalescing Supported bit set (reset state) in 4.6.x |
| 57 | + (cleared in legacy); this bit is read-only, so its state depends on AGESA |
| 58 | + |
| 59 | +Even with these differences eliminated, the problem still occurs. This |
| 60 | +suggests that the AGESA code that was ported from `3rdparty/blobs` to |
| 61 | +`src/vendorcode` does not behave exactly as it does in legacy. |
| 62 | + |
| 63 | +Checking the disk with the `smartctl` command does not give any clue either: |
| 64 | + |
| 65 | +``` |
| 66 | +SMART Error Log Version: 1 |
| 67 | +ATA Error Count: 1 |
| 68 | + CR = Command Register [HEX] |
| 69 | + FR = Features Register [HEX] |
| 70 | + SC = Sector Count Register [HEX] |
| 71 | + SN = Sector Number Register [HEX] |
| 72 | + CL = Cylinder Low Register [HEX] |
| 73 | + CH = Cylinder High Register [HEX] |
| 74 | + DH = Device/Head Register [HEX] |
| 75 | + DC = Device Command Register [HEX] |
| 76 | + ER = Error register [HEX] |
| 77 | + ST = Status register [HEX] |
| 78 | +Powered_Up_Time is measured from power on, and printed as |
| 79 | +DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, |
| 80 | +SS=sec, and sss=millisec. It "wraps" after 49.710 days. |
| 81 | + |
| 82 | +Error 1 occurred at disk power-on lifetime: 3920 hours (163 days + 8 hours) |
| 83 | + When the command that caused the error occurred, the device was in an unknown state. |
| 84 | + |
| 85 | + After command completion occurred, registers were: |
| 86 | + ER ST SC SN CL CH DH |
| 87 | + -- -- -- -- -- -- -- |
| 88 | + 04 71 00 03 00 00 40 Device Fault; Error: ABRT |
| 89 | + |
| 90 | + Commands leading to the command that caused the error were: |
| 91 | + CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name |
| 92 | + -- -- -- -- -- -- -- -- ---------------- -------------------- |
| 93 | + 00 00 00 00 00 00 00 ff 01:56:32.276 NOP [Abort queued commands] |
| 94 | + 00 00 00 00 00 00 00 ff 01:56:26.955 NOP [Abort queued commands] |
| 95 | + ea 00 00 00 00 00 a0 00 01:56:22.973 FLUSH CACHE EXT |
| 96 | + 61 00 08 ff ff ff 4f 00 01:56:22.973 WRITE FPDMA QUEUED |
| 97 | + ea 00 00 00 00 00 a0 00 01:56:22.962 FLUSH CACHE EXT |
| 98 | +``` |
| 99 | + |
| 100 | +Digging through the FreeBSD forums gave me a hint that the migration from |
| 101 | +FreeBSD kernel 10.x to 11.x, which took place between pfSense versions 2.3.x |
| 102 | +and 2.4.x, caused many problems with hard disks. There were major changes to |
| 103 | +AHCI, and many users complained about the same issue described in this |
| 104 | +document. I have read that customizing the installation may solve this issue. |
| 105 | + |
| 106 | +## Solution and tests |
| 107 | + |
| 108 | +I have found several possible solutions on the FreeBSD forums: |
| 109 | + |
| 110 | +- change the power saving policy for AHCI: `hint.ahcich.x.pm_level="y"` |
| 111 | + (x - channel, y - level [0-5]) |
| 112 | +- disable ATA DMA: `hint.ata.0.mode=PIO4` |
| 113 | +- disable Message Signaled Interrupts (MSI) for ATA: `hint.ahci.x.msi="0"` |
| 114 | + (x - SATA controller) |
| 115 | + |
| 116 | +I have tested a few BIOS versions: 4.0.11, 4.0.14, 4.6.1, and 4.6.4. I used |
| 117 | +the SATA port available on the board and a Seagate HDD: |
| 118 | + |
| 119 | +``` |
| 120 | +Model Family: Seagate Laptop SSHD |
| 121 | +Device Model: ST1000LM014-SSHD-8GB |
| 122 | +Serial Number: W380YWQN |
| 123 | +LU WWN Device Id: 5 000c50 06e82fb73 |
| 124 | +Firmware Version: LVD3 |
| 125 | +User Capacity: 1,000,204,886,016 bytes [1.00 TB] |
| 126 | +Sector Sizes: 512 bytes logical, 4096 bytes physical |
| 127 | +Rotation Rate: 5400 rpm |
| 128 | +Form Factor: 2.5 inches |
| 129 | +Device is: In smartctl database [for details use: -P show] |
| 130 | +ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b |
| 131 | +SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) |
| 132 | +Local Time is: Wed Feb 7 11:06:32 2018 GMT |
| 133 | +SMART support is: Available - device has SMART capability. |
| 134 | +SMART support is: Enabled |
| 135 | +``` |
| 136 | + |
| 137 | + |
| 138 | +The 4.0.x versions did not need any modifications. After performing over 15 |
| 139 | +installations, no error occurred. |
| 140 | + |
| 141 | +Problems only appeared in 4.6.x versions. |
| 142 | + |
| 143 | +|BIOS version|clean|PM level 0|DMA disabled|MSI disabled| |
| 144 | +|------------|-----|----------|------------|------------| |
| 145 | +| v4.6.1 | FAIL| FAIL | FAIL | PASS | |
| 146 | +| v4.6.4 | FAIL| FAIL | FAIL | PASS | |
| 147 | + |
| 148 | +`PASS` - over 15 installations finished without errors |
| 149 | + |
| 150 | +As the name of each modification says, it is a hint for the installer not to |
| 151 | +use the given feature. Tests show that when the installer does not use MSI, |
| 152 | +the installation completes without errors. In the other cases, the |
| 153 | +installation fails after 0-5 successful installations in a row. |
| 154 | + |
| 155 | +According to answers on the FreeBSD forums, signal races occur and lead to |
| 156 | +timeouts on disk operations. Disabling MSI seems to solve this problem. |
| 157 | + |
| 158 | +The same solution can be applied to the installed system. Appending |
| 159 | +`hint.ahci.0.msi="0"` to `/boot/loader.conf.local` should prevent the system from hanging. |
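The append step can be sketched as a small shell function. This is only an illustration under stated assumptions: `add_loader_hint` is a hypothetical helper name (not a pfSense or FreeBSD tool), and the controller index `0` in the usage note is taken from the text above and may differ on other setups. The function writes the hint only if the exact line is not already present, so running it repeatedly does not duplicate the entry.

```shell
#!/bin/sh
# Sketch: idempotently add a loader hint line to a loader configuration file.
# add_loader_hint is a hypothetical helper, not part of pfSense or FreeBSD.
add_loader_hint() {
    hint="$1"                              # e.g. hint.ahci.0.msi="0"
    conf="${2:-/boot/loader.conf.local}"   # target file, defaults to the path from the text

    # Nothing to do when the exact line already exists
    # (-x: whole-line match, -F: fixed string, -q: quiet).
    if grep -qxF -- "$hint" "$conf" 2>/dev/null; then
        return 0
    fi

    # If the file is non-empty and does not end with a newline, add one first
    # so the hint does not get glued onto the last existing line.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi

    printf '%s\n' "$hint" >> "$conf"
}
```

Run as root, a call such as `add_loader_hint 'hint.ahci.0.msi="0"'` would add the line to `/boot/loader.conf.local`; the hint is then expected to take effect on the next boot.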