Program full sectors, longer read/write timeouts #52
Either the TinyProg-BootLoader state machines, or the SPI flash chip used on the TinyFPGA BX, seem unable to *properly* program a partial minor sector (ie, writing less than 256 octets), leading to a failed write/verify cycle. This seems to happen at the end of any programming that is not an exact multiple of 256 octets. Since the whole 256 octets is already erased in program_sectors(), we pad out the short write to a full 256 octets (with 0xff, which is the value read for "erased", to reduce wear on the flash cells). For more detail on diagnosing this, see: timvideos/litex-buildenv#137
"pip-wheel-metadata" is created alongside setup.py, by "pip install ." (used for development). See pypa/pip#6213 for discussion of this clutter (issue currently open; might be moved to another location, eg build/pip-wheel-metadata or .pip-wheel-metadata, in a later version).
Double the read timeout to reduce risk of short reads causing verify errors. Substantially increase the write timeout to try to reduce the risk of an incomplete write, or a write being abandoned when it was nearly finished due to a minor SPI flash delay.
What happens if you start a write not on a sector boundary?
@mithro According to the flash chip documentation, with program sectors, it'll "wrap around" within the 256 octet (minor) sector (start of page 15 of the datasheet), ie that method only adjusts certain low address values. I think the way that program_sectors() is used (called by program_bitstream(), called by various places in perform_bootloader_update() and main()) probably means that it's always called with a starting address on a 256 octet boundary. But you're right, we should probably (a) check that the start address is actually a 256-byte sector boundary (and fail out at the beginning if it is not), and (b) adjust the padding calculation so it can't wrap around in that case. Ewen
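A minimal sketch of the wrap-around behaviour described in this comment, assuming that a single page-program command only increments the low 8 address bits and wraps within the 256 octet sector. The function names `page_program_addresses` and `is_page_aligned` are hypothetical, used only for illustration; they are not part of tinyprog, and the model has not been checked against the datasheet here.

```python
PAGE_SIZE = 256  # minor sector (page) size discussed in this thread


def page_program_addresses(start_addr, length, page_size=PAGE_SIZE):
    """List the addresses one page-program command would touch.

    Assumption: only the low address bits increment, so a write that
    runs past the end of the page wraps back to the start of that page.
    """
    offset = start_addr % page_size
    page_base = start_addr - offset
    return [page_base + ((offset + i) % page_size) for i in range(length)]


def is_page_aligned(addr, page_size=PAGE_SIZE):
    """True if addr sits on a minor sector boundary."""
    return addr % page_size == 0


if __name__ == "__main__":
    # Writing 32 octets starting 16 octets before the end of a page:
    touched = page_program_addresses(0x280F0, 32)
    print(hex(touched[15]), hex(touched[16]))
```

Under this model, the second half of that 32 octet write lands at the start of the same page rather than in the next one, which is why check (a) and adjustment (b) above matter.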
I've added both of those. I believe the only regularly used programming locations are 0x28000 (start of user image) and 0x00000 (start of flash, for the bootloader), but it'll now throw an exception if someone tries to program somewhere else that is legal but not a multiple of 256. I also changed the code to calculate the offset from the start of the 256 octet sector boundary in the case where it's going to try to do a short write, and avoid wrapping in that case. (But in practice I think the only way you could trigger that behaviour and get sane results would be to be writing less than 256 octets total at an offset that is not a multiple of 256; which is now no longer supported.) Ewen PS: The programming offsets are shown in hex, but I only realised after testing that PPS: The "you can't do that" stack trace is ugly, but in theory should "never happen", so I've left it as just a safety check.
Ah, it looks like I'll repush a version that just adjusts the padding to reach the end of the 256 octet offset but not overrun it, and only apply if the original start address was a multiple of the minor sector size. Ie, only apply in the case where we were having problems. Ewen
To avoid wrap around when a write starts part way into a 256 octet sector, only pad to a full 256 octet sector if we are writing from the beginning of the sector (and thus writing a full 256 octets will not wrap around). We also write 0xff for safety, as in most SPI flash that is the erased value, and effectively one can only write 0 bits, so writing twice to a cell (without erasing) writes (existing value AND new value), which for a new value of 0xff should be the existing value.
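The padding rule and the "existing AND new" argument in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in the commit: `pad_short_write` and `flash_program_result` are hypothetical names, and the AND model is the typical NOR flash behaviour the message describes rather than something verified on this chip.

```python
PAGE_SIZE = 256   # minor sector size
ERASED = 0xFF     # value read back from erased cells


def pad_short_write(addr, data, page_size=PAGE_SIZE):
    """Pad a short write to a full page, but only when addr is page aligned.

    Unaligned writes are returned unchanged so that padding can never
    push the write past the end of the page and wrap around.
    """
    if addr % page_size == 0 and 0 < len(data) < page_size:
        return data + bytes([ERASED]) * (page_size - len(data))
    return data


def flash_program_result(existing, new):
    """Model programming without erase: bits can only go from 1 to 0."""
    return bytes(e & n for e, n in zip(existing, new))


if __name__ == "__main__":
    padded = pad_short_write(0x28000, bytes(224))
    print(len(padded))
    # Writing 0xff over already-programmed cells leaves them untouched.
    print(flash_program_result(b"\x5a\x00", b"\xff\xff") == b"\x5a\x00")
```

The second function shows why 0xff is a safe padding value: under the AND model, programming 0xff over any existing value returns that existing value.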
6a81e58 to 0b573f3
Now the special padding work around will only happen if we're aligned, which should be safe. But we don't try to force programming at aligned addresses. (I'm really not convinced that programming starting at unaligned addresses via Still seems to solve the FIRMWARE=none problem. While testing this I noticed that our MicroPython image is actually not a multiple of 256 octets (it's 4 bytes longer), nor is our stub image, yet both have always worked (give or take timeouts):
So there's apparently something Yet More Magical (tm) about the 224 byte value of But: So "this seems to work in practice", and I think at this point it shouldn't break anything in practice. But if someone else wants to figure out why writing a 224 byte partial sector is the thing that fails (with weird data wrap around being observed -- see timvideos/litex-buildenv#137) then feel free to do that instead. Ewen PS: Either way I think we should increase the read/write timeouts, which means if we're not going to merge this pull request we should probably cherry pick that change and the
Joy. I was doing a final reprogram of MicroPython into the board before I put it away, and finally triggered a timeout. It seems there's something where even 5 seconds isn't enough and/or the bootloader/SPI state machine gets itself wedged.
It worked fine when I did another "make image-flash" to program it a second time. So there might well be value in extracting out the "try programming each sector multiple times" logic from #16 as well as these timeout changes. Especially for BootLoader reprogramming (which unfortunately gets recommended to basically every new user, as a bunch of TinyFPGA BX boards shipped with an incorrectly programmed bootloader :-( ) Ewen
@ewenmcneill Maybe we should have very large timeouts on programming the bootloader?
@mithro: That's a timeout on writing a single 256 octet minor sector! 5 seconds is a long timeout for writing 256 octets (bytes). Honestly even 1 second is a long timeout: that's roughly one octet every 4ms. Even over 9600 bps serial that ought to be enough time, and it looks like the serial runs around 38400 bps in practice. 5 seconds/256 octets allows around 19ms per octet written. So I don't think it's the serial transfer itself that's the limiting factor.

5 seconds to write 256 octets is also a very long time in the context of writing to SPI flash. The write speed there is typically in the many hundreds of Kbps, if not Mbps (give or take how quickly they're clocked into the SPI flash). So I expect we could set, eg, a 30 second timeout and still hit it in the same conditions as the 5 second / 1 second write timeout.

What still hitting a write timeout sometimes, even at 5 seconds, says to me is that there's a race condition in the USB / serial / TinyFPGA-Bootloader / SPI Flash interaction (and they're all basically chained together :-( ) which sometimes causes it to stop clocking through additional serial bytes, causing the write to backlog indefinitely and thus hit the really long timeout. No idea if that's on the USB/serial to BootLoader side or the BootLoader to SPI Flash side, or somewhere in between.

Short of debugging the TinyFPGA-Bootloader to eliminate all possible race conditions (formal methods?) the next best option seems to be to retry the sector write instead of immediately bailing out with "oh, it didn't work, hope that didn't brick your board", which is the traditional/existing Ewen
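The per-octet arithmetic in this comment can be checked with a few lines. This is only a worked calculation; the helper names are hypothetical, and the 10 bits per octet figure assumes classic 8N1 serial framing (one start bit, eight data bits, one stop bit), which may not match how the USB serial link actually behaves.

```python
def ms_per_octet(timeout_s, octets=256):
    """Time budget per octet, in milliseconds, for a given write timeout."""
    return timeout_s * 1000.0 / octets


def serial_transfer_ms(octets, bps, bits_per_octet=10):
    """Time to push `octets` over a serial line, assuming 8N1 framing."""
    return octets * bits_per_octet * 1000.0 / bps


if __name__ == "__main__":
    print(ms_per_octet(1.0))   # about 3.9 ms per octet
    print(ms_per_octet(5.0))   # about 19.5 ms per octet
    print(serial_transfer_ms(256, 9600))
    print(serial_transfer_ms(256, 38400))
```

Under those assumptions, transferring one 256 octet sector takes well under a second even at 9600 bps, which is consistent with the argument that the serial transfer is not the limiting factor.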
I think retry is a good idea. My guess is the timeout could occur because of issues in things like USB forwarding when using a VM.
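A rough sketch of what "retry the sector write" could look like. This is not the logic from #16: `write_with_retry` and the `write_page` callable are hypothetical, and `TimeoutError` stands in for whatever exception the real serial layer raises on a write timeout.

```python
def write_with_retry(write_page, addr, data, attempts=3):
    """Call write_page(addr, data), retrying on timeout.

    Returns the attempt number that succeeded. Re-raises the last
    timeout if every attempt fails, so the caller still sees the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            write_page(addr, data)
            return attempt
        except TimeoutError as exc:  # stand-in for the real exception type
            last_error = exc
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky_write(addr, data):
        # Simulated transport that times out on the first two attempts.
        calls.append(addr)
        if len(calls) < 3:
            raise TimeoutError("simulated write timeout")

    print(write_with_retry(flaky_write, 0x28000, b"\xff" * 256))
```

A real version would probably also need to re-erase or re-verify the sector between attempts, since a timed-out write may have partially programmed it; that detail is left out here.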
FTR, I'm not using a VM. Ubuntu 16.04 LTS on Dell XPS9360, direct on the hardware. And still got the write timeout above (and a few others, with the same combination in the past). Same TinyFPGA BX, USB cable, USB port, laptop, etc, all work properly the other 95%+ of the time. Hence thinking there's an edge condition in the TinyFPGA Bootloader verilog (which AFAICT just got developed until it "almost always" worked). The retry logic is in #16. Feel free to pull out the relevant bits and rework that and merge it too. (I have many, many, other things on my todo list that need doing, so if it waits for me to pull out/rework bits of #16 it may well be Some Time (tm).) Ewen
FTR, this commit from Trammel Hudson (in a tree mentioned on Twitter) might be worth considering as well: osresearch@4367218. It skips erase/write on identical 4k sectors, and skips write/read back on 0xff filled 256-octet mini sectors (since 0xff is the erased value). Having merged this pull request it'll need a tiny bit of tweaking to insert in the right place, but the individual code blocks seem sensible to me. Ewen |
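The optimisation described in this comment (skip unchanged 4k sectors, skip 0xff filled 256 octet mini sectors) could be sketched roughly as below. This is an illustration of the idea only, not the code in the referenced commit; `plan_sector` is a hypothetical name.

```python
SECTOR_SIZE = 4096  # erase unit
PAGE_SIZE = 256     # program unit (mini sector)


def plan_sector(existing, new, page_size=PAGE_SIZE):
    """Decide what needs doing for one 4k sector.

    Returns None if the sector already holds the new data (no erase or
    write needed). Otherwise returns the page offsets that must be
    programmed after an erase; pages that are entirely 0xff are skipped
    because an erased page already reads back as 0xff.
    """
    if existing == new:
        return None
    offsets = []
    for off in range(0, len(new), page_size):
        page = new[off:off + page_size]
        if page != b"\xff" * len(page):
            offsets.append(off)
    return offsets


if __name__ == "__main__":
    old = b"\xff" * SECTOR_SIZE
    image = bytes(PAGE_SIZE) + b"\xff" * (SECTOR_SIZE - PAGE_SIZE)
    print(plan_sector(old, image))    # only the first page needs writing
    print(plan_sector(image, image))  # identical sector: nothing to do
```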
@ewenmcneill - I think that commit is merged now?
It's not merged yet. But it is in #54, so if that gets merged we'll pick it up. (Otherwise I think we probably want that erase/write optimisation logic anyway.) Ewen
@ewenmcneill - This pull request is merged now....
As diagnosed in timvideos/litex-buildenv#137, there seems to be an issue with programming the TinyFPGA BX SPI flash with less than 256 octets (minor sector size). In theory the SPI flash supports this (see datasheet pages 14-16), but in practice it always seems to fail. It's unclear whether this is caused by a state machine issue in the TinyFPGA-Bootloader verilog, or some bug or unclear documentation in the SPI flash. But since we've already erased the entire 256 octets (and on any subsequent write, we'll also erase the entire 256 octets) the simplest solution is to pad all minor sector writes out to a full 256 octets (with 0xff, which is what is read back after erase, as that hopefully saves one cycle of wear on the "unused" flash cells).
Also double the read timeouts (was 1.0 seconds, now 2.0 seconds), because the only way that pySerial will return a short read is if the readTimeout is hit, and there are multiple reports of users getting short reads and verify errors (and thus programming failures; some bricking their boards). And substantially increase the writeTimeout (was 1.0 seconds, now 5.0 seconds) because tinyprog gives up immediately if there is a writeTimeout, which could lead to a partially programmed board (possibly one that is bricked), so it seems worth being more patient. (IIRC there's another pull request around that retries the sector write if it times out/fails, but ISTR it's buried in a gigantic pull request that adds a whole other board, etc, and is against 6-12 month old code, so it is not easily merged directly.)
Seems to work for me, both images that are a multiple of 256 octets, and images that are not a multiple of 256 octets. (See timvideos/litex-buildenv#137 for the testing to demonstrate there was a problem; I re-ran the tests in timvideos/litex-buildenv#137 (comment) to check this version was working.)
Programming output:
And the serial output (note that if you program FIRMWARE=none it just writes to the gateware section of the flash, which pretty much guarantees it'll boot whatever firmware was written previously, thus in my case it booted MicroPython; see further down for writing a different firmware and then re-writing MicroPython back again):