“I cannot help but speculate on how the software on the Curiosity rover has been constructed. We know that most of the code is written in C and that it comprises 2.5 Megalines of code, roughly. One may wonder why it is possible to write such a complex system and have it work. This is the Erlang programmer’s view.”
Having strict quality guidelines and isolating systems as much as possible is great.
However, my concern is the recent “brain transplant”. Why are they doing a complete OTA software update right away? Isn’t it too risky?
I see it as somewhat less risky – for the landing (and post-landing checkout) they used more specialised, more basic software.
Now the rover will start using the more complex main mission software – software that was uploaded to it fairly recently, while the probe was en route, so presumably with more time for debugging. (Plus, I imagine that software update procedures are among the best tested, and that they’re doing it on one of the two redundant computers at a time.)
In another update: more pictures, and in colour! http://commons.wikimedia.org/wiki/Category:Photos_by_the_Curiosity_… (and I guess this Commons page will have a decent selection in the future – bookmarked)
That’s what you get when you release the beta version as the point-zero release and the stable version as the point-one release…
Complete reloads of software on spacecraft have been done since the ’70s. While it can be dangerous, the benefits outweigh the risks.
Most likely they have several images of the code in NV memory that can be activated by a restart. At least one of the versions will be a “safe” version with minimal functionality beyond communicating with Earth, running diagnostics, and loading additional images.
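A minimal sketch of how such image selection could work, in C – all the names, the header layout, and the CRC scheme here are my own assumptions for illustration, not anything JPL has published:

#include <stdint.h>

#define IMAGE_COUNT 3
#define SAFE_IMAGE  0   /* minimal image: comms, diagnostics, loader */

struct image_header {
    uint32_t length;      /* payload length in bytes, stored at upload time */
    uint32_t crc32;       /* checksum of the payload, stored at upload time */
    uint32_t entry_point; /* address to jump to once the image is accepted  */
};

/* Image headers live at fixed, known addresses in non-volatile memory;
 * the payload is assumed to follow its header directly. */
extern const struct image_header *nv_images[IMAGE_COUNT];

extern uint32_t crc32_of(const uint8_t *data, uint32_t len);
extern void     jump_to(uint32_t entry_point);

/* 'preferred' is set by ground command before a restart. If the
 * requested image is out of range or fails its checksum, fall back
 * to the safe image instead of refusing to boot. */
void boot(uint32_t preferred)
{
    const struct image_header *h;

    if (preferred >= IMAGE_COUNT)
        preferred = SAFE_IMAGE;

    h = nv_images[preferred];
    if (crc32_of((const uint8_t *)(h + 1), h->length) != h->crc32)
        h = nv_images[SAFE_IMAGE];   /* the safe image is kept intact */

    jump_to(h->entry_point);
}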
The Voyager 2 spacecraft was reprogrammed more than a decade after launch because the flight SW was not designed to operate in the Uranus/Neptune environments. The tape drives had to be used to “spin” the spacecraft during close approach to Uranus and Neptune to pan the camera, so the images would not be blurred by the high relative motion and the long exposure times needed in the low-light environment.
You’re right. In fact it was considered such a big deal that it was covered by the evening newscasts at the time in the US.
I don’t know how things are done on Curiosity, but I know a bit about critical software in aircraft.
– The operating systems are deliberately crippled and very static. Task scheduling is fixed, with pre-defined deadlines (see the sketch at the end of this list). For the hardcore stuff (flight controls, for example), equipment is single-purpose and there is often no “real” operating system, just a scheduler.
The ARINC 653 standard for operating systems is now used more and more for “modular avionics”: multi-purpose onboard computers.
– Safety is usually obtained by combining redundancy and dissimilarity: redundancy by using several identical units for fault tolerance, dissimilarity by selecting different components and software (possibly even different programming languages) to avoid systematic issues.
For example, the Boeing 777’s “triple-triple” arrangement: three primary computers, each based on three dissimilar CPUs: Intel 486, AMD 29K, and Motorola 68040.
– The 2.5M LOC figure can be misleading, as often C is not the original programming language (except for special parts such as system control and drivers) but an intermediate format generated by higher-level tools (SCADE or Matlab, for example). C compilers are often the most mature, and proven, reliable compilers exist for subsets of the C language. Optimisers are used with a lot of caution.
The auto-generated C code is (deliberately) very dumb and basic (no pointers, global variables at fixed memory addresses…)
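To make the scheduling and code-style points concrete, here is a rough sketch of the flavour such code tends to have – a fixed-schedule cyclic executive with statically allocated globals, no pointers, and no dynamic memory. Every name and rate here is invented for illustration:

#include <stdint.h>

/* All state is statically allocated; in generated code, globals like
 * these are typically pinned to fixed addresses via the linker script. */
static int16_t sensor_raw;
static int16_t sensor_filtered;
static int16_t actuator_cmd;

static void read_sensor(void)    { /* sample the ADC into sensor_raw */ }
static void filter(void)         { sensor_filtered = (int16_t)((sensor_filtered + sensor_raw) / 2); }
static void control_law(void)    { actuator_cmd = (int16_t)(-sensor_filtered); }
static void write_actuator(void) { /* drive the output from actuator_cmd */ }

extern void wait_for_tick(void); /* blocks until the next 10 ms timer tick */

/* A 10 ms minor frame and a 40 ms major frame: every task runs in a
 * fixed slot, so worst-case timing is known before the code ever runs. */
int main(void)
{
    uint8_t slot = 0;
    for (;;) {
        wait_for_tick();
        read_sensor();             /* 100 Hz: every frame          */
        filter();                  /* 100 Hz: every frame          */
        if ((slot & 1u) == 0u)
            control_law();         /* 50 Hz: every other frame     */
        if (slot == 3u)
            write_actuator();      /* 25 Hz: once per major frame  */
        slot = (uint8_t)((slot + 1u) & 3u);
    }
}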
I suppose also that for Curiosity there are plenty of backup mechanisms. The software is certainly very complex, but in most of it, possible bugs don’t compromise the mission and can be patched remotely.
[As an afterthought, I realise that this post is not that much related to the subject. Sorry!]
I was reading the JPL C coding guidelines the other week. While I can understand the rationale behind most of them, they did look very impractical. Take the no-recursion rule, for example: some very powerful data structures require recursion to be implemented efficiently (trees, anyone?).
Luckily, I got to talk to someone who used to work on static code analysis on NASA code. Two things struck me:
– They don’t always follow their own guidelines, as they can lead to worse and buggier code (see the recursion example above).
– These guidelines are written by researchers and paper pushers, while the actual coding is done by an entirely different group of people. IMHO it shows in the guidelines: they sound nice in theory, but I can see so many problems with them in practice.
While NASA (and similar agencies) are always proud of their strict rules and rugged procedures for producing perfect code and spacecraft, one has to remember that these are also part PR stunt, and that reality is always a lot greyer.
It’s even more impractical if a coding error requires tons more rocket fuel and metal to fix.
Recursion cannot be checked statically: in general, you can’t put a static bound on the stack depth.
You can emulate recursion iteratively with a custom, controlled stack. It will also be faster on RISC architectures. Recursion is dangerous (it can blow through the limits of the allocated stack, for example).
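For instance, an in-order walk of a binary tree with an explicit, fixed-size stack – a sketch of the pattern (the depth bound and the names are my own choices), not any particular flight code:

#include <stddef.h>

struct node {
    int          key;
    struct node *left;
    struct node *right;
};

#define MAX_DEPTH 64 /* bound chosen up front, so it can be reviewed statically */

/* In-order traversal without recursion: the stack is our own array,
 * so overflow becomes an explicit, testable error instead of a crash.
 * Writes keys in sorted order into out[]; returns the number written,
 * or -1 if the tree is deeper than MAX_DEPTH or out[] is too small. */
long inorder(const struct node *root, int out[], size_t max_out)
{
    const struct node *stack[MAX_DEPTH];
    size_t top = 0, n = 0;
    const struct node *cur = root;

    while (cur != NULL || top > 0) {
        while (cur != NULL) {
            if (top == MAX_DEPTH)
                return -1;        /* depth bound exceeded: fail loudly */
            stack[top++] = cur;
            cur = cur->left;
        }
        cur = stack[--top];
        if (n == max_out)
            return -1;
        out[n++] = cur->key;
        cur = cur->right;
    }
    return (long)n;
}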
“As they can lead to worse and buggier code (see the recursion example above).”
Bad example. Recursion (a function calling itself) doesn’t play nicely with modern hardware anyway.
Would you mind elaborating on why you think NASA having strict rules & procedures is “part PR stunt”?
Now, we can’t know what he meant… but remember that NASA, at least for a large part of its history, was (also!) about PR – that was basically the point of the Space Race, after all.
http://en.wikipedia.org/wiki/Rogers_Commission_Report#Role_of_Richa… – those observations shed some light on the organisational adherence to NASA’s professed values, and on the integrity of the rules themselves. Yes, that was 2+ decades ago, but… http://en.wikipedia.org/wiki/Columbia_Accident_Investigation_Board#… – and not long after that, NASA was pushing for the quite flawed Ares I (the inherent vibrations of the “stick”, and the unlikelihood of a successful escape-system deployment – in the event of catastrophic failure, the parachute descent would likely happen through a cloud of burning metal particles; and, generally, the continuation of SRBs, which got human-rated only by lowering the standards).
Even the whole STS programme (and the failure to terminate it at a good opportunity in the late 80s…) can also be seen as one big stunt – sure, it looked awesome (kinda like something from… our dreams http://www.osnews.com/permalink?526126 ), absolutely, but… it didn’t really deliver on any of its goals as originally advertised. Moreover, it was probably conceptually obsolete before even seriously getting on the drawing board (considering that autonomous, unmanned rendezvous and docking had already been done in the 60s).
Other space agencies are kinda similar… even if they occasionally managed to be somewhat more sensible (like how the Russians pragmatically used, and kept using, the first operational ICBM http://en.wikipedia.org/wiki/File:GPN-2002-000184.png …and not nearly only for manned launches, eventually making it “the most reliable […] most frequently used launch vehicle in the world” http://www.esa.int/esaMI/Delta_Mission/SEM8XK57ESD_0.html – a century of service seems well within its grasp, considering the just-inaugurated launch complex in French Guiana). There were still brain-farts here and there, though… Why exactly did the Soviets run three or more separate manned programmes and vehicles in parallel throughout the 70s and 80s, when in the end only one of them ever actually launched with people on board? (And that’s not the only such example.) They didn’t have a centralised governing body in the style of NASA, and there was often a lot of infighting between the bureaus – but a) that was a horrible way of doing these things, and b) still, the irony of the communists doing it competition-style is priceless ;p
PS. Present US spacesuits (luckily not the planned ones) are perhaps also an illustrative example… where did the thinking behind their core design even come from? A basic difference in concept and construction (entering into a spacesuit vs dressing in one – the former treating it as what it is, a miniature spacecraft) means that while donning an Orlan takes a dozen or so minutes, US suits take over an hour, for no real benefit (it only greatly complicates procedures and raises the likelihood of failure).
“They Write the Right Stuff” http://www.fastcompany.com/28121/they-write-right-stuff
It’s a bit old, but still a good read, giving some insight.
On “Some of the traits of the Curiosity Rovers software closely resembles the architecture of Erlang. Are these traits basic for writing robust software?” – well, I’d say it doesn’t resemble Erlang so much as good thinking and good engineering, which may well have been what guided Erlang’s designers too (I don’t know).
On the other hand: Megalines? Really? That’s not geeky, that’s stupid. What comes next, hectolines?
Should be mebisloc.
Megalines is ambiguous as to whether it’s a large number of lines, or just a number of huge lines.
This is indeed a good article.
There is the old divide between “hackers” and “engineers”.
Critical software is done by engineers: there are no very clever hacks, the software must be readable and maintainable for more than 20 years, and everything is abundantly documented and based on requirements which are later verified. The software must do exactly what it is expected to do, nothing more.
Of course, it is extremely expensive, and it would certainly be pointless to apply this development method to all software; it also certainly stifles creativity.
The “creativity” happens in the engineering part, where it belongs, not in the coding part, which introduces too many bugs.
Though I do understand the benefits of the Erlang environment, the methodologies aren’t anything new. Many of us have been following the “ideas” of Erlang in many languages, so I find the tone of the article a little condescending.
“The operating system is VxWorks. This is a classic microkernel. A modest guess is that the kernel is less than 10 Kilolines of code and is quite battle tested. In other words, this kernel is near bug free.”
This quote from the author reduces his credibility to near zero…
VxWorks is by no means bug-free! We find bugs in it on a regular basis. If you look at the VxWorks kernel source code, it looks surprisingly like the Emacs code base: about a 4:1 ratio of C code to #defines. As of about 2 years ago, Wind River did not even have an R&D division… It has a lot of mileage, so some of it is well tested, but it’s still spaghetti code. Our kernels are about 3 MiB for an embedded system – not small by any standard except a desktop’s.
Why is the CPU only 200 MHz and manufactured on an outdated 150 nm process?
Radiation hardening. Bear in mind that the Earth’s magnetic field and atmosphere shield us from quite a bit of that, while out in space and/or on Mars, a significant number of cosmic rays and other energetic particles will be encountered. On hardware that’s not radiation-hardened, that easily results in bits being flipped in memory and transistor states being changed, which makes things go wrong quite rapidly. Generally speaking, the smaller the manufacturing process/feature size, the more sensitive the chip is to that kind of upset; and radiation hardening adds significant cost and testing for relatively limited use cases, which is why it tends to lag a few generations behind the latest consumer products.
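On the software side, even rad-hard systems often guard critical state against single-event upsets. One classic trick is triple modular redundancy: keep three copies of a value and take a bitwise majority vote on every read, so a flipped bit in any single copy is outvoted. A toy sketch of the idea (not how any particular spacecraft does it):

#include <stdint.h>

/* Three copies of a critical variable. Ideally each copy would live in
 * a different memory region, so one upset can't hit two of them. */
struct tmr_u32 {
    volatile uint32_t a, b, c;
};

static void tmr_write(struct tmr_u32 *v, uint32_t x)
{
    v->a = x;
    v->b = x;
    v->c = x;
}

static uint32_t tmr_read(struct tmr_u32 *v)
{
    uint32_t a = v->a, b = v->b, c = v->c;
    /* Bitwise majority: each result bit is whatever at least two of
     * the three copies agree on. */
    uint32_t x = (a & b) | (a & c) | (b & c);
    tmr_write(v, x); /* scrub: repair the odd copy out */
    return x;
}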
It’s more than “bits being flipped in memory and transistor states being changed” – consumer-grade hardware would likely be destroyed outright (see the picture in the “Radiation effects and environments” section of http://www.esa.int/SPECIALS/Space_Engineering/SEM6W34TBPG_0.html ).
Now, the ISS does have ~100 ThinkPads on board that seem to be doing quite well, but:
a) critical systems, in particular the station’s main computers (IIRC i386-based), are rad-hard nonetheless
b) I imagine that one of the reasons for an order of magnitude more laptops than humans is to always have a spare at hand
c) in LEO, the station is still quite well protected by our magnetosphere (hm, and I wouldn’t be surprised if the tenuous Martian atmosphere – with no magnetosphere – occasionally makes things harder on the rovers, by turning single high-energy particles into showers of them)
Maybe most usage scenarios for rad-hard processors are quite content with the processing power they already have? It kinda seems like that for most avionics, or for systems working in the direct vicinity of nuclear reactors, particle accelerators, or medical uses of radiation (which generally just transmit readouts, with the heavy processing done elsewhere).
So it looks like autonomous rovers, and space science missions in general (already collecting such vast amounts of ~astronomical data that there’s no way to transmit it all – the first stages of analysis must be done onboard), might represent an even more limited use case.
(Plus, pre-launch preparations usually take something like a decade, so a mission can realistically use only what’s available a few years before launch.)
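On that onboard-reduction point, the simplest form of the idea is something like this: downlink a heavily reduced thumbnail first, and send the full frame only if the ground asks for it. A toy sketch (the sizes and the block-averaging scheme are made up; this is not mission code):

#include <stdint.h>

#define W 1024
#define H 1024
#define F 8 /* reduction factor: each 8x8 block becomes one pixel */

/* Average each FxF block of the full image down to one thumbnail
 * pixel, cutting the data volume of the first downlink by F*F = 64x. */
void thumbnail(const uint8_t img[H][W], uint8_t thumb[H / F][W / F])
{
    for (unsigned ty = 0; ty < H / F; ty++) {
        for (unsigned tx = 0; tx < W / F; tx++) {
            uint32_t sum = 0;
            for (unsigned y = 0; y < F; y++)
                for (unsigned x = 0; x < F; x++)
                    sum += img[ty * F + y][tx * F + x];
            thumb[ty][tx] = (uint8_t)(sum / (F * F));
        }
    }
}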