Google can’t seem to catch a break when it comes to Chrome OS 91. First we saw many users reporting their devices using an egregious amount of CPU after upgrading to 91.0.4472.147. While Google pulled the update shortly thereafter and rolled everyone back to 91.0.4472.114, that managed to lock out Linux apps. Now we’re seeing the arrival of 91.0.4472.165, and this update introduces an awful bug that’s breaking Chromebooks left and right.
[…] So what happened? Thanks to the work of an eagle-eyed user on Reddit, we now know that a single typo appears to be responsible for locking so many users out of their Chromebooks. Looking at the diff in this file, we can see that Google forgot to add a second “&” to a conditional statement, preventing Chrome OS from decrypting your login information (which is required to log you in).
This kind of sloppiness is what you get in an industry where there really aren’t any consequences to speak of for screwing things up. It’s not like software development is a real industry with strict product safety laws or anything.
“unlawful sexual harassment;”
If there is such a thing as “lawful” sexual harassment. This is disgusting. In the company where I work, anyone engaged in improper behavior is automatically barred and terminated, and a case is filed against them. This is in the Philippines. However, there are still cases of abuse in our country, of course, because companies have certain freedoms to impose discipline inappropriately and have tornado-bosses.
[s] This should not be happening in the so-called “civilized” world that is the western world. [/s]
You should post that here: https://www.osnews.com/story/133738/california-sues-activision-blizzard-over-a-culture-of-constant-sexual-harassment/#comments
Thom, you cannot imagine how widespread the practice of not testing your software is in the industry. The pressure of fast releases paced by “agile” development cycles means coders no longer have time to properly stress-test their own work. Plus the arrogance of working in a high position makes them believe that “if it compiles, it works.” Sure, the software tooling has improved and the frameworks help, but as you can see, a simple typo can break things easily. It’s not as if we haven’t already experienced this kind of failure in the space industry (reusing Ariane 4 software in Ariane 5 without accounting for its larger flight values, or a Mars probe crashing because of a conversion between imperial and metric units).
There have been common practices and methodologies for the past 50 years, like the good ol’ V cycle, and test tooling like CI/CD pipelines executing a broad range of automated tests on submitted code, but no, why bother when you are already at Chrome OS 91 (new badge at 100, achievement unlocked).
Fast release cycles, which are the norm now, mean very little testing can be done before code is pushed out to users. This is inevitably causing declining software quality and users seeing show-stopping bugs more often. I’m not sure if this is true generally, but at the places I have worked there are not enough people doing testing or QA, and even if there were, there is not enough time for testing to be done properly, and bugs fixed properly, before software gets shipped.
It would shock me if CI/CD weren’t in place for ChromeOS, but CI/CD isn’t a silver bullet to catch all bugs, and CI/CD isn’t a replacement for actual human QA engineers. This was an update that was going to be released on physical hardware, so it should have gone through at least a basic QA cycle. The developers should have generated a list of the areas of the system that were touched, and QA engineers should have focused their testing efforts on those areas.
Actually, agile development practices prescribe the use of test-driven development. This means no code can enter production without a test verifying it does what it should do. Clearly Google is *not* using agile development practices here.
Z_God
They are but it’s a variation called “fragile” 🙂
On a serious note, this stuff happens regardless of how big a company you are. It happens with google, apple, microsoft, oracle, banks, airlines, etc.
https://mjtsai.com/blog/2020/11/13/apple-server-outage-makes-mac-apps-hang-on-launch/
https://www.cnn.com/2021/07/22/tech/website-outages-akamai-oracle-disruptions/index.html
https://www.cbsnews.com/news/online-outage-major-airlines-australian-banks-brief/
I think the lesson to be learned is don’t trust everything to a monoculture solution provider. It doesn’t matter that they claim all their updates & cloud goodness is safe & reliable, you need to have a backup, otherwise you’re setting yourself up for Murphy’s law. I know the consumers & end users don’t know any better, but the project managers should. Having all one’s eggs in one basket is irresponsible in tech, yet I don’t think many are taking this seriously. It’s like society can’t help but recreate the conditions that result in companies that are “too big to fail”.
No, Google is not using agile, but code is always tested before it goes to production.
There is a nice public overview of the internal Google coding repository on this talk:
https://www.youtube.com/watch?v=W71BTkUbdqE
That being said, there are also -dev and -beta channels for Chrome OS. And in theory all code needs to pass through them before reaching the widespread -stable distribution.
Obviously something failed along the road. It could be a small oversight, or a technical error. I can only speculate, but can also hope there will be updates to release mechanisms, so that the same exact thing will not happen in the future.
You’d have thought it would be regulated after the Therac-25 incidents
https://en.wikipedia.org/wiki/Therac-25
Seems mankind will never learn.
I think it goes deeper than that. It isn’t just coding and testing (or, as the case may be, not-testing) practices…
I’m a retired computer programmer, with over 30 years of experience. The single most common cause of errors, in my experience, is mistyped variable names. As such, languages that make explicit variable declarations mandatory (you can still mistype a variable name, but you have to mistype it the same way every time you use it or the compiler throws an error) are a godsend… strongly typed variables are a really helpful anti-error mechanism too…
… and the trend is clearly toward languages where explicit variable declarations and strongly typed variables are not even possible.
Don Edwards,
I agree the language helps, but I don’t know that I’d personally rank variable names as a major source of bugs. Granted it might be easy to type in the wrong name, but the code generally won’t compile at all for compiled languages. And for scripted code, there’s often a strict mode (“use strict”, “option explicit”, etc.). A bad name should be caught in testing, if anyone is actually testing the code, haha.
IMHO race conditions, pointer errors, and range-checking bugs are a major source of trouble, since they can easily go unnoticed during testing. The result can be intermittent crashes and vulnerabilities in production. I’ve even seen times when adding debug symbols coincidentally caused the crashes to stop. I find these kinds of intermittent bugs to be the most frustrating.
The more verification that can be handed off to the compiler the better. Humans are fallible. Even when we understand the program and the kinds of bugs that can happen, our ability to verify logic rules gets worse as complexity increases. IMHO we should be encouraging the use of safe languages because humans simply cannot beat computers at consistency.
Let’s not forget that using ancient tools and programming languages that even allow programmers to make such a mistake also contributes to this. People make mistakes; it’s unavoidable. Especially such a trivial mistake should be detected by the tooling, which should give sound warnings or, better, errors.
Obviously this error should never have made it out to users, but after 3+ decades in software development, there’s absolutely no doubt in my mind that software tooling, procedures and overall quality have improved, and continue to improve.
You can certainly argue that this improvement should have happened faster. I’d agree that both loose regulation and companies trying to gain competitive advantage through skipping good hygiene practices have played a part in this. But the trend in my view has been and remains a positive one irrespective.
The easiest way to make a dumb mistake is in an attempt to fix a hugely impactful production issue caused by a previous dumb mistake. The more mistakes you make in trying to solve an issue in production, the dumber you become and the likelihood of making more increases due to embarrassment and plain adrenaline pumping through your veins.
This again. I can’t believe paid software is delivered “AS IS WITHOUT WARRANTY”. This means that, as things become increasingly computerized, nothing will have a warranty. For example, they can already sell you a SmartTV or thermostat that fails to boot or fails to deliver some major piece of promised functionality and it can be that way for months or even forever without them even having to refund you. Any refunds happen voluntarily (for cases where things fail to boot). Expect this to expand to kitchens, fridges, washing machines etc. Along with the relevant shrinkwrap contract of course.
Now, don’t get me wrong, I don’t expect commercial software to be theoretically proven in Event-B or anything (which is a requirement for things like railway software), just like I don’t expect car manufacturers to theoretically prove the lifespan of their products (which is required in the space industry for example), but dammit, software should come with some kind of a warranty and vendors should have some degree of liability, just like car manufacturers do.
Of course Google has recognized the problem with the short release cycle and has promised to fix it by making the release cycle even shorter, so not only can’t they fix anything, they won’t have enough time to break stuff either.
I did a search on “Chrome OS update pulled” over a range of dates going back to 2010, and it is a relatively rare event, albeit one that has happened every year starting in 2018. Then you look at the same search for Windows…
The issue is that software is fundamentally brittle. The behavior of the code or config file with the missing “&” differs from the correct version by far more than 1/total-characters would suggest. Maybe we need to be able to do a “diff” at the behavior level rather than the character level. I am thinking of a Tesla that mixes up go and stop for red and green. That could be as simple as a missing “-”. It is going to get very serious.
If they started to use a decent programming language that does not allow both & and && in the first place, but instead something like & and ‘and’ (literally), that would help by such a factor…
The language makes this case look like a syntax error, but the distinction between & and && is subtle whatever names you use for them. You could argue this is a type error (&& should only apply to booleans, & to bitfields), but there will always be subtle logic errors that creep into code. It’s the nature of the beast… and the undecidability of code.
Yet some programming languages’ syntax is more robust than others’.
I don’t disagree with you and we should always aim to refine and improve. But I guess my point is that no matter how good a language is at discouraging errors like this, it doesn’t replace the need for good practices that go beyond relying on language features (e.g. good reviewing and testing processes). And even then this sort of thing will still happen.
flypig,
Unless we’re contemplating new languages, I don’t think it’s realistic to fix existing languages at this point, because changes would cause massive chaos and confusion. Still, it’s not ideal that C (and its derivatives) reuse so many symbols. It’s not just bit manipulation; pointers also reuse the same & symbol.
Bit-field manipulation is relatively rare in most modern programming despite the fact that it can apply to any integer; those uses could probably be replaced with a function. JavaScript and PHP copy a lot of syntax from C, but I doubt 1 in 10,000 PHP/JavaScript developers have intentionally done bit-field manipulation in those languages. While I’ve used bit manipulation in C, I still think it’s far less common than boolean and pointer logic. Also, I miss having an exponentiation operator, and I think “^” would be more useful for exponentiation than xor.
Languages like C++ can allow us to overload operators to do whatever we want using types, so we could create a class where ^ does exponentiation, so basing it on types could technically work. This is a cool feature, but IMHO allowing operators to have different meanings only increases confusion and surprises.
Yes, being able to repurpose syntax and names lies at the heart of C++ polymorphism. Some people love it, some people hate it.
You might like this paper by Nanz and Furia. It’s a great empirical resource in relation to this, especially RQ5 where they look at which languages are more or less failure prone (likelihood of runtime failures).
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.742.2902&rep=rep1&type=pdf
Go has a clear lead over other languages and they conclude:
At the other end of the scale, the interpreted languages (Python, Ruby) exhibit more runtime errors. It’s a bit surprising, but Java turns out to be an anomaly: both compiled and runtime error-prone.
Sadly Rust isn’t included in the results. The paper’s a few years old now, so things may have changed since it was published.
flypig,
I wasn’t referring to polymorphism though. I was referring to operator overloading where you can redefine what operators like “+” and “++” do, etc…
https://en.cppreference.com/w/cpp/language/operators
It’s cool when you’re learning about it, but then you should never do it again, haha.
Regarding polymorphism, it’s very useful to be able to create multiple classes and make them implement an API. C++ would be worse without such an ability, however the way C++ uses multiple inheritance is somewhat worse compared to other languages that have class interfaces, IMHO.
I personally despise Go’s use of forced letter casing though. In principle a language should be able to use Unicode variable names, including letters that don’t have a case. Plenty of people get by with it, but IMHO it’s a shortcoming that other languages don’t have. But that’s just me.
Rust has problems with forwards/backwards compatibility between builds, which needs to improve. But they’re pushing the boundaries of code verification further than any other language I know of.