Tuesday, December 26, 2017

Embedded System Safety Lecture Video Series

I've posted the full series of embedded system safety lecture videos on YouTube.  These are full-length narrated slides of the core set of safety topics from my new course.  They concentrate on getting the big picture about failures, redundancy management, and safety integrity.
Each lecture is posted to YouTube as a playlist, with each video covering one substantive slide. The full lecture consists of playing the entire playlist, with most lectures being 5-7 videos in sequence. (The slide download has been updated for my Fall 2017 course, so in general it has a little more material than the original video. They'll get synchronized eventually, but for now this is what I have.)

Obviously there is more to safety than just these topics. Supporting topics such as code quality and development approaches are currently just available as slides.  You can see the full set of course slides here:

Monday, November 27, 2017

Embedded Software Course Notes On-Line

I'm just wrapping up my first semester teaching a new course on embedded system software. It covers code quality, safety, and security. Below is a table of lecture handouts.

18-642 Embedded System Software Engineering
Prof. Philip Koopman, Carnegie Mellon University, Fall 2017

1 | Course Introduction | Software is eating the world; embedded applications and markets; bad code is a problem; coding is 0% of software; truths and management misconceptions
2 | Software Development Processes | Waterfall; swiss cheese model; lessons learned in software; V model; design vs. code; agile methods; agile for embedded
3 | Global Variables | Global vs. static variables; avoiding and removing globals
4 | Spaghetti Code | McCabe Cyclomatic Complexity (MCC); SCC; Spaghetti Factor (SF)
5 | Unit Testing | Black box testing; white box testing; unit testing strategies; MCDC coverage; unit testing frameworks (cunit)
6 | Modal Code/Statecharts | Statechart elements; statechart example; statechart implementation
7 | Peer Reviews | Effective code quality practices; peer review efficiency and effectiveness; Fagan inspections; rules for peer review; review report; perspective-based reviews; review checklist; case study; economics of peer review
8 | Code Style/Humans | Making code easy to read; good code hygiene; avoiding premature optimization; coding style
9 | Code Style/Language | Pitfalls and problems with C; language use guidelines and analysis tools; using language wisely (strong typing); Mars Climate Orbiter; deviations & legacy code
10 | Testing Quality | Smoke testing; exploratory testing; methodical test coverage; types of testing; testing philosophy; coverage; testing resources
11 | Requirements | Ariane 5 flight 501; rules for good requirements; problematic requirements; extra-functional requirements; requirements approaches; ambiguity
12 | System-Level Test | First bug story; effective test plans; testing won't find all bugs; F-22 Raptor date line bug; bug farms; risks of bad software
13 | SW Architecture | High Level Design (HLD); boxes and arrows; sequence diagrams (SD); statechart to SD relationship; 2011 Health Plan chart
14 | Integration Testing | Integration test approaches; tracing integration tests to SDs; network message testing; using SDs to generate unit tests
15 | Traceability | Traceability across the V; examples; best practices
16 | SQA isn't testing | SQA elements; audits; SQA as coaching staff; cost of defect fixes over project cycle
17 | Lifecycle CM | A400M crash; version control; configuration management; long lifecycles
18 | Maintenance | Bug fix cycle; bug prioritization; maintenance as a large cost driver; technical debt
19 | Process Key Metrics | Tester to developer ratio; code productivity; peer review effectiveness
33 | Date Time Management | Keeping time; time terminology; clock synchronization; time zones; DST; local time; sunrise/sunset; mobility and time; date line; GMT/UTC; leap years; leap seconds; time rollovers; Zune leap year bug; internationalization
21 | Floating Point Pitfalls | Floating point formats; special values; NaN and robots; roundoff errors; Patriot Missile mishap
23 | Stack Overflow | Stack overflow mechanics; memory corruption; stack sentinels; static analysis; memory protection; avoid recursion
25 | Race Conditions | Therac 25; race condition example; disabling interrupts; mutex; blocking time; priority inversion; priority inheritance; Mars Pathfinder
27 | Data Integrity | Sources of faults; soft errors; Hamming distance; parity; mirroring; SECDED; checksum; CRC
20 | Safety+Security Overview | Challenges of embedded code; it only takes one line of bad code; problems with large scale production; your products live or die by their software; considering the worst case; designing for safety; security matters; industrial controls as targets; designing for security; testing isn't enough; Fiat Chrysler jeep hack; Ford Mytouch update; Toyota UA code quality; Heartbleed; Nest thermostats; Honda UA recall; Samsung keyboard bug; hospital infusion pumps; LIFX smart lightbulbs; German steel mill hack; Ukraine power hack; SCADA attack data; Shodan; traffic light control vulnerability; hydroelectric plant vulnerability; zero-day shopping list
22 | Dependability | Dependability; availability; Windows 2000 server crash; reliability; serial and parallel reliability; example reliability calculation; other aspects of dependability
24 | Critical Systems | Safety critical vs. mission critical; worst case and safety; HVAC malfunction hazard; Safety Integrity Levels (SIL); Bhopal; IEC 61508; fleet exposure
26 | Safety Plan | Safety plan elements; functional safety approaches; hazards & risks; safety goals & safety requirements; FMEA; FTA; safety case (GSN)
28 | Safety Requirements | Identifying safety-related requirements; safety envelope; Doer/Checker pattern
29 | Single Points of Failure | Fault containment regions (FCR); Toyota UA single point failure; multi-channel pattern; monitor pattern; safety gate pattern; correlated & accumulated faults
30 | SIL Isolation | Isolating different SILs; mixed-SIL interference sources; mitigating cross-SIL interference; isolation and security; CarShark hack
31 | Redundancy Management | Bellingham WA gasoline pipeline mishap; redundancy for availability; redundancy for fault detection; Ariane 5 Flight 501; fail operational; triplex modular redundancy (TMR) 2-of-3 pattern; dual 2-of-2 pattern; high-SIL Doer/Checker pattern; diagnostic effectiveness and proof tests
32 | Safety Architecture Patterns | Supplemental lecture with more detail on patterns: low SIL; self-diagnosis; partitioning; fail operational; voting; fail silent; dual 2-of-2; Ariane 5 Flight 501; fail silent patterns (low, high, mixed SIL); high availability mixed SIL pattern
34 | Security Plan | Security plan elements; Target Attack; security requirements; threats; vulnerabilities; mitigation; validation
35 | Cryptography | Confusion & diffusion; Caesar cipher; frequency analysis; Enigma; Lorenz & Colossus; DES; AES; public key cryptography; secure hashing; digital signatures; certificates; PKI; encrypting vs. signing for firmware update
36 | Security Threats | Stuxnet; attack motivation; attacker threat levels; DirecTV piracy; operational environment; porous firewalls; Davis Besse incident; BlueSniper rifle; integrity; authentication; secrecy; privacy; LG Smart TV privacy; DoS/DDoS; feature activation; St. Jude pacemaker recall
37 | Security Vulnerabilities | Exploit vs. attack; Kettle spambot; weak passwords; master passwords; crypto key length; Mirai botnet attack; crypto mistakes; LIFX revisited; CarShark revisited; chip peels; hidden functionality; counterfeit systems; cloud connected devices; embedded-specific attacks
38 | Security Mitigation Validation | Password strength; storing passwords & salt/pepper/key stretching; Adobe password hack; least privilege; Jeep firewall hack; secure update; secure boot; encryption vs. signing revisited; penetration testing; code analysis; other security approaches; rubber hose attack
39 | Security Pitfalls | Konami code; security via obscurity; hotel lock USB hack; Kerckhoffs's principle; hospital WPA setup hack; DeCSS; Lodz tram attack; proper use of cryptography; zero day exploits; security snake oil; realities of in-system firewalls; aircraft infotainment and firewalls; zombie road sign hack

Note that in Spring 2018 these are likely to be updated, so if you want to see the latest, also check the main course page: https://www.ece.cmu.edu/~ece642/  For other lectures and copyright notes, please see my general lecture notes & video page: https://users.ece.cmu.edu/~koopman/lectures/index.html

Friday, November 17, 2017

Highly Autonomous Vehicle Validation

Here are the slides from my TechAD talk today.

Highly Autonomous Vehicle Validation from Philip Koopman

Highly Autonomous Vehicle Validation: it's more than just road testing!
- Why a billion miles of testing might not be enough to ensure self-driving car safety.
- Why it's important to distinguish testing for requirements validation vs. testing for implementation validation.
- Why machine learning is the hard part of mapping autonomy validation to ISO 26262.

Monday, October 9, 2017

Top Five Embedded Software Management Misconceptions

Here are five common management-level misconceptions I run into when I do design reviews of embedded systems. How many of these have you seen recently?

(1) Getting to compiled code quickly indicates progress. (FALSE!)

Many projects are judged by "coding completed" to indicate progress.  Once the code has been written, compiles, and kind of runs for a few minutes without crashing, management figures that they are 90% there.  In reality, a variant of the 90/90 rule holds:  the first 90% of the project is in coding, and the second 90% is in debugging.

Measuring teams on code completion pressures them to skip design and peer reviews, ending up with buggy code. Take the time to do it right up front, and you'll more than make up for those "delays" with fewer problems later in the development cycle.  Rather than measure "code completed," do something more useful, like measure the fraction of modules with "peer review completed" (and defects found in peer review corrected).  There are many reasonable ways to manage, but running a waterfall-ish project that treats "code completed" as the most critical milestone is not one of them.

(2) Smart developers can write production-quality code on a long weekend (FALSE!)

Alternate form: marketing sets both requirements and end date without engineering getting a chance to spend enough time on a preliminary design to figure out if it can actually be done.

The true bit is that anyone can slap together some code that doesn't work.  Some folks can slap together code in a long weekend that almost works.  But even the best of us can only push so many lines of code in a short amount of time without making mistakes, much less producing something anyone else can understand.  Many of us remember putting together hundreds or thousands of lines on an all-nighter when we were students. That should not be mistaken for writing production embedded code.

Good embedded code tends to cost about an hour for every 1 or 2 lines of non-comment code all-in, including testing (on a really good day 3 lines/hr).  Some teams come from the Lake Wobegon school, where all the programmers are above average.  (Is that really true for your team?  Really?  Good for you!  But you still have to pay attention to the other four items on this list.)  And sure, you can game this metric if you try. Nonetheless, it is remarkable how often I see a number well above about 2 SLOC/hour of deeply embedded code corresponding to a project that is in trouble.
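To get a feel for what that rate implies for a schedule, here is a rough back-of-the-envelope sketch in Python. The function name, the 10,000 SLOC project size, and the 1,800 working hours per person-year are illustrative assumptions, not figures from any particular project.

```python
def effort_hours(sloc, sloc_per_hour):
    """All-in effort in hours (design, coding, review, test) at a given rate."""
    if sloc_per_hour <= 0:
        raise ValueError("sloc_per_hour must be positive")
    return sloc / sloc_per_hour


# Hypothetical project size and staffing numbers, chosen only for illustration.
PROJECT_SLOC = 10_000
HOURS_PER_PERSON_YEAR = 1_800

# 1 and 2 SLOC/hour are the typical range; 3 is the "really good day" rate.
for rate in (1, 2, 3):
    hours = effort_hours(PROJECT_SLOC, rate)
    person_years = hours / HOURS_PER_PERSON_YEAR
    print(f"{rate} SLOC/hr: {hours:,.0f} hours, about {person_years:.1f} person-years")
```

Under these assumed numbers, the 1 to 2 SLOC/hour range works out to roughly 2.8 to 5.6 person-years for 10,000 lines, which is a long way from a long weekend.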

Regardless of the precise productivity number, if you want your system to really work, you need to treat software development as a core competency.  You need an appropriately methodical and rigorous engineering process. Slapping together code quickly gives the illusion of progress, but it doesn't produce reliable products for full-scale production.

(3) A “mostly working,” undisciplined prototype can be deployed.  (FALSE!)

Quick and dirty prototypes provide value by giving stakeholders an idea of what to expect and allowing iterations to converge on the right product. They are invaluable for solidifying nebulous requirements. However, such a prototype should not be mistaken for an actual product!  If you've hacked together a prototype, in my experience it's always more expensive to clean up the mess than it is to take a step back and start a project from scratch or from a stable production code base.

What the prototype gives you is a solid sense of requirements and some insight into pitfalls in design.

A well-executed incremental deployment strategy can be a compromise to iteratively add functionality if you don't know all your requirements up front. But a well-run Agile project is not what I'm talking about when I say "undisciplined prototype." A cool proof of concept can be very valuable.  It should not be mistaken for production code.

(4) Testing improves software quality (FALSE!)

If there are code quality problems (possibly caused by trying to bring an undisciplined prototype to market), the usual hammer that is brought to bear is more testing.  Nobody ever solved code quality problems by testing. All that testing does is make buggy code a little less buggy. If you've got spaghetti code that is full of bugs, testing can't possibly fix that. And testing will generally miss most subtle timing bugs and non-obvious edge cases.

If you're seeing lots of bugs in system test, your best bet is to use testing to find bug farms. The 90/10 rule applies: many times 90% of the bugs are in bug farms -- the worst 10% of the modules. That's only an approximate ratio, but regardless of the exact number, if you're seeing a lot of system test failures then there is a good chance some modules are especially bug-prone.  Generally the problem is not simply programming errors, but rather poor design of these bug-prone modules that makes bugs inevitable. When you identify a bug farm, throw the offending module away, redesign it clean, and write the code from scratch. It's tempting to think that each bug is the last one, but after you've found more than a handful of bugs in a module, who are you kidding? Especially if it's spaghetti code, bug farms will always be one bug away from being done, and you'll never get out of system test cleanly.
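One way to spot bug farms is simply to rank modules by how many system test bugs have been traced to them and look at what share of the total sits in the worst 10%. The sketch below does that in Python; the function name, the module names, and the bug counts are all made up for illustration, and in practice the counts would come from your own bug tracker.

```python
import math


def find_bug_farms(bugs_per_module, worst_fraction=0.10):
    """Return (worst modules by bug count, their share of all bugs)."""
    total = sum(bugs_per_module.values())
    if total == 0:
        return [], 0.0
    # Rank modules from most to fewest bugs.
    ranked = sorted(bugs_per_module, key=bugs_per_module.get, reverse=True)
    # Always look at at least one module, even in a small system.
    count = max(1, math.ceil(len(ranked) * worst_fraction))
    farms = ranked[:count]
    share = sum(bugs_per_module[m] for m in farms) / total
    return farms, share


# Hypothetical system-test bug counts per module (illustrative data only).
bug_counts = {
    "motor_ctl": 46, "can_rx": 2, "ui_menu": 1, "eeprom": 0, "adc_filter": 1,
    "watchdog": 0, "scheduler": 1, "logger": 0, "bootloader": 0, "diag": 0,
}

farms, share = find_bug_farms(bug_counts)
print(f"Candidate bug farms: {farms}, holding {share:.0%} of all bugs")
```

If the share for the worst 10% comes out anywhere near 90%, those modules are the candidates for a redesign rather than one more round of bug fixes.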

(5) Peer review is too expensive (FALSE!)

Many, many projects skip peer review to get to completed code (see item #1 above). They feel that they just don't have time to do peer reviews. However, good peer reviews are going to find 50-75% of your bugs before you ever get to testing, and do so for about 10% of your development budget.  How can you not afford peer reviews?   (Answer: you don't have time to do peer reviews because you're too busy writing bugs!)
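To make the arithmetic concrete, here is a small Python sketch of how many defects are left for testing to find at different peer review yields. The function name and the figure of 200 injected defects are hypothetical; only the 50-75% range comes from the discussion above.

```python
def defects_reaching_test(defects_injected, review_yield):
    """Defects left for testing after peer review removes a fraction of them."""
    if not 0.0 <= review_yield <= 1.0:
        raise ValueError("review_yield must be between 0 and 1")
    return defects_injected * (1.0 - review_yield)


DEFECTS_INJECTED = 200  # hypothetical number of defects written into the code

# 0% models skipping peer review; 50% and 75% bracket a good review process.
for review_yield in (0.0, 0.50, 0.75):
    remaining = defects_reaching_test(DEFECTS_INJECTED, review_yield)
    print(f"review finds {review_yield:.0%}: {remaining:.0f} defects left for testing")
```

With these assumed numbers, skipping reviews leaves all 200 defects for testing, while good reviews leave 50 to 100.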

Have you run into another management misconception on a par with these? Let me know what you think!

Friday, September 22, 2017