"Given the undeniable trend towards all-encompassing change in software development, the case can be made that general purpose software is doomed to always be unreliable and buggy."




"Is this some sort of collective insanity that has somehow woven its way into our society?"

30.6.08

 

Why Software Projects Fail

OK, enough analysis and examples. Whether you agree with my approach, my points, or even my page layout (yes, I know some of you hate it, but then again, it is my page, right?), I think nearly all of us can agree on one point: software has plenty of room for improvement.

In the big scheme of all things technical, programming is pretty new. Not only that, but since languages change over the course of a few years and new ones crop up regularly (Ruby and .Net for example), it remains pretty new. While traditional engineering disciplines have decades of trial and error, research, and incremental improvement, software is being re-invented about once every decade. Think about what environments you were working in ten years ago (that would be 1998). How different was it from what you are working in today? If you are on the web side of software development, chances are that your language of choice has changed along with many, many techniques and methods. In contrast, how is a civil engineer's job different today than it was in 1998? There are probably more computer-based tools involved, but I would think that the actual core tasks have remained the same.

I bring this comparison up not to further antagonize those who disagree with the "Software vs. Physical" comparison, but to point out the one constant in software development that works against its having a solid "engineering" standing: change.

In many ways, change is the strength of software. When the typical application user was on a dial-up connection and actually connected to the internet to check their email (as opposed to "always connected" broadband), web development tools were designed to deliver lightweight, low bandwidth user interfaces. As broadband seeps out of city cores and into the suburbs, then rural areas, bandwidth becomes much less of an issue and rich user interfaces are now the goal. As the requirements for applications shifted, the software development world reacted and new, richer web development languages were created and became commonplace. This type of constant changing of the core fabric that makes up a discipline would not be possible in more "structured" disciplines such as mechanical, electrical, or civil engineering. Even if it were possible, it would not be accepted by the majority of the engineers who would be doing the work. For engineers, there is safety in standards and the time-proven techniques of the past. For software developers, there is safety in the new, "sky is the limit" world of constantly changing standards, protocols and techniques.

Given the undeniable trend towards all-encompassing change in software development, the case can be made that general-purpose software is doomed to always be unreliable and buggy. I do not share that view, though. In my view, the reason software is not reliable is not that its techniques and languages are perpetually "first generation".

It has been pointed out by several readers of this blog that reliable software does exist. They cite traffic control systems, military equipment, navigation systems, and nuclear power controls. So what is the difference between the software that guides massive tankers through the ocean in the dark of night and that new desktop application you just shelled out $59.95 for, which crashes every time you try to "undo"? It is obviously possible to create reliable software, so why is most consumer-level software so bad while most critical infrastructure software is so good?

This question could be answered from an economic standpoint. It could be a question of how much money a company is willing to spend to get a $60 piece of software completely error-free. After all, if the worst thing that happens when it crashes is that you lose the edits you were making to a photo, how much is the consumer (you) willing to pay for reliability? If the software were absolutely reliable but cost $6000 instead of $60, would you be willing to pay $5940 for that reliability? In most cases the answer is a resounding "NO!". It would be silly to claim that cost does not play a role in the level of reliability of any piece of software.

This would almost serve as a satisfying answer to the question of software reliability: it is as reliable as you are willing to pay for. Almost. But what about the exceptions to the "more money = more reliability" axiom? There are many projects I could use to illustrate my point, and many that could be used to illustrate the opposite. But I chose the following example for a reason that I hope will become obvious.

There was a project undertaken by SAIC beginning in 2001 which qualifies as one of these exceptions and will serve to illustrate my point. The project was part of an FBI modernization effort called "Trilogy"; the piece SAIC was building came to be known as the "Virtual Case File". This was a very well-funded project whose prime contractor was a very experienced enterprise development company. Despite adequate funding, though, the project fell further and further behind until it was finally scrapped altogether in 2005.

The factors that were cited as the prime causes of failure are well known by most developers who have experience on large projects:

  • Lack of a strong blueprint from the outset led to poor architectural decisions.
  • Repeated changes in specification.
  • Repeated turnover of management, which contributed to the specification problem.
  • Micromanagement of software developers.
  • The inclusion of many FBI personnel with little or no formal training in computer science as managers and even engineers on the project.
  • Scope creep as the requirements were continually added to the system even as it was falling behind schedule.
  • Code bloat due to changing specifications and scope creep. At one point it was estimated the software had over 700,000 lines of code.
  • Addition of more people and resources to the project as it was falling behind, which made it later still (Brooks's law).
  • Planned use of a "flash cutover" deployment (replacing the old system with the new one all at once), which made it difficult to adopt the system until it was perfected.

Of the causes cited, there are two real root causes:

1. Lack of planning.
2. Lack of change management.

Another project that met a similar, although not quite as fatal, end is the Chandler project, which was documented in the book Dreaming in Code. Envisioned as the ultimate personal information management application, it suffered from perpetually missed deadlines, runaway costs, and a serious tendency to wander in the code wilderness.

Many things can be blamed for projects like these, but it is clear from a careful review of the facts that a lack of funds was not the primary factor. Both of these projects were well funded from the start. There must have been other factors that doomed them. It is apparent that a big part of the problem is rooted in bad management decisions, but I think there is something more basic and more well-defined than bumbling management that, had it been present, would have made the difference between colossal failures and at least modest successes.

I believe the thing that is missing in these projects, and in software development in general, the thing that makes all of the difference, is:

Change Management

So there you have it. You have been lured into a discussion that ends up being about a typically VERY boring topic. Boring as it may be, though, I believe that Change Management, and its big brother Configuration Management, are absolutely critical to producing reliable software.

I don't think anybody starts out trying to build a failed project. Project designers, architects, and planners plan for many things: hardware, software, staffing, funding, schedule, and many other aspects of a typical project. What many of them miss, however, is planning for change. A plan is only good if it is followed, and when circumstances force the original plan to change, it is crippled. That is because the architecture, budget, schedule, staffing, and everything else that goes into a project are based on the original view of what is needed. A good software project accepts that change is inevitable and plans for it. That is why NASA's Mars rover software is reliable: as new requirements were realized, the team had a process that forced every assumption based on the original plan to be reconsidered.
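To make "planning for change" a little more concrete, here is a minimal sketch in Python. It is purely illustrative; the impact areas and names are my own invention, not NASA's (or anyone else's) actual process. The point is simply that a change cannot be accepted until every planning assumption it touches has been revisited.

from dataclasses import dataclass, field

# Planning assumptions every change must be checked against. In a real
# process these would come from the project plan, not a hard-coded list.
IMPACT_AREAS = ["architecture", "schedule", "budget", "staffing", "testing"]

@dataclass
class ChangeRequest:
    title: str
    reason: str
    impact_notes: dict = field(default_factory=dict)
    approved: bool = False

    def assess(self, area, note):
        # Record how this change affects one of the planning assumptions.
        if area not in IMPACT_AREAS:
            raise ValueError("Unknown impact area: " + area)
        self.impact_notes[area] = note

    def approve(self):
        # Refuse to approve until every impact area has been considered.
        missing = [a for a in IMPACT_AREAS if a not in self.impact_notes]
        if missing:
            raise RuntimeError("Change not assessed for: " + ", ".join(missing))
        self.approved = True

# A change cannot slip in without revisiting the original plan.
change = ChangeRequest("Add offline mode", "Field users lose connectivity")
change.assess("architecture", "Needs a local cache and a sync layer")
change.assess("schedule", "Adds roughly one iteration")
change.assess("budget", "Covered by contingency")
change.assess("staffing", "No new hires needed")
change.assess("testing", "New sync test suite required")
change.approve()

Boring? Absolutely. But it is exactly this kind of forced reconsideration that keeps the budget, the schedule, and the architecture honest when the requirements move.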

Taken to the extreme, does this mean that all planning is pointless and that all software should just be designed by creating a series of changes until the desired result is achieved? Maybe. I think that is called "Agile".

Side note: This reminds me of a new methodology I created after working in an "Agile" environment for a year:

1. Create something. Anything. "Hello World", CD organizer, whatever.
2. Submit to customer.
3. Make changes based on their feedback, giving them exactly what they asked for.
4. Repeat until the budget runs out.

While Agile methods have their place, I don't think that large software development projects are ready for them.

My point is not that planning is futile, but that a successful project must plan for change as part of the development process. The simple fact is that all of the requirements and details cannot be known up front; that is the fatal flaw of the "waterfall" methodology on large projects. The goal in planning should be to plan for the things that are known and also for the things that are not known, and it is here that most large project failures happen. Change must be part of the marrow of the project and part of its natural flow. Project team members should not have to work against change, but rather view it as a way to make the end result better.

So now that you know what my secret topic is, it is up to you whether or not you wish to continue to read my posts. It is my hope that I have drawn you in enough to at least consider that no large project is likely to succeed without adequate planning for change management, and through that avenue I can keep you interested enough to continue to explore how this approach can make software better at all levels.


M@

27.6.08

 

Software Engineering Revisited

My previous article about Software Engineering (or the lack thereof) drew an impressive number of responses. This is apparently a topic that a lot of people have an interest in. Someone posted a link to the article on Reddit.com and I have to say, those folks are brutal critics. But as with most critics, they have a very annoying way of being right.

I chose to introduce the topic of software problems by comparing software engineering with other engineering disciplines. It was pointed out a few times in comments on Reddit (ok, probably 30 times) that this was probably not a valid comparison, particularly the way I went about it. While some people may be offended by the sometimes blunt criticisms, I am not. It shows that I am being read by some very intelligent and clear thinking people. Accountability is good, and if I propose a concept, I should be prepared to back it up. Your comments (good or bad, eloquent or blunt) are always welcome and will always be posted on this blog...as long as they aren't just being abusive or spam. Yes, I get to decide what falls in those categories.

First, I want to agree with those of you who said that it was a thin argument to compare engineering disasters with software disasters. As one person put it "Nobody dies if your computer has to be rebooted". True enough. So in this article I will bring the discussion to a more balanced comparison.

It was pointed out that, like all engineering tasks, software engineering costs money. For most software systems, there is a cost/reliability trade-off. For example, NASA spent years developing the software that controls the flight of the space shuttle. Nobody on the shuttle software team wakes up one day with a cool new idea for rewriting the nozzle control modules and just throws it in there to see how well it works. I seriously doubt the air traffic control system is open to the latest AJAX tricks or cool new interfaces. These are the types of systems that CAN kill people if they don't work every time, and much energy and expense are put into getting them right. Even so, these systems do sometimes fail. And like other engineering disasters, they make the news. This very thing happened in 2003, when a software error helped bring down a major portion of the Northeast's power grid. It also happened when Britain's air traffic control system experienced a software glitch and grounded hundreds of flights while the system was restored.

These are not the systems I am interested in. I am interested in the everyday software that you, I, my bank teller, my insurance agent, and your doctor use. Not the life support systems, mind you. The record-keeping systems, the appointment schedulers, and the handheld computers.

I can hear the would-be lawyers and philosophers out there already raising an objection. "He is sidestepping the issue...he started on a premise and then changed it on us." OK. You got me. But I did it for a reason. Stay with me and I think you will agree.

Before I go on, I want to bring the argument into perspective. I agree that the risk to human life associated with a particular application of ANY engineered system, hardware or software, plays a strong role in determining how reliable it will be. Obviously a glitch in a tic-tac-toe game is not going to require the level of reliability that the software running a bullet train will. The same can be said about physical devices. Obviously the reliability of a lint roller is not going to be as high as the reliability of a suspension bridge. As far as I know, nobody has ever died as the direct result of having cat hair on their pants (although some have occasionally felt like they would when said cat hair was discovered during a job interview). I agree with this and I stand soundly reprimanded for my less-than-accurate comparisons of long ago...OK, yesterday.

Now, ladies and gentlemen, watch closely as I perform an impossible feat before your very eyes. I will, without any smoke or mirrors, transport myself from a place of ridicule and doubt to a platform of infallible truth...(I hope)...

At this point I think we all agree that not all software is created equal. I have never had the pleasure of working on a software project where quality was top priority but I know such projects exist. My question is (watch carefully): Why is some software really good while the majority of it is really, really bad? What do the developers who produce reliable software do differently than the ones who produce really flaky software?

It has been lovingly pointed out by many people responding to my last article that I am a dolt for comparing software to real-world engineering projects. Why? Because software does not have the limits of the physical world. The only limits are the limits of the developers' imagination. That and what the operating system is capable of supporting...and what the chosen language is capable of implementing...and what the hardware is capable of running. So it looks to me like there are actually limitations to what software can do. It is the practice of most development projects (especially in the gaming industry) to push those limits as hard as possible without breaking them. So although the limitations of a project are not necessarily physical, they do exist.

Before I dig much deeper (dig myself much deeper?), I would like to point out that, as boring as it is, I am really only talking about business software. It isn't that I have anything against gaming software or social networks or cool media players, it is just that I don't have any experience with those types of applications. My work has been almost exclusively in the realm of database driven business applications. Boring, I know. But it pays the bills. I am sure that what I am saying applies to many other types of development but I just haven't got the expertise to judge that. Actually, I expect that we business developers could learn a lot from the game developers out there.

So the question now is: Why doesn't most business software work very well? Now that we have lost all of the gamers, social networkers, and "software artisans", maybe we can make some headway.

Judging from many of the comments I received about the Software Engineering article, it is a pretty widely held belief that the main determining factor of software quality is the amount of money that a company is willing to throw at it. This, apparently, can be programmatically determined by considering how many people the product is likely to kill or maim. This seems like a sound, logical argument, but it isn't. I have used some software that is very reliable and is not likely to maim anyone. And if the amount of money thrown at a project dictates its reliability, why are some of the most stable applications available for free?

I believe that there is an answer to these questions and it is something that is right before our eyes, and has been since the 1980s...maybe earlier. I intend to prove in articles in the near future that what is missing in software, what is making us all hate our computers, is not the likelihood of consequences (kill count), is not the amount of money that is spent producing it, and is not the number or credentials of the developers creating it. It is much simpler than all of that. Stick around...I am pretty sure you will find it worth following along with....

M@

26.6.08

 

Software Engineering? Maybe not.

A plane crashes, killing all 287 people on board. A bridge collapses, plunging thirteen motorists to their watery graves. A construction crane buckles and crushes an apartment building, killing several workers and residents. We have all heard these stories on the news, and they are always followed by outrage, questions about how this could have happened, and extensive investigations into the cause. Even when the death toll is very low or nobody is actually killed, these engineering failures stir a deep anger in the average person, whose typical reaction is to shake their head and wonder who failed to do their job.

Have you ever stopped to wonder why engineering disasters are shocking? Why is it that people are outraged when New Orleans is flooded after Hurricane Katrina because of a failed levee, but there is no outrage as tens of thousands of homes, businesses, and lives are destroyed by the same storm all along the Mississippi coast? I think the answer is that there is a distinct difference between “acts of God” such as hurricanes and tornados and disasters that are the result of man-made failure.

Without really thinking about it, everyone who lives in a modern society routinely puts their lives in the hands of strangers. These are the people who design, build and maintain infrastructure such as bridges, tunnels, highways, traffic signals, railroad crossings, skyscrapers and even our houses. They are engineers and they carry a huge responsibility to each and every one of us who trust and rely on the fruits of their labors. Very few rational people expect a bridge to fall out from under their vehicle as they take the kids to the movies, or the theatre to collapse on them once they get there. As a society, we have become accustomed to, even complacent about, the idea of safe engineering. It never enters our minds as we walk into a large governmental building with huge marble plates overhead that gravity is doing its best to bring those 20,000-pound slabs of rock down on us. We trust that the engineer who designed it and the workers who built it undertook their jobs in a way that ensured gravity would never win, and we walk under marble slabs, steel beams, and concrete ceilings without even glancing up.

Is this some sort of collective insanity that has somehow woven its way into our society? If a perfectly sane person from the 1700s were transported through time to observe us in two-ton shells of metal and glass hurtling towards each other at seventy miles per hour, with nothing but a yellow line painted on the road to keep us from ramming each other while we talk on a phone and eat a cheeseburger, or sleeping as we hurtle through the air at 650 MPH a mere six miles from the ground, would they consider us insane? I believe they would. But we know something that they don't. These seemingly suicidal things that we do have proven time and again to be relatively safe. While it is true that not one person died in an auto accident or a passenger jet crash in the entire eighteenth century, it is also true that many died by falling off a horse.

The reason things that would seem completely insane to the uninitiated are met with a yawn in modern society is simple: engineers and inspectors are doing their jobs. The simple fact that so few major engineering failures happen is a testament to how well they are doing them. Only when they fail and a disaster occurs do we even give any thought to the tedious, methodical, and highly regulated work that they do.

When a civil engineer sets out to design a bridge, or an aerospace engineer sets out to design a new jet engine, they are not free to proceed in any willy-nilly way that they want to. There are very specific and very extensive regulations that must be followed. These regulations are the result of years and years (in many cases, decades) of careful study and analysis. A bridge engineer must know how much the steel and concrete will expand in mid-July as the temperature reaches 105 degrees and how much it will contract when it drops to -25 degrees in January. They must consider the strength of the materials being used, their flexibility, their reaction to heating up and cooling down, the long-term effects of weather, salt spray, and friction, and a thousand other parameters.

When a civil engineering project is completed, it must pass a series of reviews, tests, and inspections. When a new Navy ship is commissioned, it must pass a bewildering number of inspections, stress tests, and extended sea trials where the entire ship is pushed beyond its expected stress loads time and time again.

It is only through this extremely structured, regulated, monitored process that our society produces safe roads, passenger jets, Navy ships, automobiles, and power tools. Every time something doesn’t collapse, explode, melt or crash into something else, this is the process that we fail to thank for producing safe engineered products.

And then there is software.

Most universities now have a curriculum called "Software Engineering". This should not be confused with the usage of the same word in, for example, Electrical Engineering. Why? Because Electrical Engineering is a discipline and is taught as such. In order to receive a degree, Electrical Engineers must understand the physics of electricity, semiconductors, switches, logic circuits, and many, many other concepts. They must know the industry-standard methods of practicing their craft. These standards are published and maintained by recognized authorities and must be complied with for all electrical engineering tasks. These regulations and standards specify everything from the way wires are joined to the types of enclosures to be used in specific applications. They regulate how many devices can be powered by wires and breakers of a given size. They regulate how individual devices, circuits, and entire buildings are grounded to ensure that when a component fails, excess energy is not suddenly routed through someone with their finger on a control switch. The mind-boggling array of things which must be considered goes on and on. It takes a truly dedicated, detail-minded person to be a successful Electrical Engineering graduate.

And then there is Software Engineering. Ahh….the freedom. The software industry is not burdened by the boring, endless sedimentary layers of standard and convention. The ground never cools on a software development environment before the next wave is rushing over it, redefining what it means to create software every three or four years. The typical Software Engineering degree teaches the basics: language structure, logic, and some process (although very little), often in development languages that the corporate world abandoned decades earlier and that only a few government agencies still cling to, and it imparts almost no standards on the unsuspecting student. The few standards they are taught are not really applicable to the development environments used in a modern software project, but that is no matter, because it is a rare thing indeed for a computer science professor to even recognize the development languages that came to dominance in the past decade or two.

Despite all of this though, the fresh Software Engineering graduate swallows the pill and believes he has been prepared for the real world. After all, he IS an ENGINEER. And engineers are respected. They are smart and they have real-world know-how, and the piece of paper they so proudly frame and hang in their cubicle leaves no doubt that they are ready to conquer the world.

If you actually believed that Software Engineering teaches anything like engineering, you would expect the new engineer to start producing some highly standardized software at this point. You would be wrong. The first problem that most new software engineers run into is that there ARE no standards. Sure, there are some general guidelines. CamelCase or ALL_CAPS_UNDERSCORE? /****HEADERS FOR EACH PROCEDURE?***/ All required fields are RED while all optional ones are TEAL. Those sorts of standards, as misguided as they may be. But when it comes to how really important software is designed, assembled, tested, and regulated, they are likely to come up against blank stares from their more experienced colleagues. That is because (if they even think to ask about it) they will have run up against the dirty little software engineering secret: THERE ARE NO RULES.
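Before anyone objects, let me show you what those "standards" typically amount to. This is a made-up illustration in Python, not any real shop's style guide: a handful of cosmetic rules, and not one word about how the software is actually designed, assembled, tested, or regulated.

# CODING STANDARD (excerpt, illustrative only)
# 1. Constants are ALL_CAPS_UNDERSCORE; everything else is camelCase.
# 2. Every procedure gets a boxed header comment.
# 3. Required input fields are rendered RED; optional fields are TEAL.

REQUIRED_FIELD_COLOR = "RED"
OPTIONAL_FIELD_COLOR = "TEAL"

# *************************************************************
# * getCustomerBalance: returns the balance for a customer id *
# *************************************************************
def getCustomerBalance(customerId):
    # Nothing in the "standard" says how this gets designed, reviewed, or tested.
    return 0.0  # placeholder value for this illustration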

"Certainly that is an exaggeration", I can already hear the masses exclaiming. Well, lets pretend for a moment that I have not been designing and developing software for over thirteen years and we will just look at the signs. It is sound logic (although not 100% airtight) that when good engineering practices are followed, good results are achieved. This is precisely why so few bridges collapse and so few waffle irons electrocute unsuspecting half-awake moms. When standards exist, standardized results are obtained. This is working on the assumption that “exist” in this case means that they are not only defined, but followed. The inverse of this observation is that (most likely), a lack of good engineering practices results in poor, substandard results.

In 2000 and 2001, Ford Motor Company and Firestone had to recall the tires that came as standard equipment on the Ford Explorer due to an engineering problem. Although the tires passed all required safety tests and the Explorer did as well, the combination of the two turned out to be a disaster, contributing to the deaths of well over 100 motorists when the tires blew out at high speed. Since the Explorer's center of gravity is higher than that of many other vehicles, the blowouts resulted in a high percentage of rollovers, which are much more lethal than non-rollover crashes. This was an engineering failure, albeit a subtle one. As a result of this failure, the affected vehicles were recalled and the tires replaced.

In our discussion though, the focus of this example is not on the failure, but on the incredibly small number of failures in the industry as a whole. Of all of the millions of cars manufactured in 2000 and 2001 by dozens of manufacturers, this is the story that made the headlines. The reason it made headlines is precisely because it was unusual. How many headlines do you see praising all car manufacturers for not making cars that roll over a lot? As I have always said, routine things aren't news…they are expected.

Of the 75,000 or so Explorer owners, fewer than 200 experienced a blowout-related rollover. If you happen to be one of those people (as a young woman I knew at the time was), the fact that this represents a failure rate of well under 0.3% (roughly 1 in 375) is not of much consolation, but consider this statistic. If your computer software failed only as often as the recalled tires, once in every 375 hours or so of use, and you used it for three hours per day, you would go an average of four months between software errors. No blue screen, no "Memory Exception Error", no "There is no message for this error" for a third of a year!
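For anyone who wants to check the arithmetic, here is the back-of-the-envelope version in Python, a rough sketch using the round numbers above rather than precise accident data:

# Back-of-the-envelope comparison using the rough figures cited above.
owners = 75000            # approximate number of Explorer owners
rollovers = 200           # blowout-related rollovers (an upper bound)

failure_rate = rollovers / owners    # about 0.0027, i.e. under 0.3%
one_in = owners / rollovers          # roughly 1 in 375

# If software failed once per that many hours of use, at 3 hours per day:
hours_per_day = 3
days_between_errors = one_in / hours_per_day   # about 125 days, four months or so

print("failure rate: %.2f%% (about 1 in %d)" % (failure_rate * 100, one_in))
print("days between software errors: %d" % days_between_errors)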

If this number seems amazing to you, it shouldn’t. I have driven my truck for 220,000 miles with routine maintenance and have had very few major components fail. My brakes have always worked, it always cranks, the transmission always shifts. While this is an extreme example (Toyota Tacomas are very well engineered!), consider your own vehicle. How many miles seem “normal” before a vehicle develops major mechanical problems? If you drive a Saab or Peugeot, the answer may be pretty low (one week?), but if you drive a Honda, Toyota, or a late model American vehicle, you expect to be problem-free for at least the first three years or 50,000 miles. Apparently the manufacturer does too, because that is what the warranty typically covers.

Now consider your computer. When you took this wonderful, shiny, amazing machine out of its box and plugged it in, how long did the bliss of having a new computer last before you encountered your first software error? I can be quite certain it wasn't four months, or even a few hundred hours of use. It was probably more like thirty minutes or maybe (on the extreme end) a week. Based on the results we expect from cars, bridges, waffle irons, and hydroelectric dams, this is an extremely pathetic failure rate by any measure.

The ridiculous failure rate of software indicates that software engineering does not have the stringent standards, regulations, and quality control requirements that electrical, mechanical, and civil engineering take for granted. As a good book I once read says, "You shall know them by their fruits". The fruits of software engineering smell pretty rotten to me.

Over time I will offer an analysis of why software doesn't work very well. I am an insider, but I am a reformed one. I have spent a lot of my professional life wondering why software projects always erode into a mad scramble to get something (anything!) working in time to ship it out. I have been schooled in the evil ways of market-driven software development and I have been repeatedly disappointed. But in the process I have learned a few things, and the answer to why software doesn't work very well may surprise you. Stick around.

M@