John Shook: “A Technical Problem or a People Problem?”

John Shook dives into some of the messy issues of true root cause in his most recent post.

We touched on a similar issue here a few months ago. But it is always worth coming back around to people because in this system (actually in any system) there are always two issues with people.

  1. People are the most fallible part of the process.
  2. The process cannot operate without them.

The reflex is often to go into total denial about #1 and expect people to be vigilant and perfect every time. “Weed out the bad apples, and everything will be fine.” Of course that doesn’t work.

In John Shook’s example, he traced through Ohno’s classic “5 Why” example.

1. Why did the machine stop?
There was an overload and the fuse blew.
2. Why was there an overload?
The bearing was not sufficiently lubricated.
3. Why was it not lubricated sufficiently?
The lubrication pump was not working sufficiently.
4. Why was it not pumping sufficiently?
The shaft of the pump was worn and rattling.
5. Why was the shaft worn out?
There was no strainer attached, and metal scrap got in.

Then Shook asks a really interesting question:

Why was no strainer attached?

Why not indeed? Isn’t that somebody’s job?
And now, as he points out, we have transitioned from “technical” to “people.”

Maybe the standard work for the maintenance worker or machine operator didn’t go far enough. Or maybe the standard work did specify changing the strainer but the worker failed to observe the standard. How was the standard developed, how was it communicated and trained? How easy was it to “forget” to change the strainer?

Coming, as I do, from mostly “brownfield” environments, the existence of standard work in the first place isn’t something that we can take for granted.

Nevertheless, Shook is making a critical point here. It does not matter whether there was no standard work, or whether the standard work broke down for some reason that we do not yet know (another “Why?”). This is still a process problem. We must start with a working assumption that the team member cares, and is doing the very best job possible, given the expectations, the resources, and his understanding at the time.

I am aware of a couple of cases where engineering change implementation pulled up short of actually observing the new installation and looking for unforeseen problems. One of them was quite subtle, and actually took a few weeks to find the basic cause, much less the root cause.

Another resulted in a bolt snapping during final torque. Messy to fix, but better in the factory than in the field.

These are additional cases where technical problems resulted from process breakdown, and in both cases, it was a case of unverified or blindly held assumptions, and not following through with the customer process.

Shook concludes with two really important points, and I can’t agree more. First:

…the work design must also include the “human factors” considerations that make it possible to do the job the right way, and even difficult to do it the wrong way.

I like to say “Make the right way the easy way” if you want things done in a certain manner.

Which brings us to Shook’s final point: You have to look at the total package – the human and the technical as an integrated system. You can’t separate them because you can’t “take people out of the process.” All you can do is construct the process to give people’s minds the most opportunity to focus on improving the work rather than burdening them with making sure they get it right.

Always work to support people to do the right thing in the right way. If the organization carries a belief that it is necessary to force or “incentivize” people to do the right thing, then there is a people problem, but it isn’t with the workers.

Evidence of a Problem

In most references describing the process of good problem solving, the first real step is to explain what actually is the problem.

It is easy to get tripped up at this stage and describe the problem in terms of the desired target, or in terms of “lack of” a specific countermeasure. That, of course, skips over the whole point of gaining a deep understanding of the situation before moving too far into investigating causes.

I heard a great way to frame that first bit of description tonight from Richard.

“What is the evidence of a problem?”

That word, “evidence” does a much better job of conveying the point that this should be a description of the things that are observed, heard, felt, etc. rather than any kind of analysis.

In many cases a “big problem” is actually evidence of many small ones. “Too much inventory” is one of those. So is “defect rates” or any other aggregated measure. If you are running KPIs you probably know that, because they aggregate so much, they are often relatively insensitive until things are so far into the hole that it is a mess to untangle. Better to instrument your processes at much finer levels and get “evidence” in real time.
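To make the point about aggregation concrete, here is a minimal sketch in Python. The station names, the counts, and the 5% alert threshold are all invented for illustration; they are assumptions, not data from any real line. It only shows how a plant-level number can look fine while one station is clearly in trouble.

```python
# Hypothetical counts for one shift: (units built, defects found) per station.
# Station names and numbers are invented purely for illustration.
stations = {
    "station_1": (400, 2),
    "station_2": (400, 3),
    "station_3": (50, 5),
    "station_4": (400, 2),
}

# Assumed alert threshold; a real one would come from your own process.
THRESHOLD = 0.05


def defect_rate(built: int, defects: int) -> float:
    """Fraction of built units that had a defect (0.0 if nothing was built)."""
    return defects / built if built else 0.0


# The aggregated KPI: one number for the whole line.
total_built = sum(built for built, _ in stations.values())
total_defects = sum(defects for _, defects in stations.values())
aggregate = defect_rate(total_built, total_defects)

# The finer-grained "evidence": one number per station.
per_station = {name: defect_rate(b, d) for name, (b, d) in stations.items()}
flagged = [name for name, rate in per_station.items() if rate > THRESHOLD]

print(f"aggregate defect rate: {aggregate:.2%}")
for name, rate in per_station.items():
    print(f"  {name}: {rate:.2%}")
print("stations over threshold:", flagged)
```

With these made-up numbers the aggregate sits under 1%, well below the threshold, while station_3 is at 10%. The aggregate is not wrong, it is just numb.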

How Do You Look At Problems?

A couple of posts ago, I tried to emphasize “hypothesis testing” as the key, core thinking behind the TPS. For that matter, I think that anyone who truly understands any of the various improvement approaches out there will find the same thinking at the core. Certainly Six Sigma, Theory of Constraints, and TQM are all about surfacing and solving problems. They may use different language, might insert the initial lever between different bricks, but in the end, the approaches all embrace the same basic thinking.

I’d like to put out there an idea that it is the way problems are regarded and approached that separates “gets it” from “business as usual.”

What Constitutes “a problem?”

In “traditional thinking” a problem is something which disrupts output. It is something serious enough that it cannot be ignored.

In a true continuous improvement mindset, anything that causes variation from the plan, in any way, is “a problem.” Any barrier between the current condition and the idealized world is “a problem.”

What triggers a response?

In “traditional thinking” if output isn’t disrupted, spend time elsewhere. There is a caveat to this, however. The parable of the “boiling frog” (whether true for actual frogs or not) can drive an ever higher level of numbness as “normalized deviance” sets in.

Since continuous improvement is a process of discovering the ideal process, variation from the plan is new information. It must be investigated and understood. If everything is running smoothly, then the problem solving shifts to the next barrier to higher performance.

What triggers alarm in the organization?

This one may be the most controversial. While “stopped production” is certainly cause for alarm and immediate response, in the traditional thinking world, it is the only thing that really gets people’s attention.

In a thinking and learning organization, I would add to the above “No problems are apparent.” If there are no andons, there are no defects, there are no line stops, there are no shortages, there are no disruptions, then there is a BIG problem. I say that because these conditions are impossible and it is only because your system is totally numb that you would not see them.

Target Condition

Given the above, then I think it is safe to offer that silence is equated with “stability” in the traditionally reacting organization. Of course it isn’t stable at all, it is just that there is so much systemic anesthesia that nobody feels anything.

In the continuous improvement mindset, things are running as they should if there is a continuous flow of problems being surfaced and solved. That is the only way to be 100% certain that things are getting better every day.

“Management Commitment”

The term “management commitment” is tossed around as a prime reason for failure of improvement initiatives. There are lots of good reasons for this, but until we really define exactly what leaders need to do every day, stop using euphemisms, and start getting real about leadership’s actual role in this process, we are crutching the problem. This is partly “our fault” because we teach the basics very badly. We put top leaders into “kaizen events” but never explicitly link kaizen to daily problem solving. In doing so, we convince them that if only they support enough kaizen events, the organization will be transformed. The logical result is a monthly report on how many kaizen events have been run. Argh.

If we used kaizen events to explicitly teach the core questions, the rules of good process design, and the concept of applying PDCA to everything, we might get more traction. That can be difficult, but maybe if everyone in the industry starts thinking in terms of a few core mantras we might get a chorus going.

A3 – A Process, Not A Form

Kris Hallan is a frequent contributor on the LEI forum.
In this post (Lean Forums – A3), he outlines some great experiences with trying to implement the “A3 process” in his organization.

One thing that really drove home what goes wrong most of the time with the A3 process, and frankly, with most well-intentioned efforts to bring good analysis into organizations, was his experience of an early effort to try to “require” it without having the behaviors to back it up:

One of the worst things you can do is require an A3 be written and then allow a poor A3 to get past you. This has a tendency to happen when you put an A3 mandate on something that you don’t necessarily have control over. For instance, we started by requiring all CAPEX [capital expenditure – ed] projects to be proposed using the A3 format (hoping that the A3 thought process would come with the format).

What we got was a lot of projects summarized on A3s and virtually no feedback to go back and improve anything. No one learned anything from the process, no hansei occurred, and nemawashi was non-existent. It became a box that everyone had to check. This can have a very detrimental effect on people’s attitude toward A3. Since they don’t take it seriously, they can’t really learn anything from it. I would say that this actually moved us backwards in our understanding of problem solving.

I could not agree more. I have seen this in other companies. This is PLAN-DO without the CHECK and ACTion. Set an expectation, go through the motions of compliance, but don’t ever bother to see if it is actually working the way that is expected.

The good news, further into his post, is that Kris’s organization figured it out and found that doing it thoroughly is more important (and quicker!) than doing it fast.

A Morning Market

In past posts, I have referred to an organization that implemented a “morning market” as a way to manage their problem solving efforts.

Synchronicity being what it is:

Barb, the driving force in the organization in my original story, wrote to tell me that their morning market is going strong – and remains the centerpiece of their problem solving culture. In 2008, she reports, their morning market drove close to 2000 non-trivial problems to ground. That is about 10/day. How well do you do?

Edited to add in March 2015: I heard from Barb again. It is still a core part of the organization’s culture.

So – to organizations trying to implement “a problem solving culture” and anyone else who is interested, I am going to get into some of the nuts and bolts of what we did there way back in 2003. (I am telling you when so that it sinks in that this is a change that has lasted and fundamentally altered the culture there.)

What is “A Morning Market?”

The term comes from Masaaki Imai’s book Gemba Kaizen pages 114 – 118. It is a short section, and does not give a lot of detail. The idea is to review defects “first thing in the morning when they are fresh” – thus the analogy to the early morning fish and produce markets. (For those who like Japanese jargon, the term is asaichi, but in general, with an English speaking audience, I prefer to use English terms.)

The concept is to display the actual defects, classified by what is known about them.

  • ‘A’ problems: The cause is known. Countermeasures can be implemented immediately.
  • ‘B’ problems: The cause is known, countermeasures are not known.
  • ‘C’ problems: Cause is unknown.
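The three categories boil down to two yes/no questions. Here is a minimal sketch in Python; the names `Category` and `classify` are my own, invented for illustration, and are not from Imai's book.

```python
from enum import Enum


class Category(Enum):
    A = "cause known, countermeasure can be implemented immediately"
    B = "cause known, countermeasure not known"
    C = "cause unknown"


def classify(cause_known: bool, countermeasure_known: bool) -> Category:
    """Sort a defect by what is currently known about it."""
    if not cause_known:
        # Without a cause, any "countermeasure" is only a guess.
        return Category.C
    return Category.A if countermeasure_known else Category.B


# A problem can be reclassified as the team learns more about it.
today = classify(cause_known=False, countermeasure_known=False)
after_investigation = classify(cause_known=True, countermeasure_known=False)
print(today.name, "->", after_investigation.name)
```

The design choice worth noting is that the category describes the state of knowledge, not the severity of the defect.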

Each morning the new defects are touched, felt, understood. (The actual objects). Then the team organizes to solve the problem. The plant manager should visit all of the morning markets so he can keep tabs on the kinds of problems they are seeing.

Simple, eh?

Putting It Into Practice

Fortunately we were not the pioneers within the company. That honor goes to another part of the company that was more than willing to share what they had learned. Their key lesson was “put two pieces of tape on a table, divide it into thirds, label them A, B and C and just try it.”

They were right – as always, it is impossible to design a perfect process, but it is possible to discover one. Some key points:

  • There is a meeting, and it is called “the morning market” but the meeting does not get the problems solved. The difference between the organizations that made this work and the ones that didn’t was clear: To make it work it is vital to carve out dedicated time for the problem solvers to work on solving problems. It can’t be a “when you get around to it” thing, it must be purposeful, organized work.
  • The meeting is not a place to work on solving problems. There is a huge temptation to discuss details, ask questions, try to describe problems, make and take suggestions about what it might be, or what might be tried. It took draconian facilitation to keep this from happening.
  • The purpose of the meeting is to quickly review the status of what is being worked on, quickly review new problems that have come up, and quickly manage who is working on what for the next 24 hours. That’s it.
  • The morning market must be an integral part of an escalation process. The purpose is to work on real problems that have actually happened. Work on them as they come up.

Just Getting Started

This was all done in the background of trying to implement a moving assembly line. There is a long back story there, but suffice it to say that the idea of a “line stop” was just coming into play. Everyone knew the principle of stopping the line for problems, but there was no real experience with it. As the line was being developed, there were lots of stops just to determine the work sequence and timing. But now it was in production.

The first point of confusion was the duration of a line stop. Some were under the impression that the line would remain stopped until the root cause of the problem was understood. “No, the line remains stopped until the problem can be contained,” meaning that safe operations that assure quality are in place.

The escalation process evolved, and for the first time in a long time, manufacturing engineers started getting involved in manufacturing.

The actual morning market meeting revolved around a whiteboard. At least at first. When they started, they filled the board with problems in a day or two. They called and said they wanted to start a computer database to track the problems. I told them “get another board.” In a few days that board, too, filled up. It really made sense now to start a computer database. Nope. “Get another board.” That board started to fill.

Then something interesting happened. They started clearing problems.

PDCA – Refining The Process

The tracking board evolved a little bit over time.

The first change was to add a discrete column that called out what immediate measures had been taken to contain the problem and allow safe, quality production to resume. This was important for two reasons.

First, it forced the team to distinguish between the temporary stop-gap measure that was put in immediately and the true root-cause / countermeasure that could, and would, come later. Previously the culture had been that once this initial action was taken, things were good. We deliberately called these “containments” and reserved the word “countermeasure” for the thing that actually addressed root cause. This was just to avoid confusion; there is no dogma about it.

Second, it reminded the team of what (probably wasteful) activity they should be able to remove from the process when / if the countermeasure actually works. This helped keep these temporary fixes from growing roots.

You can get an idea of what a typical problem board looked like here:

Morning Market White Board

The columns were:

  • Date (initial date the problem was encountered)
  • Owner
  • Model (the product)
  • Part Number (what part was involved)
  • Description (of the part)
  • Problem (description of the problem)
  • A/B/C (which of the above categories the problem is in right now; note that it can change as more is learned)
  • Containment Method
  • Root Cause (filled in when learned)
  • Countermeasure (best known being tried right now)
  • Due Date (when next action is due / reported)
  • Verified (how was the countermeasure verified as effective?)

A key point is that last column: Verified. The problem stays on the board until there is a verified countermeasure in place. That means they actually tested the countermeasure to make sure it worked. This is, for a lot of organizations, a big, big change. All too many take some action and “call it good.” Fire and forget. This little thing started shifting the culture of the organization toward checking things to make sure they did what they were thought to do.
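For anyone who thinks better in data structures, here is a rough sketch of one board row in Python. The real thing was a whiteboard, not software, so the class name, field names, and the example row are all invented for illustration. The only rule it encodes is the one above: a row stays on the board until verification is recorded.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProblemRow:
    """One row of the tracking board; fields mirror the columns listed above."""
    opened: date                  # initial date the problem was encountered
    owner: str
    model: str                    # the product
    part_number: str
    part_description: str
    problem: str
    category: str                 # "A", "B" or "C"; can change as more is learned
    containment: str = ""         # stop-gap that allowed production to resume
    root_cause: str = ""          # filled in when learned
    countermeasure: str = ""      # best known, being tried right now
    due: Optional[date] = None    # when the next action is due / reported
    verified: str = ""            # how the countermeasure was shown to work

    def stays_on_board(self) -> bool:
        # Taking action is not enough; the row leaves only once
        # evidence of verification has been written down.
        return not self.verified.strip()


def open_rows(board: List[ProblemRow]) -> List[ProblemRow]:
    """Rows that still lack a verified countermeasure."""
    return [row for row in board if row.stays_on_board()]


# Invented example row, for illustration only.
row = ProblemRow(
    opened=date(2003, 6, 2),
    owner="J. Doe",
    model="Model X",
    part_number="PN-0000",
    part_description="mounting bracket",
    problem="bolt holes do not line up with frame",
    category="C",
    containment="100% check of brackets before they go to the line",
)
board = [row]

# Learning the cause and trying a countermeasure does NOT clear the row.
row.category = "A"
row.root_cause = "supplier drill fixture worn"
row.countermeasure = "supplier replaced fixture"
still_open = len(open_rows(board))

# Only a recorded verification clears it.
row.verified = "next three lots measured, all holes within tolerance"
after_verification = len(open_rows(board))

print(still_open, after_verification)
```

Note that the containment stays in the record even after the row is cleared, which is the reminder of what can now be removed from the process.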

Refining The Meeting

While this particular shop floor was not excessively loud, it was too loud for an effective meeting of more than 3-4 people. Rather than moving the meeting to a conference room, the team spent $50 at Wally World and bought a karaoke machine. This provided a nice, inexpensive P.A. system. It added the benefit that the microphone became a “talking stick” – it forced people to pay attention to one person at a time.

Developing Capability

The other gap in the process that emerged pretty quickly was the capability of the organization to solve problems. While there had been a Six Sigma program in place for quite a while, most of the skill revolved around the kinds of problems that would classify as “black belt projects.” The basic troubleshooting and physical investigation skills were lacking.

After exploring a lot of options, the organization’s countermeasure was to adopt a standard packaged training program, give it to the people involved in working the problems, then expect them to start using the method immediately. This, again, was a big change from most organizations’ approach to training as “interesting.” In this case, the method was not only taught, it was adopted as a standard. That was a big help. A key lesson learned was that, rather than debate which “method” was better, just pick one and go. In the end, they are all pretty much the same, only the vocabulary is different.

Spreading The Concept

In this organization, the two biggest “hitters” every day were supplier part quality and supplier part shortages. This was pretty much a final-assembly and test only operation, so they were pretty vulnerable to supplier issues. This process drove a systematic approach to understand why the received part was defective vs. just replacing it. Eventually they started taking some of their quality assurance tools upstream and teaching them to key suppliers. Questions were asked such as “How can we verify this is a good part before the supplier ships it?” They also started acknowledging design and supplier capability (vs. just price and capacity) issues.

On the materials side, the supply chain people started their own morning market to work on the causes of shortages. I have covered their story here.

As they implemented their kanban system, morning markets sprang up in the warehouse to address their process breakdowns. Another one addressed the pick-and-delivery process that got kits to the line. Lost cards were addressed. Rather than just update a pick cart, there was interest in why they got it wrong in the first place, which ended up addressing bill-of-material issues, which, in turn, made the record more robust.

Managing The Priorities

Of course, at some point, the number of problems encountered can overwhelm the problem solvers. The next evolution replaced the white board with “problem solving strips.” These were strips of paper, a few inches high, the width of the white board, with the same columns on them (plus a little additional administrative information). This format let the team move problems around on the wall, group them, categorize them, and manage them better. Related issues could be grouped together and worked together. Supplier and internal workmanship issues could be grouped on the wall, making a good visual indication of where the issues were.

"Problem Strips" at the meeting.

But those were all side effects. The key was managing the workload.

Any organization has a limited capacity to work on stuff. The previous method of assignment had been that every problem was assigned to someone on the first day it was reviewed. It became clear pretty quickly that the half a dozen people actually doing the work were getting a little sick of being chided for not making any progress on problems 3, 4 and 5 because they were working on 1 and 2. In effect, the organization was leaving the prioritization to the problem solvers, and then second guessing their choices. This is not respectful of people.

The countermeasure, developed by the shop floor production manager, was to put the problems on the strips discussed above. The reason she did it was to be able to maintain an “unassigned” queue.

Any problem which was not being worked on was in the unassigned queue on the wall. All of the problems were captured, all were visible to everyone, but they recognized they couldn’t work on everything at once. As a technical person became available, he would pull the next problem from the queue.

There were two great things about this. First, the production manager could reshuffle the queue anytime she wanted. Thus the next one in line was always the one that she felt was the most important. This could be discussed, but ultimately it was her decision. Second is that the queue became a visual indicator that compared the rate of discovering problems (into the queue) with the rate of solving problems (out of the queue). This was a great “Check” on the capacity and capability of the organization vs. what they needed to do.
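The mechanics of the unassigned queue are simple enough to sketch in a few lines of Python. Again, the real queue was paper strips on a wall; the class and method names here are hypothetical, and the problem descriptions are invented. The sketch shows the three behaviors described above: everything is captured, the production manager can reorder at will, and the problem solver pulls the next item only when free.

```python
from collections import deque
from typing import List, Optional


class UnassignedQueue:
    """Problems that are captured and visible but not yet being worked."""

    def __init__(self) -> None:
        self._queue = deque()
        self.discovered = 0   # count of problems put into the queue
        self.pulled = 0       # count of problems pulled out to be worked

    def add(self, problem: str) -> None:
        """Capture a newly discovered problem at the back of the queue."""
        self._queue.append(problem)
        self.discovered += 1

    def reshuffle(self, new_order: List[str]) -> None:
        """The production manager's call: reorder, but never drop or add."""
        if sorted(new_order) != sorted(self._queue):
            raise ValueError("reshuffle must keep exactly the same problems")
        self._queue = deque(new_order)

    def pull_next(self) -> Optional[str]:
        """A problem solver who has become available takes the top item."""
        if not self._queue:
            return None
        self.pulled += 1
        return self._queue.popleft()

    def backlog(self) -> int:
        """Visual 'Check': discovery rate versus pull rate shows up here."""
        return len(self._queue)


# Invented problems, for illustration only.
q = UnassignedQueue()
for p in ["cracked housing", "missing washer", "label misprint"]:
    q.add(p)

# The production manager decides the washer matters most right now.
q.reshuffle(["missing washer", "cracked housing", "label misprint"])

next_problem = q.pull_next()
print(next_problem, "| backlog:", q.backlog())
```

A backlog that keeps growing says that problems are being discovered faster than the organization can take them on, which is a capacity question for leadership rather than a reason to chide the problem solvers.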

There were two, and only two, valid reasons for a problem to bypass this process.

  1. There was a safety issue.
  2. A defect had escaped the factory and resulted in a customer complaint.

In these cases, someone would be assigned to work on it right away. The problem that had been “theirs” was “parked.” This acknowledged the priority, rather than just giving him something else to do and expecting everything else to get done too. (This is respect for people.)

Incorporation of Other Tools

Later on, a quality inspection standard was adopted. Rather than making this something new, it was incorporated into the problem solving process itself. When a defect was found, the first step was to assess the process against the standard for the robustness of countermeasures. Not surprisingly, there was always a pretty significant gap between the level of countermeasures mandated by the standard and what was actually in place. The countermeasure was to bring the process up to the standard.

The standard itself classified a potential defect based on its possible consequences. For each of four levels, it specified how robust countermeasures should be for preventing error, detecting defects, checking the process, secondary checks and overall process review. Because it called out, not only technical countermeasures, but leadership standard work, this process began driving other thinking into the organization.

Effect On Designs

As you might imagine, there were a fair number of issues that traced back to the design itself. While it may have been necessary to live with some of these, there was an active product development cycle ongoing for new models. Some of the design issues managed to get addressed in subsequent designs, making them easier to “get right” in manufacturing.

What Was Left Out

The things that got onto the board generally required a technical professional to work them. These were not trivial problems. In fact, at first, they didn’t even bother with anything that stopped the line for less than 10 minutes (meaning they could rework / repair the problem and ship a good unit). But even though they turned this threshold down over time, there were hundreds of little things that didn’t get on the radar. And they shouldn’t… at least not onto this radar.

The morning market should address the things that are outside the scope of the shop floor work teams to address.

Another organization I know addressed these small problems really well with their organized and directed daily kaizen activity. Every day they captured everything that delayed the work. Five second stoppages were getting on to their radar. Time was dedicated every day at the end of the shift for the Team Members to work on the little things that tripped them up. They had support and resources – leadership that helped them, a work area to try out ideas, tools and materials to make all of the little gadgets that helped them make things better. They didn’t waste their time painting the floor, making things pretty, etc. unless that had been a source of confusion or other cause of delay. Although the engineers did work on problems as well, they did not have the work structure described above.

I would love to see the effect in an organization that does all of this at once.

No A3’s?

With the “A3” as all the rage today, I am sure someone is asking this question while reading this. No. We knew about A3’s, but the “problem solving strips” served about 75% of the purpose. Not everything, but they worked. Would a more formal A3 documentation have worked better? Not sure. This isn’t dogma. It is about applying sound, well thought out methodology, then checking to see if it is working as expected.


Is all of this stuff in place today? Honestly? I don’t know. [Update: As of the end of 2011, this process is still going strong and is strongly embedded as “the way we do things” in their culture.]  And it was far from as perfect as I have described it. BUT organized problem solving made a huge difference in their performance, both tangible and intangible. In spite of huge pressure to source to low-labor areas, they are still in business. When I read “Chasing The Rabbit” I have to say that, in this case, they were almost there.

And finally, an epilogue:

This organization had a sister organization just across an alley – literally a 3 minute walk away. The sister organization was a poster-child for a “management by measurement” culture. The person in charge sincerely believed that, if only he could incorporate the right measurements into his managers’ performance reviews, they would work together and do the right things. You can guess the result, but might not guess that this management team described themselves as “dysfunctional.” They tried to put in a “morning market” (as it was actually mandated to have one – something else that doesn’t work, by the way). There were some differences.

In the one that worked, top leaders showed up. They expected functional leaders to show up. The people solving the problems showed up. The meeting was facilitated by the assembly manager or the operations manager. After the meeting people stayed on the shop floor and worked on problems. Calendars were blocked out (which worked because this was a calendar driven culture) for shop floor problem solving. Over time the manufacturing engineers got to know the assemblers pretty well.

In the one that didn’t work, the meeting was conducted by a quality department staffer. The manufacturing engineers had other priorities because “they weren’t being measured on solving problems.” After the meeting, everyone went back to their desks and resumed what they had been doing.

There were a lot of other issues as well, but the bottom line is that “problem solving” took hold as “the way we do things” in one organization, and was regarded as yet another task in the other.

About 8 months into this, as they were trying, yet again, to get a kanban going, a group of supervisors came across the alley to see what their neighbors were doing. What they saw was not only the mechanics of moving cards and parts, but the process of managing problems. And the result of managing problems was that they saw problems as pointing them to where they needed to gain more understanding rather than problems as excuses. One of the supervisors later came to me, visibly shaken, with the quote “I now realize that these people work together in a fundamentally different way.” And that, in the end, was the result.

In the end? The organization in this story is still in business, still manufacturing things in a “high cost labor” market. The other one was closed down and outsourced in 2005.

Chasing the Rabbit – Steven Spear

Steven Spear has been on the cutting edge of research about what makes exceptional organizations exceptional for over 10 years. The landmark paper “Decoding the DNA of the Toyota Production System” summarized his PhD research on Toyota. His dissertation is the 5th most popular publication on ProQuest, the online source for academic work.

His recent book, “Chasing the Rabbit” brings his work from academic publications such as “The Harvard Business Review” into the mainstream business press.

One of the differences between Steven Spear and most other experts on the TPS is that Spear is, first and foremost, a student of theory. He does not make casual observation and render an opinion. Rather, he develops a formal theory then seeks to continuously test it against facts.

During his years of research into high-performing organizations, Spear has found consistent examples where – all conditions being equal; where products are a commodity; where the playing field is level – one organization in the field consistently outperforms the others over time. Spear builds on a metaphor of the front-runner in the Boston Marathon seemingly coasting to victory while the pack following him struggles, fights and elbows for the honor of coming in second. He calls these high-performing or high-velocity organizations the “rabbits.”

Spear’s theory of what differentiates the “rabbit” organizations is, ironically, that they consistently apply theory and theory testing into their management systems. I suppose it took someone who was well trained in the “theory of theory” to really see that. He has extended the application into a general set of principles that he has found across all organizations which excel. Further, he has found that mediocre operations which begin to apply the principles can rapidly and dramatically improve their performance.

He makes his case through example after example of organizations outperforming their peers and comps.

I am not going to repeat those stories here – you can click on the link above, buy the book, and read them for yourself. Rather, over a series of posts, I am going to go through some of the points in the book that struck me, share my mental notes, cite examples where I believe his theory holds, and hopefully spark some discussion.

A3 by PowerPoint

“Aarrgh!” all of the purists say. Death by PowerPoint. Yup.

But one of today’s realities is that many managers expect to be “briefed” and expect it to be done in a conference room with a projector and… PowerPoint.

Getting them to sit down and go through a single sheet of A3 paper is going to be a stretch at best. So let me propose an interim.

Five slides, six at the most.

No fancy headings, logos, etc. They take up space and distract from the message.

Simple text. No animation. Pictures, graphs to make the points.

The slides are:

Background / Current Condition

Briefly cover where we are, and why we are talking about this right now.

Back up your assertions with data and facts. Note that, in my context, a “fact” is something you can see, observe, sense, touch. The data must be explained by the facts.


Target

What is this going to look like when we are successful?

The target is binary. It is verifiable as “met” or “not met.” It does not include vague words like “improved” or “reduced” which are subject to interpretation.


Gap

What is keeping us from hitting the target right now? What is in the way? What must be solved, what barrier must be cleared, what factor must be eliminated?

Clearly demonstrate that dealing with these issues will allow reaching the target.

Countermeasures / Implementation

What actions will be taken to deal with the issues or shortcomings?

When will they be taken?

Who will take them?

When will they be checked for successful implementation?

For each one, what is the predicted effect if it works as planned?

How will you check the actual effect?

Do the cumulative predicted effects of your countermeasures add up to enough to close the gap and reach the target?

If not, then what else are you going to do?
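The "do the predicted effects add up?" check is simple arithmetic, and it can help to see it written out. Here is a minimal sketch, assuming a single metric where lower is better. The `Countermeasure` class, the `remaining_gap` function, and every number below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Countermeasure:
    action: str
    owner: str
    predicted_effect: float  # predicted reduction in the metric, in the metric's own units


def remaining_gap(current: float, target: float,
                  countermeasures: list[Countermeasure]) -> float:
    """Return how much of the gap is left if every prediction holds.

    Zero or less means the plan, as predicted, is enough to reach the target.
    """
    gap = current - target
    predicted_total = sum(c.predicted_effect for c in countermeasures)
    return gap - predicted_total


# Hypothetical numbers: unplanned machine stops per week, lower is better.
plan = [
    Countermeasure("Fit strainer to pump intake", "Maintenance", 6.0),
    Countermeasure("Weekly lubrication check", "Operator", 3.0),
]
left = remaining_gap(current=20.0, target=8.0, countermeasures=plan)
print(left)  # 3.0
```

A positive result, as here, is the signal that the plan is not yet enough and something else has to be added before anyone leaves the room.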

Results / Follow-Up

What actually happened?

When things got off track, what is the recovery / correction plan?

When actual results were different than planned, what else are you going to do?

Did you reach the target? If not, what else are you going to do?

It’s the thinking, not the format!

Do the headers change sometimes? Sure, but the intent is:

What is happening?

What do you want to happen?

What is the gap?

What will you do to get it there, and how will you check that:

  • You did what you planned.
  • It worked like you expected.

Do it.

Check it.

Fix it.



This is a leader’s tool

If it is done well, and done correctly, it is done the way John Shook describes it in his new book Managing to Learn. But don’t confuse the size of the paper with the structure of the thinking. Get that right. Worry about the sheet of paper later if you must.

When encountering resistance, a good teacher knows what things can be left for later, and which ones are critical to get right.

Not Just Asking Why? – Five Investigations

I have touched on this issue in the past, but with the recent publication of John Shook’s new book Managing to Learn, I felt the need to go into it again.

In the text, Shook’s coverage of root cause investigation is very thorough. He tells the story of each “Why?” question triggering another round of investigation.

But in a full page sidebar, he uses the example from Taiichi Ohno’s classic book Toyota Production System: Beyond Large-Scale Production. For those of you following at home, the original example is on page 17 of Ohno’s book, and the reference to it is on page 47 of Managing to Learn.

Quoting from the books:

  1. Why did the machine stop?
    There was an overload and the fuse blew.
  2. Why was there an overload?
    The bearing was not sufficiently lubricated.
  3. Why was it not lubricated sufficiently?
    The lubrication pump was not working sufficiently.
  4. Why was it not pumping sufficiently?
    The shaft of the pump was worn and rattling.
  5. Why was the shaft worn out?
    There was no strainer attached, and metal scrap got in.

The conclusion is that the lack of a strainer is the root cause of the machine stoppage. (“For the want of a nail…”)

This line of thinking is all well and good after the chain is understood. Unfortunately it gives the impression that the root cause of a problem can be reached simply by repeatedly asking “Why?” and writing down the answers. I know this because I have personally experienced well-meaning-but-ignorant consultants who have done exactly that on a flip chart with a team trying to solve a problem.

I have heard “Just ask why five times” as a method, proposed in contrast to more rigorous methods.

It ain’t that simple, folks.

Let’s look at this example.

Why did the machine stop? What I know right now is that the machine isn’t running. Although I can get to the “blown fuse” fairly quickly, let’s not confuse the first or second thing I would check with a process of systematically eliminating other possibilities. The simple fact is that I would check the fuse fairly quickly because I can’t check everything at once, and because I am going to check things more-or-less in order of simplicity. But I am systematically ruling out loss of power at the feed, a physical problem (such as a broken connection), a problem in the control circuitry, and a host of other possible issues. In short, I must investigate a “loss of electrical power” until I reach the conclusion that it is a blown fuse.

Ohno skips a bit by going directly to the cause of the blown fuse as an overload, but it is going to take a little more investigation to get to that conclusion. Coming forward a few decades from when that book was written, I would probably reset the breaker and see if it trips again. But even then, I haven’t ruled out a bad fuse / breaker. Determining, for sure, that it is an overload condition is going to take a little more troubleshooting. A multi-meter would be much more useful than a flip chart at this point.

Once I am pretty certain I am dealing with an overload condition, then I can ask what is causing it.

Why was there an overload? Well, lots of things can cause an overload. Something is putting drag on this notional motor. Maybe it was a bearing problem in the motor. Maybe a bearing elsewhere. Maybe a gear has locked up. Is this even an overloaded motor? Or is it an overloaded circuit? Eventually, after systematically checking and testing, I find the bad bearing. Now – Why did the bearing fail? How do I know it is lack of lubrication? Hopefully it is obvious, but there may be some other things I need to look at. Is there a flow of lubricant into the bearing? If that is normal, I need to look elsewhere. But there is not a normal flow of lubricant, so for now I can reasonably assume that lubrication is the problem.

Why was it not lubricated sufficiently? If the lubricant is not reaching the bearing, Why is there insufficient lubricant flow?

Is the sump dry? Is the intake clear? Is the line kinked, clogged or leaking? Is it clear? How do I know? As I work my way upstream, physically checking, I’ll eventually reach the pump that is complaining.

Why was it not pumping sufficiently? At this point, I am probably replacing the pump. But why the pump failed in the first place is a reasonable question to be asking. And only upon physical examination of the old pump am I going to find the worn and rattling shaft. But I am curious, so I look rather than just scrapping the pump and replacing it.

Why was the shaft worn out? Because the scope of investigation is narrowing, things get a little easier. Taking the old pump apart is going to reveal that the shaft is bound up with metal scrap. That takes me through a few more “Why?” questions – how could this get in here? And that is the point where I see no strainer on the intake.

Now, obviously, I made all of this up. But here is my point:

We, the teachers of others, do our students a major disservice when we over-simplify things. “Ask why five times” is very easy for people to take out of context and try to apply literally. Unless the problem is very simple, it just doesn’t work, and that leaves them:

  • Frustrated.
  • (Correctly) believing that anyone who thinks the real world is this simple has never had to deal with it.

Ohno certainly dealt in the real world. He also uses metaphors. We should caution ourselves not to take everything as 100% literal. Ohno’s point is summed up in the last paragraph of this section of his book on page 18 when he concludes:

In a production plant operation, data are highly regarded — but I consider facts to be even more important. When a problem arises, if our search for the cause is not thorough, the actions taken can be out of focus. That is why we repeatedly ask why. This is the scientific basis of the Toyota system.  [emphasis added]

The scientific method generates understanding through repeated hypothesis testing. A scientist asks “Why?” then fits a possible answer to the facts as he understands them and then asks “What else would be true if I am right?” and builds an experiment (or investigates) to verify, or refute, his thinking. This is how to ask “Why” and this is what you should do five times.
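For those who think in code, one round of “Why?” can be sketched as a small loop: list the possible causes, check the cheapest first, and accept only the one the facts confirm. This is an illustration under assumptions, not anyone's official method. The `Hypothesis` class, the `investigate` function, the effort scores, and the `facts` dictionary are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    cause: str
    effort: int                # rough cost of checking; lower gets checked first
    check: Callable[[], bool]  # go and look: True only if the facts confirm it


def investigate(hypotheses: list[Hypothesis]) -> Optional[str]:
    """Test candidate causes in order of effort; return the first one confirmed."""
    for h in sorted(hypotheses, key=lambda h: h.effort):
        if h.check():
            return h.cause
    return None  # nothing confirmed: gather more facts and form new hypotheses


# Hypothetical observations standing in for what you would find at the machine.
facts = {"fuse_blown": True, "feed_power_lost": False}

# One round of "Why did the machine stop?"
why_1 = [
    Hypothesis("control circuit fault", 3, lambda: False),
    Hypothesis("loss of power at the feed", 2, lambda: facts["feed_power_lost"]),
    Hypothesis("blown fuse", 1, lambda: facts["fuse_blown"]),
]
print(investigate(why_1))  # blown fuse
```

The confirmed cause then becomes the symptom for the next round, with its own list of hypotheses and its own checks. A `None` result is the honest outcome when nothing on the flip chart survives contact with the machine.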