In my last post I started to look at the difficulties of handling data in Industry 4.0, especially the complexity and the often underestimated problem of merging data from different sources or machines. This second post of the two-post series finishes the topic and looks at the equally important and equally underestimated task of cleaning up the data.
Cleaning Up the Data
Once you have gathered all your data from different sources, you need to clean up the data. For example, if you collect data on machine stops, one machine may define a stop as a turned-off machine, another may define a stop as any time not producing, and a third may define a stop as merely waiting for parts. If you merge data from different machines, such different definitions can make or break any analysis. Usually they are a lot of work to sort out and clean up. I have the feeling that some non-data people believe an analysis is merely putting an Excel formula over the data, but in my experience this type of cleanup can take much, MUCH more time than the actual analysis.
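To make the machine stop example concrete, here is a minimal Python sketch of one possible cleanup step: translating each machine's own state codes into a shared vocabulary before applying a single stop definition. The machine names, the raw state codes, and the rule that a stop is any known state other than producing are all assumptions made for illustration, not something taken from a real plant.

```python
# Hypothetical raw state codes per machine; real machines will use their own.
COMMON_STATE = {
    ("machine_a", "ON"): "producing",
    ("machine_a", "OFF"): "off",
    ("machine_b", "PRODUCING"): "producing",
    ("machine_b", "NOT_PRODUCING"): "idle",
    ("machine_c", "RUN"): "producing",
    ("machine_c", "WAIT_PARTS"): "waiting_for_parts",
}


def to_common_state(machine, raw_state):
    """Translate a machine-specific state into the shared vocabulary."""
    # Unmapped codes are flagged instead of guessed, so someone has to look at them.
    return COMMON_STATE.get((machine, raw_state), "unknown")


def is_stop(common_state):
    """Assumed shared definition: a stop is any known state that is not producing."""
    return common_state in {"off", "idle", "waiting_for_parts"}


records = [
    ("machine_a", "OFF"),
    ("machine_b", "PRODUCING"),
    ("machine_c", "WAIT_PARTS"),
    ("machine_c", "SETUP"),  # not in the mapping, so it becomes "unknown"
]

common = [to_common_state(machine, state) for machine, state in records]
stops = [is_stop(state) for state in common]
```

The code is the easy part. Filling in the mapping table correctly for every machine, and agreeing on what a stop actually is, is where the time goes.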
Maybe I am an optimist, but I assume all data to have a meaning. Unfortunately, the meaning is by no means always clear, and it is not always the meaning you actually need. Having data does not always mean having useful data. Often, a lot of work has to be done to turn a random collection of data points into a useful, coherent data set.
And this does not even include the possibility of faulty data. This problem even predates computers. For example, Charles Babbage (mathematician, philosopher, inventor, and mechanical engineer, 1791–1871) wrote: “On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” This still holds true nowadays: Garbage in, garbage out!
Maybe you have worked with ERP software that has lots of data. But all too often, an analysis is completely flawed merely because the data in it is wrong. One intern in my former company once had to analyze the space needed for storage. He pulled the size of the storage for the different parts from the ERP system and analyzed… completely overlooking that the data was garbage. According to the system, O-rings were stored on pallets and engine blocks in small boxes; the data was a mess. He had to manually go through all the data to clean it up.
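A simple plausibility check can at least point to the worst offenders before anyone starts the analysis. The sketch below assumes, purely for illustration, that each record has a part volume, a quantity per container, and a container volume in liters; the function name, the field names, and the 1% minimum fill threshold are hypothetical and not from any particular ERP system.

```python
def check_storage(part_volume_l, qty_per_container, container_volume_l, min_fill=0.01):
    """Return a plausibility verdict for one storage record (volumes in liters)."""
    if part_volume_l <= 0 or qty_per_container <= 0 or container_volume_l <= 0:
        return "invalid values"
    # Share of the container volume taken up by the parts themselves.
    fill = part_volume_l * qty_per_container / container_volume_l
    if fill > 1.0:
        return "parts do not fit"  # e.g. an engine block in a small box
    if fill < min_fill:
        return "container suspiciously empty"  # e.g. a few O-rings on a pallet
    return "ok"


o_rings = check_storage(part_volume_l=0.001, qty_per_container=50, container_volume_l=1000)
engine_block = check_storage(part_volume_l=60, qty_per_container=1, container_volume_l=5)
bolts = check_storage(part_volume_l=0.01, qty_per_container=200, container_volume_l=5)
```

A check like this only flags suspicious records. Somebody still has to find out what the correct values are.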
By the way, having good data is not cheap. It is said that automotive companies pay around €50,000 just to maintain the data for one part number over its lifetime (not development or production, merely keeping the data up to date). Due to the complexity of their products, automotive companies are usually better at this, but even makers of washing machines or bicycles pay around €8,000 per part number to keep the data at least somewhat straight. It is hard to estimate how much it costs to keep the Industry 4.0 data for a plant aligned, but I believe the cost to be eye-popping. Again, this is not the cost of installing sensors or hardware, or of software licenses, but merely of keeping the data in the software at least somewhat clean. And this cost is usually not on the radar of most promoters of Industry 4.0…
Doing All of That Continuously
Only after you have merged the data and cleaned it up (weeks later) can you do a proper analysis. However, the power of Industry 4.0 is not in having the data analyzed half a year later; the power lies in having the analysis in real time or at least very soon. Overall, you have two options. You can merge and clean up the data whenever you need an analysis. Or you can set up the system to continuously clean and merge the data automatically. Doing it whenever needed is easier, but you have to do it again for every new analysis. Doing it continuously is much harder, as you need to program and set up the data collection and the data processing; but once this is set up, the data does not need to be cleaned again, and hopefully the data streams are usable as is (at least until the next machine is added or another machine or sensor is changed). Just like cleaning your home, cleaning it once is easy, but keeping it clean all the time is much more difficult…
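The continuous option could be sketched as follows: the cleaning rules are written once and applied to every reading as it arrives, instead of being redone for each analysis. This is only an illustration under assumptions. The record fields (machine, timestamp, value), the plausible range, and the two rules (drop duplicates, drop missing or out-of-range values) are hypothetical, and a real setup would need many more rules.

```python
def clean_stream(readings, low, high):
    """Yield only plausible, non-duplicate readings, one at a time as they arrive."""
    seen = set()  # grows without bound here; a real system would expire old keys
    for reading in readings:
        key = (reading["machine"], reading["timestamp"])
        if key in seen:
            continue  # the same reading was already accepted once
        value = reading["value"]
        if value is None or not (low <= value <= high):
            continue  # missing or physically implausible value
        seen.add(key)
        yield reading


incoming = [
    {"machine": "press_1", "timestamp": 1, "value": 71.5},
    {"machine": "press_1", "timestamp": 1, "value": 71.5},   # duplicate
    {"machine": "press_1", "timestamp": 2, "value": None},   # sensor dropout
    {"machine": "press_1", "timestamp": 3, "value": 9999.0}, # out of range
    {"machine": "press_2", "timestamp": 3, "value": 68.0},
]

cleaned = list(clean_stream(incoming, low=0.0, high=150.0))
```

Because `clean_stream` is a generator, it works the same way on a stored list and on a live feed. The hard part remains what the text describes: every new machine or changed sensor means the rules have to be revisited.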
Understanding the Data
Finally, now you can start to analyze and understand the data. This can be anything from an Excel file with a manual analysis to an analysis of variance (ANOVA) to artificial intelligence (AI). This is where the Industry 4.0 people get excited again, but without all the merging and cleaning, it would be a futile exercise. Sure, the algorithm would give you a number, and if management wants a number, they will get a number. But… is it a correct number or is it wildly off? Again, garbage in, garbage out! If you want data-based decision making, your data better be good! Are you leading the data, or are you led by the data? On a side note, especially in lean it is often difficult to get good data on the benefits of lean, even though they are there. Besides, a lot of your data will never get used. I had one example from maintenance, where they found out that they use less than 15% of the data they have. This is also somewhat understandable, as for example an automotive factory produces over 20 GB of data per day, with an increasing trend.
Some (Theoretical) Attempts to Tackle the Problem
This problem of messy data in different formats is not new. For example, there is the Reference Architectural Model Industrie 4.0 (RAMI 4.0). This is a construct trying to organize the many different data levels related to Industry 4.0. It is a top-down approach to manage data, and like many top-down approaches in Industry 4.0, it never seems to reach the bottom for usefulness. I have not personally worked with RAMI 4.0, but I am not holding my breath. This problem is also related to the I4.0 Maturity Index. These initiatives seem to have originated in Germany, and suffer from the German Industry 4.0 tendency to wrap a theoretical construct around something but neglect the actual functionality.
Anyway, my summary is that using data, especially from different sources, is quite a pain. And again, I recommend that you keep your Industry 4.0 approaches small and the scope of the problem manageable. If you try to fix everything, you won’t get anything done. If you don’t even know what you are trying to fix, you will just get something that probably is not really useful except for glossy public relations fliers. Now, go out, look for a manageable problem, do the diligent PDCA, and organize your industry!
P.S.: Many thanks to Xie Xuan for some pointers 🙂
This discussion of the problem is spot-on!
I have seen too many firms leaping at Industry 4.0 for solutions when many of the issues are ‘obvious’ issues that can be dealt with more quickly and efficiently using lean tools and conventional statistics.
Marvelous post. Solve those questions and you also solve the administration 3.0 challenges.
In this post you are describing what BI, big data, and ML analytics are about.