I recently finished reading a book by Sunil Soares: Big Data Governance. The whole “Big Data” topic has been exploding – so I’ve done a lot of research into the area. With my background in data architecture (which inherently recognizes the value of data) the concept of applying Data Governance principles to Big Data was interesting. So I broke down and spent the $30+ on the Kindle book so I could better absorb this concept for possible use in my professional career.
Sunil Soares used to work for IBM and was one of the authors of a free eBook: the IBM Data Governance Unified Process. I read that eBook previously and was able to absorb some of the material (I need to re-read to get more out of it). It felt like some of the information in this book was very similar to what was written in that free eBook (so if you’re cheap like me you may want to start there).
As usual I’m going to talk about what themes I saw/learned from reading this book (which could have come from other literature by him and others). The book itself has a pretty organized outline of the steps he would recommend for various data governance principles relating to Big Data. You can get most of that by simply reading the Table of Contents of the book so I’m not going to repeat it here.
I believe that the “Big Data” phenomenon has clearly demonstrated the value that data can provide. That said – it’s the results of insight where the real business value is. Sunil Soares says: “its value must be truly understood and unlocked by deriving insights that are revealed through analysis and then translating those insights into information, knowledge, and ultimately action.” Companies are finding profound ways to take data – from many sources – and derive insight they never could before (either it was impossible or too expensive).
What is Big Data Governance?
Sunil Soares defines Big Data Governance as follows: “Big data governance is part of a broader information governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives.” This definition implies that Big Data Governance sits in the context of an already existing Data Governance program. Therefore it seems that the author is saying that most of the principles for Data Governance in general would apply to Big Data Governance. I’m not sure that I agree with that – but we shouldn’t ignore what has been learned in traditional Data Governance programs.
Why Big Data Governance?
I think the first question to answer is why you would even want Data Governance as part of a Big Data program. Some of the promise of Big Data analytics is that you don’t have to do all the traditional Data Warehouse work to get results. The concept is that you can load the raw data into your Hadoop environment, perform some advanced analysis and voilà – out come meaningful results. You don’t need to perform rationalization, cleaning, summarizing, etc. – you just work with the data as it is.
While some of that is true – it’s somewhat of a misleading picture. It’s true that one of the advantages of putting all the data into the Hadoop environment is that you don’t have to rely on a sample of the data (which may not represent the whole well). For example: if you were provisioning bandwidth for a set of servers and used an averaged sample of the currently used bandwidth, you may miss the occasional spikes in bandwidth that really drive what’s needed [Scott Kahler explains this better than I did in this keynote video: http://www.kcitp.com/2012/09/03/big-data-kansas-city-technology-events/].
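To make the bandwidth example concrete, here is a minimal sketch in Python. The readings are numbers I made up for illustration (they are not from the book or the video), and the "every third reading" sample is just one way a sample could happen to miss the spikes.

```python
from statistics import mean

# Hypothetical per-minute bandwidth readings in Mbps (invented numbers):
# mostly quiet traffic with two short spikes.
readings_mbps = [40, 42, 38, 41, 400, 39, 43, 40, 380, 41, 42, 39]

# A sample that keeps every 3rd reading happens to skip both spikes.
sample = readings_mbps[::3]

sample_average = mean(sample)        # what the sampled, averaged view suggests
full_average = mean(readings_mbps)   # average over all the data
full_peak = max(readings_mbps)       # the spike that actually drives provisioning

print(f"sample average: {sample_average} Mbps")
print(f"full average:   {full_average} Mbps")
print(f"full peak:      {full_peak} Mbps")
```

Even the full average understates what you would need to provision for. It is the peak, visible only when you keep all the readings, that tells the real story.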
The other major difference is including a far greater variety of sources in your analysis – including segments of unstructured text. The technology now allows us to efficiently process a far greater volume and variety of data than we could before. We’ve also advanced in how we can process text and other varieties of data – in terms of our algorithms and other advanced processing. The ability to combine so much data together to get a picture is fascinating.
That said – it’s not quite that easy or simple. When we say unstructured data what we often mean is that there is some unstructured text within an otherwise structured container. That structure may not be as rigid – in that maybe not everything is present or it’s more variable – but it still has a structure. Therefore an effort needs to be made to understand that data – especially in terms of its reliability.
Here are a few examples of data that may not be what it seems:
- User names in Social Media – they’re not always a real name. In some social media sites there is no guarantee that the user name is a person’s real name (or any real name). This is significant as one of the goals is often to tie a master customer record to their social media data.
- Sunil Soares mentioned the term “unique visitors” (in the context of clickstream data). One site/source may measure the # of unique visitors per week while another measures it per month. If you directly compared this data without addressing this you would get skewed results.
- Let’s say we have a measurement that represents the average temperature for the last hour. If one source reports a rolling average (taking into account a large # of previous values) while another covers only that hour, the two values are not directly comparable.
- Location data – does each data source assign the same meaning to the same value? If you matched data solely on the values would you really be matching the same location?
- Another of Sunil Soares’s examples was sensor and part terminology in railroads. If we can determine that sensor event #282 typically occurs before part #339 fails, does that part have the same # in different cars/engines? Do the different sensors produce the same code for the same event? Would we need some type of cross-reference table to map these together?
- At a higher level, consider whether the same data is being pulled into your Hadoop environment multiple times. Is Data Source Q really the same as Data Source A? Did we end up wasting storage space, transmission and possibly licensing cost on duplicate data?
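The cross-reference idea from the railroad example can be sketched in a few lines of Python. Everything here is hypothetical: the source names, the event codes and the shared event names are invented for illustration, and a real table would likely live in a managed reference data store rather than in code.

```python
# Hypothetical cross-reference table: (source, local event code) -> shared event name.
# All sources, codes, and names below are invented for illustration.
EVENT_XREF = {
    ("engine_vendor_a", "282"): "bearing_overheat",
    ("engine_vendor_b", "17"): "bearing_overheat",
    ("engine_vendor_b", "282"): "door_open",
}


def canonical_event(source, code):
    """Return the shared event name, or None if the code is not mapped yet."""
    return EVENT_XREF.get((source, code))


def same_event(a, b):
    """Compare two (source, code) readings by meaning rather than by raw value."""
    event_a = canonical_event(*a)
    event_b = canonical_event(*b)
    return event_a is not None and event_a == event_b


# Different raw codes, same meaning.
print(same_event(("engine_vendor_a", "282"), ("engine_vendor_b", "17")))
# Same raw code, different meaning.
print(same_event(("engine_vendor_a", "282"), ("engine_vendor_b", "282")))
```

The point of the sketch is that matching on the raw value alone would pair up the two "282" events, which mean different things in this made-up data.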
There are other dimensions of concern that are not technical – but a function of the complex and inter-related environment we all live in:
- Privacy – despite the fact that some think privacy is dead, there are serious concerns around privacy and Big Data.
  - Consider who really owns the data. Is it yours or the customer’s? Most social media sites will tell you that the data is the customer’s – you can’t own it (you may even have to delete it if they ask you).
  - Are you, by combining data, creating new types of sensitive data that didn’t exist before?
  - Have you built safeguards into your Big Data platform to control who has access to what? (Security is not switched on by default in the native Hadoop platform.)
- Regulatory – regulatory agencies don’t care how the sensitive data is stored (i.e. Hadoop) – they will hold you accountable regardless.
  - Are you in a highly regulated industry such as HealthCare?
  - Are you dealing with sensitive corporate data governed by regulation?
  - Do you have industry constraints – such as PCI (credit cards)?
  - Do you know what the regulations are in each country you operate in (they are often different)?
- Even if something is legal – it may not look very good in the eyes of your customers or partners.
  - You must weigh the risk of the impact to your reputation vs. the revenue potential.
What is different about Big Data Governance?
So the next question is whether “Big Data Governance” is really any different than traditional governance for operational or enterprise reporting systems. I believe it can and should be different – as it’s often for a different purpose. Sunil Soares puts this well:
“Big data needs to be ‘good enough’ because poor data quality does not necessarily impede the analytics that are required to derive business insights.”
You may have heard of ETL (Extract, Transform, Load) but now there is a new term: ELT (Extract, Load, Transform). At its simplest the concept is that the data is loaded in its raw form and then transformed – not the other way around. This is possible due to the fact we can both afford to load the raw data and have the computing power to transform it in place. Therefore data quality may be enforced on the fly – instead of before the data is at rest. So the focus is on making a reasonable effort on the data that’s imported instead of making it pristine before it’s loaded.
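Here is a rough sketch of what ELT can look like, written in plain Python so it runs anywhere. The in-memory list stands in for the raw landing area (on a real platform that would be files in Hadoop), and the record layout with a temp_c field is my own invention for illustration.

```python
import json

# Stand-in for the raw landing zone; on a real platform this might be files in Hadoop.
raw_store = []


def load_raw(lines):
    """ELT 'load' step: keep every record exactly as it arrived."""
    raw_store.extend(lines)


def transform_on_read():
    """ELT 'transform' step: parse and apply 'good enough' checks at read time."""
    clean, rejected = [], []
    for line in raw_store:
        try:
            record = json.loads(line)
            record["temp_c"] = float(record["temp_c"])
        except (ValueError, KeyError, TypeError):
            # Bad records are set aside, not dropped from the raw store.
            rejected.append(line)
            continue
        clean.append(record)
    return clean, rejected


load_raw([
    '{"sensor": "s1", "temp_c": "21.5"}',
    '{"sensor": "s2"}',
    'not json at all',
])
clean, rejected = transform_on_read()
print(len(raw_store), len(clean), len(rejected))
```

Nothing is thrown away at load time, so if the quality rules change later you can re-run the transform against the same raw data.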
How do I implement Big Data Governance?
The next question is what a framework for implementing Big Data Governance might look like. Here are some of my thoughts (hopefully organized enough to be useful):
- Know your Data.
  - Catalog your internal and external data. Other than for a sandbox, don’t let data into your Big Data platform unless it’s cataloged.
  - Understand your data – not in complete detail but the overall quality, time scale, etc.
  - Document some of the key fields within your data – ones that aren’t intuitive and that are key to using the data effectively.
  - Develop a method to document and share this metadata.
- Know your organization and your platform.
  - Understand who can be involved in data quality – both at its source and while it’s in your Big Data platform.
  - Understand what your platform can do – good and bad.
- Understand constraints, regulations, etc. – especially by region.
  - Understand your legal, ethical and internal constraints.
  - Evaluate these by region – as they can differ greatly.
  - Understand what your organization’s commitment level is regarding platform and people resources.
- Determine what data needs to be cleaned up and what needs to be protected.
  - Flag data that needs to be cleaned and why it needs to be cleaned.
  - Flag data that is sensitive and needs to be protected.
  - Develop a method to document and share this metadata.
- Determine how and when to clean and protect the data.
  - Will you clean your data before it hits your Big Data Platform, after it hits it or in real time?
  - Determine strategies for cleaning that data.
  - Determine strategies for protecting sensitive data and overall security schemes.
- Evaluate how you are doing on a regular basis.
  - Establish routine meetings (quarterly, yearly, etc.) to evaluate how things are going.
  - Create the expectation that this is a process and that changes will be common.
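To give a feel for the cataloging and flagging steps above, here is one possible sketch in Python. The CatalogEntry fields, the register and may_load helpers, and the example datasets are all names I made up. A real catalog would sit in a shared metadata tool, not in a script.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One cataloged dataset; every field name here is illustrative."""
    name: str
    source: str                      # "internal" or "external"
    time_scale: str                  # e.g. "hourly", "weekly"
    sensitive: bool = False          # needs to be protected
    needs_cleaning: bool = False
    cleaning_reason: str = ""        # why it needs to be cleaned
    key_fields: dict = field(default_factory=dict)  # field name -> plain meaning


catalog = {}


def register(entry):
    """Add a dataset to the shared catalog."""
    catalog[entry.name] = entry


def may_load(dataset_name, sandbox=False):
    """Only cataloged data gets onto the platform, except for sandbox loads."""
    return sandbox or dataset_name in catalog


register(CatalogEntry(
    name="clickstream",
    source="external",
    time_scale="weekly",
    needs_cleaning=True,
    cleaning_reason="unique visitors are counted per week, not per month",
    key_fields={"uv": "unique visitors within one week"},
))
register(CatalogEntry(name="customer_master", source="internal",
                      time_scale="daily", sensitive=True))

sensitive_names = sorted(e.name for e in catalog.values() if e.sensitive)
print(may_load("clickstream"), may_load("mystery_feed"), sensitive_names)
```

Even something this small covers the basics: what the data is, where it came from, whether it needs cleaning or protecting, and a gate that keeps uncataloged data out of everything but the sandbox.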
In conclusion I believe introducing Data Governance into a Big Data environment is a worthwhile choice.