The business of data management has no shortage of descriptive metaphors to help explain complex topics or technologies. Some are more helpful than others. One that is particularly interesting, and somewhat confounding, is the "Data Lake." It is interesting because it quickly captures the problem space surrounding that other metaphor, "Big Data," as well as the confluence of a number of other factors and trends that have been driving us towards something fairly new for Data Management.
“Data Lake” is confounding because it is only conceptual in nature and doesn't imply any sort of framework or approach. In other words, 'how can one attain Nirvana without understanding the path?' It is even more confounding than many other metaphors in that it can also be interpreted in a number of different and potentially contradictory ways.
So, how do we find better ways to describe our evolving data management universe - ways that help us achieve clarity and that can lead to tangible solutions? One has to ask whether a simple term or phrase could even begin to accomplish something like that. Before we try to answer that, let's step back and look at the problem space.
Understanding the Problem Space:
Issue 1 - There is a huge difference between managing a large volume of relatively homogeneous and simple data and managing a large volume of very diverse and potentially complex data. In other words, the Data Lake can be wide but shallow, narrow and deep, or wide and deep. It could also be murky, or filled with fish and algae. These are just some examples of why that particular metaphor is so problematic.
Issue 2 - Trying to apply traditional Data Governance and / or holistic control over a Data Lake (which many organizations are recommending) is a big mistake. Why? Because Data Governance has never yet been successfully extended to most enterprises in its current form - multiplying its complexity by an unknown factor is only likely to turn people off from the notion of exercising any control at all. In other words, why do we think one size ought to fit all? Where did this notion come from?
Issue 3 - Data is Subjective. The Lake, Big Data and other related trends are forcing us to recognize that subjective and dynamic information can be a part of enterprise data management, and that as a result Data Management as a field / discipline will change significantly over the coming years. In other words, some data is really subjective while other data at least attempts to be objective, and sometimes this is the same data. This also implies unique management considerations for each type of data (or data location).
Introducing a New Metaphor:
So, perhaps what we need is a new metaphor to help better describe how Data Management is evolving and to begin the process of building a framework which recognizes that data is not really some monolithic bucket of information. The metaphor I have in mind is "Data Zones." It doesn't represent any particular technology per se, but rather presents a view of how data might be managed in the context of all emerging and existing data-focused technologies. In essence, it begins to address the issue of data control and frames it as a spectrum rather than as a singular approach.
A "zone of control" translates into potential governance and management workflows or design patterns (specific to each zone). It also must be understood that data can travel from one zone to another, so it is important to understand those transitions and to recognize that a given piece of data does not belong permanently to one zone (the lifecycle of any given entity or element may extend across zones). There are already parallels to this view within IT which can help to operationalize or adopt the notion of Data Zones quickly. For example, in information security we typically view the location of assets and data as zones, even though we might not always refer to them that way. In most enterprises, we have internal and external zones and a DMZ, each with different security controls applied. In fact, some of that view overlays nicely with Data Zones in the context of controls and where data resides at any given point in time.
How does this help?
The notion of Data Zones helps to apply structure to what has become a very open-ended trend. In fact, not too many years ago this same problem space (or a variant of it) was often referred to as Structured vs. Unstructured Data. It is debatable these days whether Big Data is structured or not, but what isn’t debatable is the fact that most enterprise organizations haven’t determined how Big Data fits in the context of existing data governance or management paradigms (in terms of policy, workflow, design etc.). And many of the technologies that were, or still are, considered unstructured remain outside typical data management paradigms. Much of the effort surrounding Big Data has occurred in its own silo or sandbox, but as with any maturing technology, at some point it must be brought into the fold.
The Data Zones concept begins to attack the problem space in the following ways:
- It provides a foundation for defining ‘targeted’ data policy and principles.
- It implicitly guides Data Governance workflows / processes (i.e. the lower the zone, the stricter the control).
- It allows for flexible management (e.g. data can move in and out of zones or be viewed differently in different zones).
- It implicitly guides Data Modeling expectations (i.e. the lower the zone, the higher the modeling expectation).
- It blends naturally into information security control paradigms, thus making alignment between data management and data security relatively seamless.
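To make the "lower zone, stricter control" idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three zone names, the two policy flags and the `move` helper are hypothetical and are not part of any existing product or standard. The point is only that governance and modeling expectations can be derived from the zone rather than applied uniformly.

```python
from dataclasses import dataclass
from enum import IntEnum


class Zone(IntEnum):
    """Hypothetical zones; a lower number means stricter control."""
    CORE = 0         # e.g. master / reference data
    SHARED = 1       # e.g. integrated, reusable data sets
    EXPLORATORY = 2  # e.g. raw landing area or sandbox


@dataclass(frozen=True)
class ZonePolicy:
    """Illustrative controls attached to a zone (assumed, not prescriptive)."""
    requires_steward_approval: bool
    requires_formal_model: bool


def policy_for(zone: Zone) -> ZonePolicy:
    # The lower the zone, the stricter the control and the higher
    # the modeling expectation.
    return ZonePolicy(
        requires_steward_approval=zone <= Zone.SHARED,
        requires_formal_model=zone == Zone.CORE,
    )


@dataclass
class Dataset:
    name: str
    zone: Zone
    has_formal_model: bool = False


def move(dataset: Dataset, target: Zone, approved: bool = False) -> Dataset:
    """Return a copy of the data set placed in the target zone.

    The transition is checked against the target zone's policy, so
    moving data "down" into a stricter zone requires more.
    """
    policy = policy_for(target)
    if policy.requires_steward_approval and not approved:
        raise PermissionError(f"{target.name} requires steward approval")
    if policy.requires_formal_model and not dataset.has_formal_model:
        raise ValueError(f"{target.name} requires a formal data model")
    return Dataset(dataset.name, target, dataset.has_formal_model)


clickstream = Dataset("clickstream", Zone.EXPLORATORY)
shared = move(clickstream, Zone.SHARED, approved=True)
print(shared.zone.name)
```

In this sketch the same data set can live in the exploratory zone with almost no ceremony, yet it has to earn its way into the stricter zones. A real implementation would likely attach many more controls (retention, security classification, quality thresholds) to each zone.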
The Benefits of Data Zones as a framework and concept include:
- It gives us an immediate tool to help demystify the Data Lake and Big Data as well as other unstructured data technologies which may not currently be considered within the context of Data Management.
- It immediately dismisses the fallacy of one size fits all governance.
- It provides a flexible yet managed paradigm. In other words, zones don’t just provide guidance; they provide guidance that can be easily tailored and that evolves along with the enterprise.
- It is actionable and clear - it allows for real-world design and configuration as well as process workflows. It gives us a starting point…
- It helps to more easily discern the relationships between various types of data and technologies and to better understand data lifecycles (within complex integrated ecosystems).
So, is “Data Zones” just another murky metaphor? You decide.
copyright 2015, Stephen Lahanas