Big Data Architecture: How to Build Order Out Of Chaos

13th October, 2016

There’s a central contradiction at the heart of big data governance: the rigid classification and control of information that typifies most governance initiatives seems wholly at odds with the diverse, distributed, unstructured nature of big data architecture. Yet there’s no getting away from the fact that governance is essential, for both regulatory and business reasons. Here’s why, and how to get it right.


Big Data Architecture: Supporting Walls And Pillars

Irrespective of whether data is big, small, structured or unstructured, an organization still needs to be able to guarantee:

  •      Privacy/confidentiality – any data which is of a personal or commercially sensitive nature must be identified, labelled as such and appropriately protected from prying eyes. Failure to do so could lead to legal difficulties, financial loss and/or reputational damage. This is the equivalent of a structural supporting wall.
  •      Control – at all times you need to know what data you’re handling, how it’s being handled, who (and what) is accessing it and where it resides. This is essential both for good management and in order to be able to comply with regulatory and audit requirements (particularly in heavily regulated sectors like finance and the public sector). You also need a strategy for data lifecycle management, so that you’re not keeping non-essential data longer than you need.
  •      Integrity – this is essential to ensure that bad data doesn’t skew the results of any big data analysis and lead to inaccurate insights and, in consequence, poor business decisions.
  •      Availability – information must be available to individuals and applications at the time it is needed in order that workflow is not interrupted and business competitiveness is maintained.
  •      Manageability – governance frameworks and processes for big data must be kept as lightweight and agile as possible, since the size and complexity of big data mean governance costs can, if you’re not very careful, easily spiral out of control and wipe out any business value.

The challenge is further compounded by the fact that, as yet, no standard framework or big data architecture has emerged for governing big data. In addition, rules and regulations over things like privacy and data protection vary among different jurisdictions, and are in a constant state of flux.

Getting It Right: Some Key Considerations

Ensure full business support:

One of the key challenges with implementing data governance is ensuring the whole business understands the need for it, fully buys into it and rigorously follows the processes you lay down. Typically, data governance fails because people push back against proposals which they see as stifling their ability to do their jobs quickly and efficiently. Mitigate this by involving people from all your lines of business in the creation and oversight of a big data architecture plan. This way it’s more likely to be developed in line with business needs and priorities. Identify ‘data governance champions’ from each line of business to evangelize its critical importance to the rest of the organization. Some businesses find it useful to set up a steering committee comprising people from all business areas to set the strategy, and a data governance office and service team sitting under them to oversee and implement the strategy.

Only govern to the extent necessary:

As the volume and variety of big data grow, governance processes become more complex and unwieldy. As a result, you don’t want to impose excessive governance on big data where it’s not needed. For example, if you’re pulling in data from social networks that’s publicly available, this won’t require the same level of security or oversight as internal transaction data or private information on customers and prospects. Its business value will also rapidly deplete, so ensure you’re not storing it (or any other data) longer than you need, otherwise the cost of the resources needed to do so may outweigh any benefits it gives to the organization.
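To make the idea of tiered governance concrete, here is a minimal Python sketch of a retention check. Everything in it is an assumption made for illustration: the data class names, the retention periods and the DataSet record are hypothetical, and the right values will depend on your own regulatory and business requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical governance tiers: how long each class of data is kept.
# The class names and periods are illustrative assumptions only.
RETENTION = {
    "public_social": timedelta(days=30),
    "customer_private": timedelta(days=365 * 2),
    "internal_transaction": timedelta(days=365 * 7),
}


@dataclass(frozen=True)
class DataSet:
    name: str
    data_class: str
    ingested_on: date


def is_expired(ds: DataSet, today: date) -> bool:
    """Return True once a data set has outlived its class's retention period."""
    return today - ds.ingested_on > RETENTION[ds.data_class]


def expired_names(datasets, today: date):
    """List the names of data sets that are due for deletion or archiving."""
    return [ds.name for ds in datasets if is_expired(ds, today)]
```

A scheduled job could call expired_names() daily and hand the result to whatever deletion or archiving process you already run, so short-lived public data is cleared out quickly while regulated records are left alone.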

Use meaningful metadata and a flexible taxonomy:

All big data architecture relies on having descriptive metadata that tells you about individual data objects, and a standard taxonomy that allows you to place different types of data in the appropriate classifications. While traditional relational database stores and data warehouses lend themselves to this, it isn’t always so easy to do with a lake of big data. Consider employing a looser structure that better fits with the flat big data architecture – e.g. tagging data objects with one or more standard keywords rather than forcing them into a tree-style hierarchy, and not labelling specific data objects with more items of metadata than you need to meet the requirements of your strategy.
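As a rough illustration of flat tagging, the Python sketch below keeps a small index from standard keywords to data objects instead of a tree-style hierarchy. The vocabulary, the TagIndex class and the object identifiers are all hypothetical; a real deployment would more likely rely on the tagging features of its metadata tool.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: the "standard keywords" agreed
# by the governance team. Illustrative only.
ALLOWED_TAGS = {"customer", "finance", "pii", "public", "social"}


class TagIndex:
    """Flat keyword index: each object carries any number of standard tags."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, object_id, *tags):
        """Attach tags to an object, rejecting anything outside the vocabulary."""
        unknown = set(tags) - ALLOWED_TAGS
        if unknown:
            raise ValueError(f"tags not in the standard vocabulary: {sorted(unknown)}")
        for t in tags:
            self._by_tag[t].add(object_id)

    def find(self, *tags):
        """Return the ids of objects that carry every one of the given tags."""
        if not tags:
            return set()
        return set.intersection(*(self._by_tag.get(t, set()) for t in tags))
```

Because an object can carry several tags at once, a data set that is both customer data and personal data does not have to be forced into a single branch of a hierarchy, and rejecting unknown tags keeps the vocabulary from drifting.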

Automate as much as you can:

To maintain your business agility, you don’t want to bog people down with lots of additional tasks in order to comply with your data governance policies – whether that’s your IT teams or people in other areas of the business. Keep governance as lightweight as possible without opening up the business to unacceptable risks. Automation can help here. For example:

  •      Tools such as Apache Atlas, a data governance and metadata framework for Hadoop, can help index big data automatically, applying metadata and taxonomical classifications according to your policies. Governance tooling of this kind can also help you detect and reject duplicate data, and generate the compliance reports you need.
  •      Automated authentication and verification can ensure people and systems authorized to access particular data sets can do so without having to fiddle about with passwords and authentication tokens.
  •      Maintaining data integrity can be made easier by using automated systems to continually analyze change logs and raise an alert if there are any unusual events or unauthorized changes.
  •      Availability can be better assured by automatically monitoring systems and processes against pre-defined SLAs so you can be alerted instantly to any potential service disruptions. You can also automate much of the disaster recovery process, for example by putting in place automated backup and failover in the event of systems being brought down by some natural, technical or man-made disaster.
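The change-log monitoring described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ChangeEvent record, the data set names and the AUTHORIZED_WRITERS policy are hypothetical, and in practice the events would come from your platform’s own audit logs.

```python
from dataclasses import dataclass

# Hypothetical policy: which actors may modify which data sets.
AUTHORIZED_WRITERS = {
    "customer_records": {"crm_service", "dba_team"},
    "transactions": {"billing_service"},
}


@dataclass(frozen=True)
class ChangeEvent:
    dataset: str
    actor: str
    action: str  # e.g. "insert", "update" or "delete"


def unauthorized_changes(events):
    """Return the events whose actor is not authorized for that data set.

    Changes to a data set with no policy entry are always flagged.
    """
    return [
        e for e in events
        if e.actor not in AUTHORIZED_WRITERS.get(e.dataset, set())
    ]
```

Run on a schedule against new log entries, a check like this could feed whatever alerting channel your operations team already uses, so nobody has to review the logs by hand.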

The above list is certainly not exhaustive. Every organization will have its own priorities and required levels of complexity for big data governance, based on its particular business focus, the precise nature of its big data architecture and any existing data governance framework it may be using. But by keeping in mind the broad principles we’ve highlighted here, you are far more likely to be able to find a way through the chaos and start gaining meaningful control over your big data.