It’s no exaggeration to say that big data is big business nowadays. Research by IDC suggests the big data analytics market is growing at 11 per cent each year, and will soon be worth more than $200 billion. That’s not entirely surprising when you consider the predictions for future data volumes, with the ten zettabytes of data generated in 2015 expected to increase 18-fold by 2025.
What will we analyze and evaluate in this big data tutorial?
In this big data tutorial, we analyze the industries set to reap the benefits of this burgeoning market. We also evaluate the respective merits of three successful big data frameworks, all of which share the same DNA but approach their tasks in very different ways…
Applying a definition to big data
Big data is defined in dictionaries as huge data sets where computational analysis identifies patterns, trends and associations. In some areas it will help companies shift more stock and target consumers more precisely. In others, it will power increasingly autonomous robotics such as self-driving vehicles. And elsewhere still, big data will underpin a revolution in knowledge, helping medical professionals to proactively monitor our health, and automating our homes through learned behaviors.
Today, tutorials and lectures on big data tend to focus on its use by social media platforms, as the historic divide between data analysis and data storage is bridged. Curated follower suggestions and non-chronological timelines demonstrate big data molding our online experiences, according to interpretations of our preferences. And while there are strict security and compliance guidelines surrounding the use of big data, people’s willingness to share the minutiae of their lives on social media platforms represents a goldmine of information that forward-thinking companies are enthusiastically exploiting.
Any device that uploads information to the cloud is contributing to the burgeoning volume of big data being generated by billions of formerly passive offline devices. This information has to be processed to be of use, otherwise the overall aim of improving knowledge won’t be met. But with unprecedented volumes of information being created by domestic devices and new technology like wearables, it’s fairly obvious that traditional methods of data interrogation won’t be up to the job. Large-scale automation is required instead.
AI automation for the people
Automation is central to the future of artificial intelligence, with high hopes surrounding the Deep Learning subset of machine learning. Inspired by the neural networks of the human brain, Deep Learning has seen programmers and engineers attempting to develop a computerized form of learning modeled on human interpretation and analysis. Today’s tech giants are piling into Deep Learning, from Google and Microsoft to IBM and Intel. Their focus ranges from natural language processing and emotion detection to audience profiling.
Yet despite the considerable investment being made by these industry heavyweights, most big data analytics currently relies on free content from the Apache Software Foundation. This non-profit software development organization has created the building blocks of choice for most big data processing tools, which are interchangeably referred to as engines or frameworks. Requiring entirely new methods of data processing, these frameworks identify surface patterns and complex interactions alike.
While this will be invaluable for automation, it’s equally valuable for companies wanting to optimize their sales and marketing activities. However, big data analytics requires more than simply harvesting data. It demands a holistic approach to sorting and treating information with a view to solving problems, or gaining insights. Information can be processed as it comes in (known as streaming) or from storage (referred to as batch processing). Hybrid systems can do both, potentially representing the most powerful solution.
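The batch-versus-streaming distinction is easiest to see in code. This is a minimal, framework-free Python sketch – the function names are purely illustrative – showing that a batch job operates on a complete, bounded dataset, while a streaming job updates its result as each record arrives:

```python
def batch_total(stored_records):
    """Batch processing: operate on a complete, bounded dataset from storage."""
    return sum(stored_records)

def streaming_totals(live_records):
    """Stream processing: update the result as each record arrives."""
    running = 0
    for record in live_records:
        running += record
        yield running  # a result is available immediately, per record

# Batch: the whole dataset is read from storage, then processed once.
print(batch_total([3, 1, 4]))             # 8

# Streaming: the same records, processed one at a time as they arrive.
print(list(streaming_totals([3, 1, 4]))) # [3, 4, 8]
```

A hybrid system, in these terms, is simply one that offers both entry points over the same data.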
Big data processing platform wars
Apache Hadoop has been the market leader for some time, though its recent decline has mirrored the rise of its younger Spark cousin. And then there’s Storm – something of an outlier in the Apache family, but well worth considering by companies with large volumes of live information to interrogate and analyze in real time. However, the obvious place to start is with the much-hyped and widely-discussed market leader, whose name was historically used as shorthand for big data…
Batch processing – Apache Hadoop.
Named after a toy elephant belonging to the lead developer’s young son, and initially developed for a now-obsolete search engine, Hadoop reached its 1.0 release in 2011. Back then, big data was a concept rarely discussed outside tech conferences, so this Java-based framework was able to dominate the nascent market for storing and processing large offline data sets.
Large-scale data processing dates back to the punch cards used to store information for the 1890 US Census, though Hadoop differs from those paper ancestors in handling thousands of terabytes of data across an equivalent number of commodity hardware nodes.
- Hadoop’s basic processing capabilities can be enhanced with a number of complementary programs. Examples include the workflow scheduling system Oozie, and Phoenix, a SQL query engine that runs on top of the HBase database.
- Hadoop is a cost-effective storage solution for raw data, which doesn’t need to be compressed or deleted before it’s processed.
- Information can be stored in vast quantities across multiple locations, preventing individual nodes or servers becoming overloaded.
- Data is stored on a distributed file system, which enables terabytes of data to be processed within a matter of minutes. Having raw data on the same servers as processing tools effectively eliminates transfer times that would be incurred with a less efficient setup.
- The open source nature of Hadoop makes it easy for companies to experiment with big data, without having to make a significant up-front investment in enterprise software.
- A wide variety of source data can be processed, without requiring conversion into a single compatible format. This is great for omnichannel contact centers, where unstructured data may be acquired via text, email, phone call or chatbot functionality.
- Redundancy is built into Hadoop’s operations, with automatic data duplication onto permanent storage preventing accidental data loss. That gives PaaS, IaaS and SaaS providers confidence in their offerings, and it should reassure their clients as well.
- It’s designed to handle batch processing, but complementary technologies now support streaming data processing as well.
- Open source software is constantly evolving. That’s good for development, but bad in that new parts of the ecosystem can evolve without warning – and existing software extensions occasionally vanish overnight. Tool stacks from distribution companies can alleviate this to an extent, but there may still be unexpected surprises around the corner.
- Most big data tutorials mention concerns about Hadoop’s security, since it was built entirely in Java – a platform whose vulnerabilities have been regularly exploited by cyber criminals.
- While it may be ideal for global companies with multichannel data streams, Hadoop isn’t really suited to smaller companies. The sheer scale of its processing powers means it can’t function as effectively in smaller environments with limited data.
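Hadoop’s batch model is easiest to picture as MapReduce: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a framework-free Python sketch of the classic word-count example – real Hadoop distributes the same three phases across cluster nodes, but the logic is identical:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's list of values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is stored"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Because each map call and each reduce call is independent, the framework can run thousands of them in parallel – which is why the model scales to the node counts described above.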
Streaming processing – Apache Storm.
While Hadoop specializes in batch processing, Storm takes a different approach. This real-time computation system is designed for processing unbounded data streams, where operations are applied to each item as it passes through, rather than across an entire dataset once it’s complete. Storm’s developers describe it as “doing for real-time processing what Hadoop did for batch processing”.
They also rather optimistically describe Storm as “a lot of fun to use”, though it must also be acknowledged it’s relatively easy to integrate with legacy software. It’s also hugely powerful, processing a million of its multi-part data structures (known as tuples) per node per second. That computational power has seen Storm adopted by leading online brands including Spotify, Groupon and VeriSign.
- Millisecond latency makes Storm a market leader in real-time analytics – or as close to real time as big data frameworks can currently get. That makes it invaluable for circumstances where processing feedback lives up to its name by being fed back to a website or database instantly.
- Storm can integrate with Hadoop deployments, adding streaming processing to Hadoop’s acknowledged batch processing expertise.
- Storm is compatible with numerous programming languages including Python, Ruby and Java, so it doesn’t require a particular stack to operate. This makes it accessible to a wider audience, since clients can port existing real-time processing code to run on Storm’s API.
- Storm has no single point of failure that can stop all running jobs, making it extremely robust and resilient to process faults. If a node fails, its tasks will be reassigned to other nodes in the cluster.
- Exactly-once processing can be carried out by the micro-batching Trident API, which sits on top of Storm.
- Storm’s pull model ensures data isn’t lost, with each tuple guaranteed to be processed at least once.
- Once a topology is started, Storm will run until it’s manually terminated. This ability to leave it in the background quietly processing data appeals to time-poor IT executives.
- The lack of batch processing capabilities (beyond micro-batching) means Storm is potentially limited in its appeal to companies whose data processing requirements may change in future.
- Twitter outgrew Storm’s performance boundaries as integral features like resource management became traffic bottlenecks. Firms planning on handling huge data volumes may also find themselves being slowed down in future.
- Debugging isn’t always easy to accomplish, and may become time-consuming.
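Storm structures a job as a topology: spouts emit tuples from a source, and bolts transform each tuple the moment it arrives. This is a simplified, framework-free Python sketch of that tuple-at-a-time flow – real Storm runs spouts and bolts in parallel across a cluster, and the names here are purely illustrative:

```python
def sentence_spout(sentences):
    """Spout: emits one tuple per item from a (potentially unbounded) source."""
    for sentence in sentences:
        yield sentence

def split_bolt(tuples):
    """Bolt: splits each incoming sentence tuple into word tuples."""
    for sentence in tuples:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt: keeps a running count, emitting updated state per tuple."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire spout -> bolt -> bolt. Each tuple flows through the whole pipeline
# as soon as it arrives, rather than waiting for the dataset to complete.
stream = count_bolt(split_bolt(sentence_spout(["storm streams data", "storm scales"])))
print(list(stream))
```

Note how a result for the first sentence is available before the second sentence has even been read – that per-tuple immediacy is what delivers the millisecond latency described above.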
Hybrid processing – Apache Spark.
Spark is another Apache Software Foundation project, which has found popularity as a more general-purpose platform than its cousins. It’s a batch processing framework that can also handle stream processing, making it a popular choice for companies with diverse processing requirements.
In 2014, Spark set the record as the fastest open source engine for sorting a hundred terabytes of data – something we approve of here at 100TB! Its strengths lie in batch or graph processing and SQL access, while micro-batching works well for smaller real-time data sets. Like Storm, Spark can run on top of Hadoop; however, it offers API support for fewer languages. Nonetheless, Spark is compatible with Java, Python and Scala.
- Spark is commonly used for recommendation engines, since its interactive analytics and machine learning are well-suited to the organic nature of streaming services or ecommerce platforms.
- Data is processed in-memory rather than on disk, with the only storage layer interactions involving loading input data and writing final results. That makes Spark immensely quick.
- The inclusion of GraphX for distributed graphs is a useful computational feature, with algorithms and builders designed to simplify the task of generating customized graphical analytics.
- Another inclusion is MLlib, a machine learning module whose library of algorithms is under constant development, improving the quality of data processing.
- Reliability is assured by Spark’s exactly-once message semantics – superior to at-most-once delivery (where messages may be lost) or at-least-once delivery (where messages might have to be redelivered).
- Being able to handle real-time and batch processing at once – albeit with some limitations – makes Spark a better all-rounder than a single-source specialist like Hadoop.
- Because it was launched so recently, Spark is better placed to exploit the greater capabilities of in-memory calculations offered by modern hardware. It can use as much RAM as is available, making it an order of magnitude faster than older rivals.
- Spark often incurs higher latency than rival platforms, due to buffering data as it’s loaded. It’s therefore not ideal for scenarios where streaming performance is a primary concern, while memory consumption may also be higher than with rival platforms.
- If a network receiver fails, tiny fragments of buffered data might be lost; data on a failed worker node, by contrast, can be recomputed from the remaining input data. Receiver losses can be avoided with Write Ahead Logs, but at the cost of increased latency – overhead that Storm doesn’t incur.
- Because live data streams are split into batches and processed in batches, Spark isn’t technically delivering real-time processing.
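The micro-batching point above can be shown in a few lines of framework-free Python: incoming records are buffered into small fixed-size batches, and each batch is processed as a unit. A record therefore always waits for the rest of its batch, which is where the extra latency – and the “not technically real-time” caveat – comes from:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Buffer a record stream into small fixed-size batches, Spark-style."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    """Process each micro-batch as a single unit, e.g. a per-batch total."""
    return sum(batch)

records = [5, 1, 2, 8, 3]
results = [process(b) for b in micro_batches(records, batch_size=2)]
print(results)  # one result per batch, not one per record
```

Contrast this with Storm’s per-tuple model: here the first record produces no output until the second record has arrived to complete its batch.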
Big data is only going to get bigger
The benefits of big data processing can be summed up by considering the advertising industry. The hit-and-hope nature of traditional print advertising has already been transformed by the reams of targeted data generated through platforms like Facebook, with pay-per-click campaigns providing a far more measured and cost-effective way to spread a corporate message. Big data could provide another quantum leap, ensuring that sales and marketing campaigns are directed to people with the highest likelihood of being interested. Today’s PPC could eventually be made to look as crude and vague as billboard advertising appears compared to modern Google AdWords campaigns.
Some companies have argued that big data is an issue they don’t need to be concerned about at present. However, there is compelling evidence to the contrary. Big data tutorials and software rollouts already affect numerous aspects of our daily lives.
These are some of the key industries where framework tools like Hadoop and Spark will eventually have the most transformative effects:
- Government. Governments have often been caught out by unexpected events, such as natural disasters or viral outbreaks. Being able to detect and study patterns of behavior among their citizens would enable governments to make more accurate predictions and respond to real-time events better. Consider the slow and inefficient response to Hurricane Katrina, where big data could have predicted likely population displacement or been used to direct relief and redistribution efforts more effectively.
- Transport. Autonomous vehicles aren’t the stuff of science fiction any more – they’re already rolling down highways from California to Sweden. Each vehicle relies on vast quantities of sensory data to plot a route, while immense processing power attempts to predict (and respond to) the innumerable scenarios that arise on public roads. Big data can also help with traffic control and route planning, alleviating congestion and maximizing resources. It could even influence future infrastructure investment decisions across road, rail and air.
- Manufacturing. As natural resources are depleted and global populations continue to rise, optimizing efficiency will be critical for manufacturers. Huge amounts of data are currently being left untapped, preventing improvements in efficiency and reliability. Companies employing big data processing could improve logistics, from raw mineral extraction through to final retailer delivery.
- Finance and banking. The importance of split-second data inputs is uniquely relevant in the world of finance, where the fractional delay caused by multiple internet nodes has potential to cost companies millions of dollars. Algorithms that can combine historic and live data feeds to identify trends or opportunities would surpass any level of human knowledge. Big data could improve today’s banking fraud detection systems, and the SEC has already introduced network analytics and natural language processors to try and identify illegal trading.
- Communications and media. From curated on-demand content suggestions to non-chronological social media timelines, many people’s first exposure to big data will come via entertainment platforms. The process of analyzing and exploiting patterns of behavior supports greater personalization, which in turn makes consumers feel more valued and brand-loyal. Telecom companies and social media providers are uniquely blessed with a wealth of raw data – analyzing it will enable them to deliver exactly what consumers want, based on their own habits.
- Education. It’s a curious fact that our education sector has remained relatively resilient to the lure of big data. Yet this is a perfect example of data siloes – different lecturers teaching varied subjects to large groups of students with individual strengths and weaknesses. If academic staff had a platform to share all this knowledge, comprehensive patterns of behavior would become evident and educational standards would rise across the board. However, work in this sector requires the very highest standards of data protection and personal privacy to be adhered to.
- Energy. Our existing energy infrastructure is creaking at the seams, and supply is barely keeping up with demand. Big data could radically improve our knowledge (and usage) of power and water. Smart energy meters are a baby step towards a truly connected future, monitoring power consumption in real time and generating extremely accurate bills. Knowing exactly how and when consumers require resources could support better load balancing, resource redistribution and economies of scale.
- Healthcare. Our smart toothbrushes and bathroom scales aren’t uploading data for the sake of it. Like the revolution in optometry achieved by retinal scanners and fundus cameras, being able to study our bodies in greater detail will reveal long-term trends in both individuals and society. The former could enable preventative medical treatments, while the latter might help medical practitioners to prepare for changing healthcare needs over time. From disease tracking to hospital staffing, analytical studies of healthcare could raise operational efficiencies and lower medical premiums.
- Telecoms. The days of fixed mobile tariffs and set data allowances will surely come to an end when big data identifies how individuals use their communication packages. The Holy Grail of quad play subscriptions (phone, TV, internet and mobile) will be far more tempting if people are being offered the services they want and the data they need, rather than arbitrary packages that try to cover all bases. Big data might also support a greater understanding of real-time bandwidth requirements, and optimized data distribution.
- Retail. Of all the industries in this list, retail has always had a head start in terms of available data. The barcode scanners and customer loyalty cards familiar to modern consumers could become far more powerful thanks to retail data analytics, which should prevent stock ever selling out. It has potential to support optimized staff levels at any given time, based on shopping pattern analysis and consideration of factors from live traffic congestion to local holidays.
- Insurance. The insurance industry has always operated using a form of educated guesswork, based on loss adjustment and estimated values. Imagine the precision big data could bring to this market. Knowing the exact value of purchased items, the specifics of crime data and policyholder lifestyle behaviors would support fully tailored premiums and claims management procedures. Such improvements would encourage customers to take out appropriate insurance where it’s not already compulsory, as well as reassuring them their premiums were precisely calculated rather than arbitrarily estimated.
- Agriculture. Our last example is perhaps the most compelling, as an expanding global population competes for diminishing natural resources. From large-scale GM modeling to livestock feed rationing, the potential for optimizing efficiency is irrefutable. Weather modeling could support better crop yields, and statistical analysis may enable farms to ensure nothing is wasted or overlooked.