Selecting the right big data analytics tools is tough. The market is awash with options, from those designed for use by data scientists to others aimed squarely at non-technical business types. Some are very flexible, with the ability to craft precise queries covering your every analytical need. Others are more specialized, built to carry out one type of analysis very well. They might be integrated with a specific platform or distribution, or they could be standalone tools and cloud services.
What Does Your Big Data Analytics Tool Choice Depend On?
Given the variety of solutions and the diverse needs they cater for, your choice of big data analysis tools will depend on a number of factors, chiefly:
- Your organization’s level of technical and statistical modeling skills;
- The type(s) of data you want to analyze;
- Whether you want to analyze big data in real time;
- Who you want to use the tool (data scientists and IT teams, non-technical departments such as marketing or HR, senior management, etc.);
- Whether you’re looking to conduct a specific type of analysis or just ‘fishing’ for insights;
- How easily the tool integrates with your existing big data framework.
Selecting The Right Database
Traditional relational databases (RDBs) aren’t always well suited to the so-called ‘three Vs’ of big data – volume, variety, and velocity – particularly if you’re hoping your big data analytics tools can run real-time, or near real-time, analysis. While the big players are making theirs more ‘big data friendly’, RDBs are generally hard to scale, and data usually has to be converted to an appropriately structured format before it can be loaded. As a result, there has been a rapid rise in the use of NoSQL databases for querying big data that’s unstructured (or partially structured).
The main categories of NoSQL database to consider are:
- Document stores – such as MongoDB, CouchDB, and Elasticsearch – are commonly used to handle data in popular web formats like XML and JSON, but can be adapted to suit almost any schema. They can also write and query data very quickly.
- Key-value stores – such as Redis and Riak – store data as buckets of key-value pairs, much like a hash table. They are extremely fast for writing, reading and updating when you know the key, but slow when it comes to carrying out multiple updates, or if you need to query the whole database.
- Column stores – such as Cassandra, Google BigQuery and Druid – are suited to higher-end big data applications where data has some structure. They appear to store data in rows, but they actually serialize it into columns, which makes them very fast for queries that read or aggregate a few columns across a large number of rows.
- Graph databases – such as Neo4j – are best when what matters most is the relationships between data. They store data as a web of intersecting nodes and are great for uncovering connections and correlations between disparate records very quickly.
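To make the four categories above more concrete, here is a minimal sketch in plain Python that lays the same records out in each style. It is only an illustration of the data models, not of how any of the products named above work internally. The sample records, field names and helper functions are all invented for this example.

```python
from collections import defaultdict

# Hypothetical sample records, invented purely for illustration.
records = [
    {"id": "u1", "name": "Ana", "city": "Lisbon", "follows": ["u2"]},
    {"id": "u2", "name": "Ben", "city": "Leeds", "follows": ["u1", "u3"]},
    {"id": "u3", "name": "Chen", "city": "Lisbon", "follows": []},
]

# Document style: each record is kept whole as a nested, JSON-like document.
documents = list(records)

# Key-value style: one value per key, fetched with a single hash lookup.
key_value = {r["id"]: r for r in records}

# Column style: all values of one field sit together.
columns = defaultdict(list)
for r in records:
    for field in ("id", "name", "city"):
        columns[field].append(r[field])

# Graph style: nodes plus explicit edges between them.
edges = {r["id"]: set(r["follows"]) for r in records}


def find_documents(field, value):
    """Document query: match on any field inside the document."""
    return [d["id"] for d in documents if d.get(field) == value]


def lookup_by_key(user_id):
    """Key-value read: fast, because the key is known."""
    return key_value[user_id]["name"]


def scan_by_city(city):
    """Key-value query without a key: every value has to be inspected."""
    return [v["name"] for v in key_value.values() if v["city"] == city]


def count_by_city():
    """Column aggregate: reads the 'city' column only."""
    counts = defaultdict(int)
    for city in columns["city"]:
        counts[city] += 1
    return dict(counts)


def reachable(start):
    """Graph traversal: everyone reachable by following edges from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in edges[node]:
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen
```

The contrast between `lookup_by_key` and `scan_by_city` mirrors the trade-off described for key-value stores: a read by key is a single lookup, while a query on anything else has to walk the whole store.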
For real-time big data analysis, you may also need the speed of an in-memory database, since querying data stored on disk is unlikely to be fast enough.
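As a rough illustration of what "in-memory" means in practice, the sketch below uses SQLite from Python's standard library, which can hold an entire database in RAM when given the special `:memory:` path. SQLite is only a stand-in here, since no particular in-memory product is being recommended, and the `events` table, its columns and the sample readings are assumptions made up for the example.

```python
import sqlite3


def build_in_memory_db(rows):
    """Create a database that lives entirely in RAM and load some rows."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    # Hypothetical schema, invented for this example.
    conn.execute("CREATE TABLE events (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return conn


def average_reading(conn, sensor):
    """Return the mean reading for one sensor, or None if it has no rows."""
    row = conn.execute(
        "SELECT AVG(reading) FROM events WHERE sensor = ?", (sensor,)
    ).fetchone()
    return row[0]


# Made-up sample data.
conn = build_in_memory_db([("a", 1.0), ("a", 3.0), ("b", 10.0)])
```

The trade-off is that the data disappears when the connection closes, so an in-memory layer like this normally sits in front of a durable store rather than replacing it.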
Commercial Or Open Source?
Big data analytics tools from major industry players like Oracle, IBM and SAP will generally have a lot of functionality, be well supported, and could be a good fit for larger, more established businesses. However, they can be pricey and might not give you the flexibility you need. A lot of the newer NoSQL databases are open source, meaning you can download and play with several to find out what works best for you. It’s also generally easier to tweak open-source databases to fit your requirements more closely.
As well as being able to query the data, you will want the ability to visualize it. This is essential when trying to communicate any insights gleaned to anyone without data analysis expertise or, in other words, most people in the organization. Visualization tools can produce meaningful, attractive visualizations without the need for coding, from bar charts and scatter graphs to dashboards, maps and more. Many are also simple enough for non-technical staff to use to create their own analyses and visualizations. They include:
- Tableau (currently widely regarded as the most comprehensive big data visualization tool);
- Infogram (one of the best if you want to do big data visualization in real time);
- CartoDB (if you’re handling lots of location data, this is great for map-based visualization);
- Plot.ly (for stunning 2D and 3D charts);
- Qlik Sense (good at highlighting patterns in data).
Wikipedia is a good starting point to delve deeper into the specifics of different big data analytics tools, but these links may also help:
- There’s a fairly comprehensive list of NoSQL databases here and another with more details about their suitability for different applications here.
- A curated list of ‘awesome big data resources’ on GitHub.
- Import.io’s ‘best big data tools and how to use them’.
- TechTarget’s comparison of some of the leading big data analytics tools, with an emphasis on those from large IT industry vendors.
We look forward to hearing your opinions on, and experience of, the big data analysis tools you’ve tried and tested.