Why Big Data? And what’s so big about it?
At first glance, you may guess the definition from the term itself and then second-guess yourself: “Hey, it definitely isn’t what it sounds like”.
Actually, it is exactly what it sounds like.
Let’s start with a quick reminder of what “data” is:
Data is synonymous with information. In computer science, data is a representation of information, and it can take many forms and structures (e.g., tables, trees, and graphs).
So what exactly is Big Data?
Like we said, it is exactly what it sounds like: Big Data is a collection of data sets so large and complex that traditional data processing methods, such as database management systems and file systems, can no longer cope with them.
How big is it, you might ask?
Well, let’s say your hard drive holds 1 terabyte (1,000 gigabytes). If you’re like most people and use your PC just for browsing the internet, playing media, or playing video games, you’ll be more than satisfied with how much free space you have, and you’ll pretty much never need to delete anything.
Big Data, on the other hand, can reach exabytes in size, and a single exabyte is 1 billion gigabytes.
As of 2012, 2.5 exabytes of data were being created every day, a volume that even the most advanced information management systems weren’t designed to handle.
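To put those numbers side by side, here is a minimal Python sketch of the arithmetic. It assumes decimal units (1 terabyte = 1,000 gigabytes, as above), and the helper name drives_needed is made up for this illustration.

```python
# Decimal (SI) units, matching the "1 terabyte = 1,000 gigabytes" used in the text.
GB = 10**9
TB = 10**12
EB = 10**18

def drives_needed(total_bytes, drive_bytes=TB):
    """Hypothetical helper: how many drives of a given size hold total_bytes."""
    # Ceiling division, so a partly filled drive still counts as a whole one.
    return -(-total_bytes // drive_bytes)

gigabytes_per_exabyte = EB // GB   # 1 billion
daily_bytes = 25 * 10**17          # 2.5 exabytes, the 2012 daily figure
drives_per_day = drives_needed(daily_bytes)

print(gigabytes_per_exabyte)
print(drives_per_day)
```

In other words, one day of data at that rate would fill 2.5 million of those 1 terabyte hard drives.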
Popular Open Source Tools for Big Data
Thanks to the incredible success of the MapReduce architecture, its model was implemented in an Apache open source project named Hadoop.
1- Big Data Analysis Platforms and Tools
Hadoop, MapReduce, GridGain, Storm
2- Databases/Data Warehouses
Cassandra, HBase, MongoDB, CouchDB, Redis
3- Business Intelligence
Talend, Jaspersoft
4- Data Mining
RapidMiner/RapidAnalytics, Mahout, Orange
5- Big Data Search
Lucene, Solr
“Information is the oil of the 21st century, and analytics is the combustion engine.”
Peter Sondergaard, Senior Vice President and Global Head of Research at Gartner, Inc.
In 2004, Google published a paper describing a new programming model called MapReduce, which processes data in parallel across a cluster of nodes. Queries are split, distributed to the nodes, and processed (Map). The results are then gathered, combined, and delivered (Reduce).
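The Map and Reduce steps are easier to picture with a toy example. Below is a minimal single-machine sketch in Python that counts words. It is not Hadoop's API or Google's implementation, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. In a real cluster, each chunk would be handled by a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to a separate node
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

chunks = ["big data is big", "data is the new oil"]
counts = word_count(chunks)
print(counts)
```

The important part is that map_phase only ever looks at one chunk and reduce_phase only ever looks at one key, which is what allows the work to be spread over many machines.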