The Era of Big Data Has Come

Over the past decade, the massive increase in digital data has forced researchers to find new ways to analyze the real world and to anticipate the future. The concept of “Big Data” was born. It mainly consists of storing a huge amount of real-world information in digital form.

What is Big Data?

The term “Big Data” refers to a volume of data so large that no conventional data management and processing system can really handle it. By one widely cited estimate, roughly 2.5 quintillion bytes of data are created every day: information from online searches and shopping, videos, weather readings, and so on. Major online companies such as Amazon, Google, Yahoo! and Facebook were the first to develop this technology for their own use.

The advent of massive data is now seen by many as a new industrial revolution similar to the advent of electricity or steam during the 19th century. Whatever the comparison, Big Data can clearly be seen as a profound source of disruption to our modern society.

The 3Vs of Big Data

Big Data solutions must meet three demanding requirements: (1) the enormous volume of the data, (2) the variety of information it represents, both structured and unstructured, and (3) the velocity at which it must be created, collected and distributed.

In recent years, the new technologies on the market have addressed the 3Vs: volume, variety and velocity. The first were storage technologies, which in particular have led to cloud computing. Then came new technologies for processing and managing databases suited to unstructured data (Hadoop) and high-performance computing models (MapReduce).

Optimizing access times to large databases may require several technologies, such as NoSQL databases (for example MongoDB or Cassandra) and server infrastructures that distribute processing across nodes and keep data in memory:

The first solution, NoSQL databases, provides storage systems considered more efficient than traditional SQL databases for large-scale data analysis.

The second is called massively parallel processing. The Hadoop framework is one example. It combines the HDFS distributed file system, the NoSQL database HBase, and the MapReduce algorithm. MapReduce speeds up application processing by splitting the work across many nodes.
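To make the MapReduce idea more concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It only imitates the three phases (map, shuffle, reduce) in ordinary functions. The function names are illustrative assumptions and are not part of Hadoop's API. On a real cluster, the map and reduce calls would be distributed across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent of the others, which is what
    # lets a real framework run them in parallel on different nodes.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data moves fast"]
    print(word_count(docs))
```

Because every document is mapped independently and every key is reduced independently, both phases can in principle be spread over as many machines as are available. The shuffle step in between is where data must move across the network.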

Evolution of Big Data

The rise of Spark and the decline of MapReduce

Spark is gradually replacing MapReduce. Like every technology field, big data is a constantly changing environment. The landscape evolves rapidly, and new solutions are frequently needed to improve on existing ones. MapReduce and Spark are concrete examples of this trend.

Introduced by Google in 2004, MapReduce was then used in the Nutch project, with backing from Yahoo!, and later became part of the Apache Hadoop project in 2008. The algorithm can process very large volumes of data. Its main drawback is its relative slowness, which is particularly visible on relatively small volumes. As a result, solutions that aim to provide near-instantaneous processing of such volumes have begun to erode MapReduce's influence. In 2014, Google announced that it would replace MapReduce with a SaaS solution called Google Cloud Dataflow.

Spark is an emblematic solution for writing distributed applications, and it comes with conventional processing libraries. It is also one of the fastest-developing Apache projects. In short, it is the obvious successor to MapReduce, especially since it combines many of the tools required in a Hadoop cluster.

The main market players

This industry has attracted many companies: incumbent suppliers of software solutions such as Oracle, SAP and IBM; large web companies such as Google, Facebook and Twitter; and data specialists such as MapR, Hortonworks and Teradata. IT integrators in this sector include big names such as Capgemini, Atos and Accenture. Many startups are also emerging rapidly, such as Criteo, Squid, Ysance, Hurence, Dataiku… Not to mention the schools, universities and training organizations that provide partial or complete courses on these new technologies.

Professionals skilled in these new technologies are still scarce on the market. Yet demand is growing, and many job offers are available online, in the USA, in Europe and worldwide.