Big Data Technologies
Common Characteristics of Big Data Projects
– Adaptability: Also referred to as “scalability”, this is the information system’s ability to respond to high peak loads: the storage and analysis of massive volumes of data.
– Moderate cost: For obvious reasons of ROI, the information system must be able to gain power to meet these peak loads, but at an affordable cost. It is therefore not surprising that Big Data systems are typically built on entry-level or midrange servers: it is the number of servers that drives the system’s growing power, not necessarily the power of each individual server.
– Open source: Most Big Data players have converted their proprietary solutions into open source solutions.
– Fault tolerance: Big Data environments are distributed over many servers. These systems are designed to compensate for the failures that are inevitable in this type of environment (detection and anticipation of failures).
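The fault-tolerance point above can be made concrete with a back-of-the-envelope calculation: if data is replicated across several servers, all replicas must fail for the data to be lost. The failure probability below is an illustrative assumption, not a figure from the text.

```python
# Toy estimate of data-loss risk under replication, assuming each
# server fails independently with probability p during some window.
# The value of p here is purely illustrative.

def loss_probability(p: float, replicas: int) -> float:
    """Probability that every replica of a data block is lost."""
    return p ** replicas

p = 0.01  # assumed per-server failure probability
for r in (1, 2, 3):
    print(f"replicas={r}: loss probability = {loss_probability(p, r):.0e}")
```

Even with unreliable commodity hardware, a few replicas make the overall system far more dependable than any single server, which is what allows Big Data platforms to use entry-level machines.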
Hadoop, The Big Data Framework
Hadoop is an open source framework, or platform. Based on concepts originally published by Google and developed at Yahoo!, Hadoop is the Big Data platform par excellence. While competing frameworks do exist in this market, Hadoop has naturally become the reference Big Data framework for very large volumes of data (projects mostly dealing with volumes greater than 10 terabytes).
Hadoop Distributed File System or HDFS
HDFS (Hadoop Distributed File System) is a distributed file system in which each server in the Hadoop environment hosts part of the data. HDFS is a highly fault-tolerant system (data files are replicated to multiple servers or data centers).
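The idea of each server hosting part of the data can be sketched as follows: a file is split into fixed-size blocks, and each block is assigned to several servers. This is a simplified illustration, not the actual HDFS placement algorithm; the block size and server names are assumptions (real HDFS defaults to 128 MB blocks and 3 replicas).

```python
# Minimal sketch of block splitting and replica placement in a
# distributed file system such as HDFS. Placement is round-robin
# here for simplicity; real HDFS is rack-aware.

from itertools import cycle

def place_blocks(file_size, block_size, servers, replicas=3):
    """Return {block_index: [servers holding a copy of that block]}."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = cycle(servers)
    return {b: [next(ring) for _ in range(replicas)]
            for b in range(n_blocks)}

servers = ["node1", "node2", "node3", "node4"]
# A 300 MB file with 128 MB blocks yields 3 blocks, each on 3 nodes.
print(place_blocks(file_size=300, block_size=128, servers=servers))
```

Because every block lives on several nodes, losing one server costs no data, and any node holding a copy can serve reads or run computations on it.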
Map/Reduce is the Hadoop component that distributes processing directly to where the data is located. The Map phase selects and organizes the data considered relevant for the processing, and the Reduce phase associates or consolidates the data.
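The two phases just described can be sketched with the classic word-count example: Map emits (key, value) pairs, an intermediate shuffle groups them by key, and Reduce consolidates each group. The function names below are illustrative, not the Hadoop Java API, and the shuffle is simulated in memory.

```python
# Word-count sketch of the Map/Reduce model: Map selects and
# organizes the data, Reduce consolidates it per key.

from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Consolidate all values emitted for one key."""
    return word, sum(counts)

def map_reduce(documents):
    groups = defaultdict(list)               # the shuffle/sort step
    for doc in documents:
        for word, count in map_phase(doc):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(map_reduce(["big data big platform", "data platform"]))
# → {'big': 2, 'data': 2, 'platform': 2}
```

In a real Hadoop cluster, the Map and Reduce tasks run on the nodes that already hold the data blocks, so the computation moves to the data rather than the other way around.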
By combining the HDFS and Map/Reduce components, Hadoop gives this distributed architecture exceptional processing capacity. Through data distribution and parallel processing, the processing power of the Big Data environment can be adjusted simply by adding servers (nodes) to the cluster. Hadoop is of particular value on large volumes of data, which can exploit its processing capabilities to the full.
Several Hadoop distributions exist: the open source version from the Apache Foundation, and more advanced, supported versions from vendors such as Cloudera or Hortonworks. Software vendors also offer Hadoop solutions integrated into their own environments (Oracle, IBM, Microsoft).
Big Data is a true ecosystem that is constantly evolving, which makes a comprehensive overview difficult to establish. Readers are therefore encouraged to gather information directly from the leading players in this market.