How To Ingest Data In Hadoop

In this article we will look at different ways to ingest data into Hadoop.

Data Analyst Mar 28, 2016

Hadoop is one of the best solutions for solving our Big Data problems. In Hadoop, data is distributed across the nodes of a cluster, and those nodes compute over the data in parallel. We have a number of options for putting our data into HDFS, but choosing which tool or technique is best for you is the real game here. In this article we will explore what options are available and when to use them.

File Transfer

This is the simplest way of putting or ingesting data into Hadoop. It can be used to transfer any type of file (text, binary, images, etc.). One drawback is that we cannot transform our data while transferring it to HDFS. Another way of doing file transfer is mountable HDFS, which means mounting HDFS as a standard file system. We can use simple HDFS commands to browse, download, and put data. The downside of this approach is that HDFS does not allow random writes; the file system is "write once, read many."
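As a minimal sketch of the file-transfer approach, assuming a running HDFS cluster, the file and directory names below are hypothetical:

```
# Copy a local log file into HDFS (paths are hypothetical examples).
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put server_logs.txt /data/raw/

# Browse and read the data back with simple HDFS commands.
hdfs dfs -ls /data/raw
hdfs dfs -cat /data/raw/server_logs.txt

# HDFS is "write once, read many": you cannot edit a file in place,
# but you can replace it wholesale with -f (force overwrite).
hdfs dfs -put -f server_logs.txt /data/raw/server_logs.txt
```

Note that nothing here transforms the data in flight; the file lands in HDFS exactly as it was on the local disk.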

Sqoop

Sqoop is a command-line application that helps us transfer data from a relational database to HDFS. Internally, Sqoop uses MapReduce mappers that connect to the database over JDBC, select the data, and write it into HDFS. It is the standard, straightforward approach for ingesting data from a Relational Database Management System (RDBMS) into HDFS. Sqoop provides flexibility and extensibility; moreover, it uses multiple mappers to parallelize data ingestion.

This tool is limited to structured data ingestion only. While Sqoop is a generic solution for RDBMS data ingestion, many vendors have specialized products that provide better performance.
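A typical Sqoop import looks like the sketch below; the connection string, credentials, table, and target directory are all hypothetical:

```
# Import the "orders" table from a MySQL database into HDFS,
# using 4 parallel mappers (all names here are assumptions).
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

The `--num-mappers` flag is where the parallelism mentioned above comes from: each mapper opens its own JDBC connection and imports a slice of the table.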

Flume

Flume is a mechanism for moving large volumes of data. It is often used for log data. Flume can take data from several sources, such as files, syslog, and Avro, and can deliver it to several destinations, such as HDFS and HBase. It is distributed and reliable.

It is a highly customizable and reliable way to do near-real-time data loading into HDFS. Some of the disadvantages are that it is relatively complex to configure and it is not the ideal choice for streaming data.
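To give a feel for that configuration complexity, here is a minimal Flume agent sketch that tails an application log and delivers the events to HDFS; the agent name, paths, and capacities are all hypothetical:

```
# Flume agent "a1": one source, one channel, one sink (names are assumptions).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow an application log file via the exec source.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the events into date-partitioned HDFS directories.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/app/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
```

Even this smallest useful pipeline needs a dozen properties, which is why Flume is often described as powerful but fiddly to set up.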

Apache Storm

Apache Storm is an open-source, distributed, real-time computation system that was open-sourced by Twitter. Storm does real-time processing, whereas Hadoop does batch processing. It provides continuous real-time computation with low latency; other notable features are reliability and scalability. We may need other tools to store and query Storm's output.
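For context, a Storm topology is packaged as a jar and submitted to the cluster from the command line; the jar, main class, and topology name below are hypothetical:

```
# Submit a packaged topology to a running Storm cluster
# (jar, class, and topology names are assumptions).
storm jar log-analytics-topology.jar com.example.LogTopology log-analytics
```

Once submitted, the topology runs continuously until it is explicitly killed, which is the key operational difference from a batch MapReduce job.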

Here's a free e-book on Big Data: Understanding Big Data