
What Are Different Data Compression Methods In Hadoop

In this article, we will discuss the different data compression methods available in Hadoop.

Data Analyst Apr 06, 2016

As we all know, a typical Hadoop environment has to store and process large volumes of data, which makes data compression a necessity. The slave nodes (datanodes) hold the data; commonly, each datanode has about 45 TB of raw storage space available for HDFS.
Although Hadoop machines are designed to be inexpensive, adding storage to them can cost you significantly. We can ease this problem by compressing all of our data.
Benefits of Data Compression
  • It saves a great deal of storage space.
  • It also speeds up the transfer of blocks across the cluster.
We have a variety of codecs to choose from, depending on our usage and scenario. If you don’t know what a codec is, here’s a quick definition.
Codec = Compressor + Decompressor

A codec can be a set of hardware or software that compresses and decompresses data using a specific algorithm. We already know that if we ingest a file into HDFS and it is larger than the default block size (128 MB), HDFS automatically splits the file into separate blocks. Here an important concept, splittable compression, comes into play. Most codecs cannot decompress the individual blocks of a file independently, so the whole file has to be assigned to a single mapper.
The data then cannot be processed in parallel, which reduces performance. Speaking of performance, there is a trade-off between the degree of compression, the speed of compression, and storage space.
Generally, a higher degree of compression requires more computation cycles for compressing and decompressing the data, but it also saves more space.
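To get a feel for this trade-off, here is a minimal sketch that compresses the same data with two codecs and reports the compression ratio and timings. It uses the gzip and bz2 modules from the Python standard library rather than Hadoop itself; the `measure` helper and the sample data are made up for illustration, and Snappy and LZO are left out because they need third-party packages. Actual numbers depend heavily on the data and the machine.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress and decompress `data`, returning ratio and timing figures."""
    start = time.perf_counter()
    packed = compress(data)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    assert restored == data  # the round trip must be lossless
    return {
        "codec": name,
        "ratio": len(data) / len(packed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


# Hypothetical sample: repetitive, log-like text of a few megabytes.
sample = b"2016-04-06 INFO datanode block report received\n" * 60_000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for r in results:
    print(
        f"{r['codec']:6} ratio={r['ratio']:.1f} "
        f"compress={r['compress_seconds']:.3f}s "
        f"decompress={r['decompress_seconds']:.3f}s"
    )
```

Running it on a sample of your own data is a quick way to see how much space each codec saves and what it costs in time.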
Some of the common codecs supported by the Hadoop framework are as follows:
  • Gzip -  A compression utility adopted by the GNU project. Its files have the extension .gz. You can use the gunzip command to decompress them.

  • Bzip2 -  From the usability standpoint, Bzip2 and Gzip are similar. Bzip2 achieves a much higher degree of compression than Gzip, but it is also slower. Use the Bzip2 codec when saving space is the priority and the data will rarely be queried.

  • Snappy - Snappy is a codec developed by Google. It provides the fastest compression and decompression among the codecs listed here, but comes with a modest degree of compression.

  • LZO -  Similar to Snappy, LZO gives fast compression and decompression with a modest degree of compression. LZO is licensed under the GNU General Public License (GPL).
Hadoop Codecs

 Codec    File Extension   Splittable?          Degree of Compression   Compression Speed
 Gzip     .gz              No                   Medium                  Medium
 Bzip2    .bz2             Yes                  High                    Slow
 Snappy   .snappy          No                   Medium                  Fast
 LZO      .lzo             No, unless indexed   Medium                  Fast
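
The Splittable column matters because it caps how many mappers can work on a single file. The sketch below is a simplified model of that effect, assuming one split per 128 MB block; the function name `max_parallel_mappers` and the `lzo-indexed` key are made up for illustration and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size mentioned in this article

# Splittability as listed in the table above. "lzo-indexed" is a
# hypothetical key standing for an LZO file that has been indexed.
SPLITTABLE = {
    "gzip": False,
    "bzip2": True,
    "snappy": False,
    "lzo": False,
    "lzo-indexed": True,
}


def max_parallel_mappers(file_size_mb, codec, block_size_mb=BLOCK_SIZE_MB):
    """Upper bound on mappers for one compressed file (simplified model)."""
    blocks = max(1, math.ceil(file_size_mb / block_size_mb))
    # A non-splittable file must be read from the start by a single mapper.
    return blocks if SPLITTABLE[codec] else 1


print(max_parallel_mappers(1024, "gzip"))   # 1
print(max_parallel_mappers(1024, "bzip2"))  # 8
```

Under this model, a 1 GB Gzip file is handled by one mapper, while the same file compressed with Bzip2 can be spread across eight.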
Now that you know about the most commonly used compression codecs in Hadoop, let's see how you can choose among them.
All compression algorithms must make a trade-off between the degree of compression and the speed of compression/decompression, so we must choose a codec according to our scenario. If we are building an archive of data that will rarely be queried, a codec like Bzip2 is a good fit. If we need to access the data often, we can use a codec like Snappy, which gives us the fastest compression and decompression.
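
As a rough illustration, this rule of thumb can be written as a small helper. The function `choose_codec` and its parameters are hypothetical and only encode the guidance and the table from this article; a real decision should also take measurements on your own data into account.

```python
def choose_codec(rarely_queried, needs_splitting=False):
    """Toy rule of thumb based on the codec table in this article."""
    if rarely_queried:
        # Archive data: space matters most, slow compression is acceptable.
        return "bzip2"
    if needs_splitting:
        # Frequently read, large files: LZO is fast and splittable once indexed.
        return "lzo (indexed)"
    # Frequently read data where speed matters most.
    return "snappy"


print(choose_codec(rarely_queried=True))
print(choose_codec(rarely_queried=False))
print(choose_codec(rarely_queried=False, needs_splitting=True))
```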
