

SNAPPY COMPRESSION RATIO ARCHIVE
ORC – This is short for Optimized Row Columnar. The ORC format can be considered an improved version of RCFILE. It provides a larger default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), which is optimized for large sequential reads on HDFS, giving more throughput and fewer files, and so reducing the load on the namenode. Unlike RCFILE, which relies on the metastore to know data types, an ORC file understands the data types by using type-specific encoders, so it can optimize compression for each type. It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index that can be used to skip blocks of rows that do not matter.

For integer and floating-point types (tinyint, smallint, int, bigint, float, double), the column statistics include the minimum, maximum, and sum. If the sum overflows (a long, in the integer case) at any point during the calculation, no sum is recorded. For strings, the minimum value, maximum value, and the sum of the lengths of the values are recorded; that is, sum stores the total length of all strings.

PARQUET – This is another row columnar file format with a design similar to that of ORC. What's more, Parquet is supported by a wider range of projects in the Hadoop ecosystem than ORC, which has mainly been supported by Hive and Pig.

Considering the maturity of Hive, ORC is the suggested format if Hive is the main tool used in your Hadoop environment. If you use several tools in the Hadoop ecosystem, PARQUET is a better choice in terms of adaptability.

Hadoop Archive File (HAR) is another type of file format, used to pack HDFS files into archives. It is an option for storing a large number of small files in HDFS, since storing them directly in HDFS is not very efficient. However, HAR still has some limitations that make it unpopular, such as an immutable archive process, not being splittable, and compatibility issues.

Compression techniques in Hive can significantly reduce the amount of data transferred between mappers and reducers through proper intermediate output compression, as well as the output data size in HDFS through output compression.
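The effect of compression on data size, as described above, can be illustrated with a small sketch. It is not Hive code: it uses Python's standard zlib module as a stand-in for Snappy (which is not in the standard library), and the row layout and the helper names compress_block and compression_ratio are assumptions made only for this example. Real ratios depend on the codec and on the data.

```python
import zlib


def compress_block(rows):
    """Serialize rows as tab-separated lines, then compress the bytes.

    zlib is used here only as a stand-in codec; Snappy would trade a
    lower compression ratio for faster compression and decompression.
    """
    raw = "\n".join("\t".join(map(str, row)) for row in rows).encode("utf-8")
    packed = zlib.compress(raw)
    return raw, packed


def compression_ratio(raw, packed):
    """Uncompressed size divided by compressed size."""
    return len(raw) / len(packed)


# Hypothetical intermediate output: repetitive rows compress well.
rows = [(i % 10, "user_%d" % (i % 10), "click") for i in range(10_000)]
raw, packed = compress_block(rows)
ratio = compression_ratio(raw, packed)

print("raw bytes:       ", len(raw))
print("compressed bytes:", len(packed))
print("ratio:            %.1f" % ratio)
```

In Hive itself, the corresponding behavior is switched on through settings such as hive.exec.compress.intermediate (for data passed between mappers and reducers) and hive.exec.compress.output (for the final output in HDFS), rather than through application code.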

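The ORC column statistics and the lightweight index described earlier can be sketched in a few lines. This is a simplified model under stated assumptions, not the ORC reader: the names IntColumnStats, int_stats, string_stats, and blocks_to_read are hypothetical, each block is a plain Python list, and string length is counted in characters.

```python
from dataclasses import dataclass
from typing import List, Optional

LONG_MIN, LONG_MAX = -(2 ** 63), 2 ** 63 - 1  # range of a 64-bit long


@dataclass
class IntColumnStats:
    count: int
    minimum: int
    maximum: int
    total: Optional[int]  # None when the running sum overflowed a long


def int_stats(values: List[int]) -> IntColumnStats:
    """Collect count/min/max/sum for a non-empty block of integers."""
    total: Optional[int] = 0
    for v in values:
        if total is not None:
            total += v
            # Once the sum leaves the long range, no sum is recorded.
            if not LONG_MIN <= total <= LONG_MAX:
                total = None
    return IntColumnStats(len(values), min(values), max(values), total)


def string_stats(values: List[str]) -> dict:
    """For strings, 'sum' is the total length of all values."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(len(v) for v in values),
    }


def blocks_to_read(blocks: List[List[int]], lo: int, hi: int) -> List[int]:
    """Return indexes of blocks whose [min, max] range can overlap [lo, hi].

    Blocks outside the range are skipped without reading their rows.
    """
    keep = []
    for i, block in enumerate(blocks):
        stats = int_stats(block)
        if stats.maximum >= lo and stats.minimum <= hi:
            keep.append(i)
    return keep


blocks = [[1, 5, 9], [20, 25, 30], [100, 150, 200]]
print(blocks_to_read(blocks, 22, 120))  # the first block is skipped
print(int_stats([2 ** 62, 2 ** 62]))    # total is None: the sum overflowed
print(string_stats(["ab", "c", "def"]))
```

A query that filters on a value range only has to read the blocks returned by blocks_to_read; this is the sense in which the index lets a reader skip blocks of rows that do not matter.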