- To set user timezone:
SET mapreduce.map.java.opts="-Duser.timezone=UTC";
SET mapreduce.reduce.java.opts="-Duser.timezone=UTC";
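To sanity-check the setting, a query along these lines can help (a sketch; "events" and "ts_seconds" are hypothetical names, and the query has to actually spawn map tasks, since the flags above only affect the task JVMs, not a local fetch):
-- from_unixtime() formats epoch seconds using the task JVM's user.timezone,
-- so with the settings above it should print UTC wall-clock times.
SELECT from_unixtime(ts_seconds) FROM events LIMIT 10;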
- Compress results
-- Determines whether the output of the final map/reduce job in a query is compressed or not.
SET hive.exec.compress.output=true;
-- Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not.
SET hive.exec.compress.intermediate=true;
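These two flags only switch compression on; the codec is chosen separately. One possible pairing (assuming Snappy is installed on the cluster and your Hive release exposes the intermediate-codec property):
-- Codec for the final job output (Hadoop property)
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Codec for the intermediate output between stages
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;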
- Avro settings – Compression
-- Supported codecs are snappy and deflate.
SET avro.output.codec=snappy;
- Parquet settings – Compression
SET parquet.compression=snappy;
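These codecs take effect when Hive writes Avro or Parquet files, for example when inserting into a table stored in one of those formats. A minimal sketch with hypothetical tables (on Hive versions that support the STORED AS AVRO/PARQUET shorthand):
-- Rows written here become Parquet files compressed per parquet.compression above
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE) STORED AS PARQUET;
INSERT OVERWRITE TABLE sales_parquet SELECT id, amount FROM sales_raw;
-- Likewise for Avro output, per avro.output.codec above
CREATE TABLE sales_avro (id BIGINT, amount DOUBLE) STORED AS AVRO;
INSERT OVERWRITE TABLE sales_avro SELECT id, amount FROM sales_raw;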
- Hive – Hint – STREAMTABLE
-- In every map/reduce stage of the join, the table to be streamed can be specified via a hint.
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);
All three tables are joined in a single map/reduce job, and the values for a particular value of the key for tables b and c are buffered in memory in the reducers. Then, for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
- Joins – map join
Map join – Hive can automatically convert a common join into a map join based on the input file size.
-- Prevent automatic map joins for large tables:
SET hive.auto.convert.join=false;
-- (OR) force a map join by specifying the MAPJOIN hint:
SELECT /*+ MAPJOIN(s) */ idOne, idTwo, value
FROM bigTable b JOIN smallTableOne s ON (b.idOne = s.idTwo);
-- Optimize chains of map joins:
SELECT /*+ MAPJOIN(time_dim, date_dim) */ count(*)
FROM store_sales
JOIN time_dim ON (ss_sold_time_sk = t_time_sk)
JOIN date_dim ON (ss_sold_date_sk = d_date_sk)
WHERE t_hour = 8 AND d_year = 2002;
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
- Sort Merge bucket map join
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
- Bucket map join
-- Enforce bucketing while inserting
SET hive.enforce.bucketing=true;
-- Enforce sorting in each bucket while inserting
SET hive.enforce.sorting=true;
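For these join strategies to apply, both sides must be bucketed on the join key (and, for the sort-merge variant, sorted) into the same number of buckets. A minimal sketch with hypothetical tables:
-- Both tables bucketed and sorted on the join key into the same number of buckets
CREATE TABLE orders_bucketed (order_id BIGINT, cust_id BIGINT, amount DOUBLE)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 32 BUCKETS;
CREATE TABLE customers_bucketed (cust_id BIGINT, name STRING)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 32 BUCKETS;
-- With hive.enforce.bucketing/sorting set, these inserts produce properly bucketed, sorted files
INSERT OVERWRITE TABLE orders_bucketed SELECT order_id, cust_id, amount FROM orders_raw;
INSERT OVERWRITE TABLE customers_bucketed SELECT cust_id, name FROM customers_raw;
-- This join is now eligible for a (sort merge) bucket map join
SELECT o.order_id, c.name, o.amount
FROM orders_bucketed o JOIN customers_bucketed c ON (o.cust_id = c.cust_id);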
- Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
SET io.sort.mb=800;
Note: io.sort.mb should be 10 * io.sort.factor.
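Applying that rule of thumb (io.sort.factor is the number of streams merged at once; on newer Hadoop releases the equivalent properties are mapreduce.task.io.sort.factor and mapreduce.task.io.sort.mb):
-- 80 merge streams * ~10 MB per stream => 800 MB sort buffer
SET io.sort.factor=80;
SET io.sort.mb=800;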
- Memory
-- Shuffle memory
SET mapreduce.reduce.shuffle.memory.limit.percent=0.65;
-- Map memory
SET mapreduce.map.java.opts=-Xmx8192m;
-- Reduce memory
SET mapreduce.reduce.java.opts=-Xmx12288m;
- Execute jobs in parallel
-- Whether to execute jobs in parallel. Applies to MapReduce jobs that can run in parallel, for example jobs processing different source tables before a join. As of Hive 0.14, also applies to move tasks that can run in parallel, for example moving files to insert targets during multi-insert.
SET hive.exec.parallel=true;
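The queries that benefit are those with independent stages, such as a multi-insert reading one source into several targets (a sketch with hypothetical tables; with hive.exec.parallel enabled, the independent branches can be scheduled concurrently):
-- The two aggregation branches do not depend on each other, so their jobs may run in parallel
FROM web_logs
INSERT OVERWRITE TABLE hits_by_page SELECT page_id, count(*) GROUP BY page_id
INSERT OVERWRITE TABLE hits_by_user SELECT user_id, count(*) GROUP BY user_id;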
- Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job
SET mapreduce.job.reduce.slowstart.completedmaps=0.7;
- Data skew – Skew join optimization
If your data has a highly uneven distribution, skew can arise such that a small number of compute nodes must handle the bulk of the computation.
SET hive.optimize.skewjoin=true;
-- Syntax: SET hive.skewjoin.key=<threshold>;
SET hive.skewjoin.key=100000;
-- Skew can also occur in GROUP BY data.
-- Hive will first trigger an additional MapReduce job whose map output is randomly distributed to the reducers to avoid data skew.
SET hive.groupby.skewindata=true;
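With those flags set, a join like the following (hypothetical tables, where a handful of cust_id values account for most of the rows) has its heavily skewed keys handled in a follow-up map join stage rather than overloading a single reducer:
-- Keys with more than hive.skewjoin.key rows are deferred to a separate map-join job
SELECT o.order_id, c.segment
FROM orders o JOIN customers c ON (o.cust_id = c.cust_id);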
- Avoid shunting of results in Hue
-- File format to use for a query's intermediate results. This takes care of result shunting in Hue.
SET hive.query.result.fileformat=SequenceFile;