Hive – Best Practices

By | August 8, 2017
  1. Testing with Dummy data – Check here
  2. Beeline doesnt honor tabs, if you are using any editors, you can replace tabs with space to maintain the structure and still use beeline effectively.

    
       Ex: CREATE TABLE IF NOT EXISTS default.test1 (id<tab>INT,name STRING); -- this will fail
       Hive will throw an error saying "Error: Error while compiling statement: FAILED: ParseException line ... cannot recognize input near 'idINT' ..."
       This should be fixed in the latest version of Hive 2.1.0
    
  3. Include ‘IF NOT EXISTS’ in your create statements. Prevents an error from occurring if the table exists.

    
    CREATE TABLE IF NOT EXISTS <table_name> …
    CREATE VIEW IF NOT EXISTS <view_name> …
    
  4. Include ‘IF EXISTS’ in your drop statements
    
    DROP TABLE IF EXISTS <table_name> PURGE;
    DROP VIEW IF EXISTS <view_name>;
    
  5. If multiple columns are concatenated by same separated use CONCAT_WS instead of CONCAT
    
    CONCAT_WS('---', column1, column2, column3)
    
  6. Joins
    Place the bigger table first in the join query

    
    SELECT idOne, idTwo, value 
        FROM bigTable 
        JOIN smallTableOne 
        ON (bigTable.idOne = smallTableOne.idTwo)
        ;
    
  7. Analyze statement
    If we are re-using the table, its good practice to run analyse at the end of insert all data to the table. Ex:

    
    -- Non-partitioned table
    ANALYZE TABLE ${TABLE_SAMPLE1} COMPUTE STATISTICS;
        
    -- Partitioned table
    ANALYZE TABLE ${TABLE_SAMPLE1} PARTITION (year, month) COMPUTE STATISTICS;
    
  8. Set job name
    SET mapreduce.job.name = My first job - Source1;</pre>
    </li>
    <li>Set queue name
    <pre><xmp>SET mapreduce.job.queuename=root.queuename1;</pre>
    </li>
    <li>Set staging dir to save intermediate results
    <pre><xmp>
    -- This is HDFS location
    SET hive.exec.stagingdir=/tmp/staging;
    
  9. Set hive variables
    Whenever we have a dynamic element in our query, its a good practice to set that as a variable
    Ex: year of interest, month, table name, table location, etc.

    Always end the variable declaration with semicolon ‘;’

    
    -- Comment goes here. Always comment one line above the variable, dont comment at the end of the line, beeline scipts ran as a file does not honor the comment at the end of the line
    SET hivevar:DB_NAME=default;
    SET hivevar:TABLE_SAMPLE1=test1;
    SET hivevar:HDFS_LOGICAL_NAME=hdfs://quickstart.cloudera;
    SET hivevar:LOCATION_SAMPLE1='${HDFS_LOGICAL_NAME}/user/cloudera/training/test1';
    SET hivevar:YEAR_OF_ANALYSIS=2005;
        
    -- To verify the value of the variable, you can select that variable as part of testing
    SELECT "${DB_NAME}" , "${TABLE_SAMPLE1}" , "${LOCATION_SAMPLE1}" , "${YEAR_OF_ANALYSIS}";
    -- Output:
    +----------+--------+------------------------------------------------------------+-------+--+
    |   _c0    |  _c1   |                            _c2                             |  _c3  |
    +----------+--------+------------------------------------------------------------+-------+--+
    | default  | test1  | 'hdfs://quickstart.cloudera/user/cloudera/training/test1'  | 2005  |
    +----------+--------+------------------------------------------------------------+-------+--+
    
    
    Example:
    DROP TABLE IF EXISTS ${DB_NAME}.${TABLE_SAMPLE1} PURGE;
    
    CREATE TABLE IF NOT EXISTS ${DB_NAME}.${TABLE_SAMPLE1}
    (
        id  INT
        ,name   STRING
        ,year_of_join   STRING
    )
    STORED AS TEXTFILE
    LOCATION ${LOCATION_SAMPLE1}
    ;
    
    !sh date
    INSERT INTO TABLE ${DB_NAME}.${TABLE_SAMPLE1} 
    VALUES (1, 'puneetha' , 2016)
    ,(2, 'bhoomika' , 2005)
    ;
    !sh date    
    
    !sh date
    -- Use the variables in below query
    SELECT * FROM ${DB_NAME}.${TABLE_SAMPLE1} WHERE year_of_join = ${YEAR_OF_ANALYSIS};
    !sh date
    -- Output:
    +-----------+-------------+---------------------+--+
    | test1.id  | test1.name  | test1.year_of_join  |
    +-----------+-------------+---------------------+--+
    | 2         | bhoomika    | 2005                |
    +-----------+-------------+---------------------+--+
    
  10. To initialize the variable, leave it blank, this way the variable will be recognized but replaced with nothing

    
    -- With Variable values
    SET hivevar:HDFS_LOGICAL_NAME=hdfs://quickstart.cloudera;
    SET hivevar:LOCATION_SAMPLE1='${HDFS_LOGICAL_NAME}/user/cloudera/training/test1';
    
    -- To verify the value of the variable, you can select that variable as part of testing
    SELECT "${LOCATION_SAMPLE1}";
    Output:
    +----------------------------------------------------+--+
    |                        _c0                         |
    +----------------------------------------------------+--+
    | 'hdfs://quickstart.cloudera/user/cloudera/training/test1' |
    +----------------------------------------------------+--+
    
    
    
    -- With empty value, only to recognize variable name:
    SET hivevar:HDFS_LOGICAL_NAME=;
    SET hivevar:LOCATION_SAMPLE1='${HDFS_LOGICAL_NAME}/user/cloudera/training/test1';
    
    -- To verify the value of the variable, you can select that variable as part of testing
    SELECT "${LOCATION_SAMPLE1}";
    Output:
    +----------------------------------+--+
    |               _c0                |
    +----------------------------------+--+
    | '/user/cloudera/training/test1'  |
    +----------------------------------+--+
    
    
  11. Change a managed table to an external table with “TRUE” and vice versa with “FALSE”.
    TBLPROPERTIES (“EXTERNAL”=”TRUE”) in release 0.6.0+

    
    Ex: ALTER TABLE  ${DB_NAME}.${TABLE_SAMPLE1} SET TBLPROPERTIES ("EXTERNAL"="TRUE");
    
  12. Drop a specific partition
    
    ALTER TABLE db_name.table_name DROP IF EXISTS PARTITION (year=2014, month=5) PURGE;
    
  13. Print comments
    
    !sh echo "This is my first query"
    
  14. Save result set
    
    !record <file_name>
    -- Your query
    !record
    
  15. Run SQL file
    
    !run <file_name>
    
  16. Print date
    
    !sh date
    
  17. Print all configurations
    
    --There will be too many configurations dumped to the screen. If you really intend to capture all the settings, save the result set as pointed in point 1
    SET -v;
    

Leave a Reply

Your email address will not be published. Required fields are marked *