Terraform (IaC – Infrastructure as Code) Testing

Terraform is an Infrastructure as Code tool that enables us to deploy predictable infrastructure. For software and data applications we usually focus on writing code, testing it, making sure it follows all the conventions, making sure there are no security issues, and so on. But when it comes to deploying infrastructure, we don't normally treat Terraform or… Read More »

DynamoDB – Data Modelling – NoSQL Workbench

DynamoDB NoSQL Workbench lets you:

- Design data models
- Visualize data models
- Query data
- Build new data models from scratch
- Modify and export existing DynamoDB data models
- Add tables and indexes to data models
- Add sample data to data models
- Visualize data layout for data models
- Commit data models to DynamoDB
- Build data plane operations using a structured… Read More »
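When a Workbench model is committed, it reaches DynamoDB as a CreateTable-style definition. A minimal sketch of that shape in Python (the table and attribute names here are made up, not from the post; the dict mirrors the keyword arguments boto3's `create_table` accepts):

```python
# A minimal CreateTable-style definition as a plain dict, the shape that
# boto3's dynamodb client create_table(**table_definition) accepts.
# "Orders", "pk" and "sk" are example names, not from the post.
table_definition = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "pk", "KeyType": "HASH"},   # partition key
        {"AttributeName": "sk", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

print(table_definition["TableName"])
# Orders
```

Workbench generates the equivalent of this structure for you; seeing it spelled out helps when reviewing what a "Commit to DynamoDB" actually creates.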

Hive Macros examples

Data Type check – check if a given column is a number:

```sql
DROP TEMPORARY MACRO IF EXISTS isNumber;
CREATE TEMPORARY MACRO isNumber (input INT)
  CASE WHEN CAST(input AS INT) IS NULL THEN 'NO' ELSE 'YES' END;

-- Usage:
SELECT isNumber(100), isNumber("123"), isNumber("12sd");

-- Output
+------+------+------+
| _c0  | _c1  | _c2…
```

Read More »
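The macro's cast-and-NULL-check logic can be sketched outside Hive as well. A minimal Python equivalent (the function name is mine; Hive's `CAST(input AS INT)` returns NULL on failure, and a failed `int()` conversion plays the same role here):

```python
def is_number(value):
    """Return 'YES' if the value converts to an int, else 'NO',
    mimicking the Hive isNumber macro's CAST ... IS NULL check."""
    try:
        int(str(value))
        return "YES"
    except ValueError:
        return "NO"

print(is_number(100), is_number("123"), is_number("12sd"))
# YES YES NO
```

Note this is only an approximation: Hive's cast semantics differ from Python's `int()` in some edge cases (for example, values with decimal points).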

Hive useful query stuff

Generate random float numbers:

```sql
SELECT ARRAY(
    CAST(ROUND(RAND() * 100, 2) AS FLOAT),
    CAST(ROUND(RAND() * 1000, 2) AS FLOAT),
    CAST(ROUND(RAND() * 10000, 2) AS FLOAT),
    CAST(ROUND(RAND() * -100, 2) AS FLOAT),
    CAST(ROUND(RAND() * -1000, 2) AS FLOAT),
    CAST(ROUND(RAND() * -10000, 2) AS FLOAT)
)[CAST((FLOOR(RAND() * 2)) AS INT)];
```

Output: every time you run the above… Read More »
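For quick experimentation outside Hive, the same pick-one-of-several trick can be sketched in Python (the function name is mine; `random.random()` stands in for Hive's `RAND()`):

```python
import random

def random_float():
    """Pick one value at random from several scaled random floats,
    mirroring the Hive ARRAY(...)[FLOOR(RAND() * N)] pattern."""
    candidates = [
        round(random.random() * scale, 2)
        for scale in (100, 1000, 10000, -100, -1000, -10000)
    ]
    return candidates[random.randrange(len(candidates))]

print(random_float())  # a different value on every run
```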

Getting started with Big data (Hadoop, Spark)!

A few questions are probably running through your mind right now: How and where do I start in the Big Data space? What are Big Data, Hadoop, Spark, Hive, HBase, etc.? What should I learn first? Which path should I take: Data Science, Data Engineer, or Architect?

PIG – general stuff

Adding a jar:

```
REGISTER /local/path/to/myjar_name.jar
```

Set queue name – specify it in the Pig script:

```
SET mapreduce.job.queuename 'my_queuename';
```

or specify it while running the Pig script:

```
$ pig -Dmapreduce.job.queuename=my_queuename -f my_script.pig
```

Set job name – specify it in the Pig script:

```
SET mapreduce.job.name 'Testing HCatalog';
```

or specify it while running the Pig script:

```
$ pig -Dmapreduce.job.name="Testing HCatalog" -f my_script.pig
```
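When launching Pig jobs from another program, those `-D` overrides can be assembled programmatically. A hedged Python sketch that builds the command line shown above (the helper name and the example values are mine):

```python
import subprocess  # only needed if you actually run the command

def pig_command(script, queue=None, job_name=None):
    """Build a pig invocation with optional -D property overrides,
    mirroring the SET statements shown above."""
    cmd = ["pig"]
    if queue:
        cmd.append(f"-Dmapreduce.job.queuename={queue}")
    if job_name:
        cmd.append(f"-Dmapreduce.job.name={job_name}")
    cmd += ["-f", script]
    return cmd

cmd = pig_command("my_script.pig", queue="my_queuename",
                  job_name="Testing HCatalog")
print(" ".join(cmd))
# pig -Dmapreduce.job.queuename=my_queuename -Dmapreduce.job.name=Testing HCatalog -f my_script.pig
# subprocess.run(cmd, check=True)  # uncomment on a machine with Pig installed
```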

Hive – Timezone problem

Timezone problem – any function that triggers a MapReduce job hits this problem, since it takes the local timezone of the machine where the mapper/reducer runs. In our case, let's say our servers are in the German timezone, i.e. CET.

```sql
-- With original settings
SET system:user.country;

+-------------------------+--+
|           set           |
+-------------------------+--+
| system:user.country=GB  |
+-------------------------+--+

-- Original…
```

Read More »
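The underlying issue, the same instant rendering differently depending on a machine's local timezone, can be illustrated in Python (the zone names are examples, not from the post):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One instant in time...
instant = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

# ...formatted on two "worker" machines with different local timezones:
berlin = instant.astimezone(ZoneInfo("Europe/Berlin"))  # CET, UTC+1 in winter
london = instant.astimezone(ZoneInfo("Europe/London"))  # GMT, UTC+0 in winter

print(berlin.strftime("%H:%M"))  # 13:00
print(london.strftime("%H:%M"))  # 12:00
```

This is exactly why a query that spreads timestamp formatting across mappers on differently-configured hosts can produce inconsistent results.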

Search for a file in HDFS using Solr Find tool

HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected actions to them. By default, it prints the list of matching HDFS file paths to stdout, one path per line. Search… Read More »
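For local experimentation, the walk-and-match behaviour can be sketched with plain Python and `os.walk`; this is a local-filesystem stand-in, not HdfsFindTool itself:

```python
import fnmatch
import os

def find_files(root, pattern):
    """Walk a directory tree and yield paths whose file name matches
    the glob pattern, one path at a time, find-style."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)

# Print matching paths one per line, as the real tool does on stdout:
for path in find_files(".", "*.log"):
    print(path)
```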

Solr Installation and create new collection – standalone

Note: I am running this on Windows.

Download Solr

Download Solr from here. I have downloaded solr-7.0.1: http://mirrors.whoishostingthis.com/apache/lucene/solr/7.0.1/solr-7.0.1.zip

For this example, we will extract it to the folder C:\Users\Public\hadoop_ecosystem\solr-7.0.1

Start Solr

Open a command prompt and type the commands below:

```
> c:
> cd C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin
C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin> solr start -p 8983
```

Output: Waiting up to 30 to see… Read More »
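Once Solr reports it is running, you can check a core from Python with a plain HTTP request to Solr's standard ping handler (the core name `my_collection` is hypothetical; the port matches the `-p 8983` used above):

```python
from urllib.request import urlopen  # only needed for the live check below

SOLR_BASE = "http://localhost:8983/solr"

def ping_url(core):
    """Build the standard Solr ping URL for a given core/collection."""
    return f"{SOLR_BASE}/{core}/admin/ping?wt=json"

url = ping_url("my_collection")  # hypothetical core name
print(url)
# http://localhost:8983/solr/my_collection/admin/ping?wt=json

# On a machine where Solr is actually running:
# with urlopen(url) as resp:
#     print(resp.status)  # 200 when the core is healthy
```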