- Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Open source Apache project that exchanges data between a database and HDFS
- Can import all tables, single tables, or even partial tables (with free-form SQL queries) into HDFS
- Data can be imported in a variety of formats
- Can also be used to export data from HDFS back into standard relational databases
- Sqoop is a client-side application that imports data using Hadoop MapReduce
- An import involves three main steps:
  - Examine the table details
  - Create and submit a job to the cluster
  - Fetch records from the table and write the data to HDFS
- Under the hood, the dataset being transferred is sliced into partitions, and each slice is fetched over its own connection to the database, as sketched below
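To make the parallel slicing concrete, here is an illustrative sketch (not actual Sqoop output) of the queries Sqoop issues when a numeric split column is used with four mappers; the table and the boundary values are placeholders for this example.

```sql
-- Sqoop first asks the database for the range of the split column
SELECT MIN(customer_id), MAX(customer_id) FROM customers;  -- say it returns 1 and 12435

-- It then divides that range among the mappers, each running a bounded query
-- over its own database connection (boundary values here are made up):
SELECT * FROM customers WHERE customer_id >= 1    AND customer_id < 3110;    -- mapper 1
SELECT * FROM customers WHERE customer_id >= 3110 AND customer_id < 6218;    -- mapper 2
SELECT * FROM customers WHERE customer_id >= 6218 AND customer_id < 9327;    -- mapper 3
SELECT * FROM customers WHERE customer_id >= 9327 AND customer_id <= 12435;  -- mapper 4
```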
We have a few datasets in MySQL on the Cloudera QuickStart VM, which makes it a good source for a few example Sqoop commands that import this data into Hive.
Log in to the MySQL instance on the Cloudera QuickStart VM with the credentials below:
username: root
password: cloudera

```
$ mysql -uroot -pcloudera

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
12 rows in set (0.19 sec)

mysql> USE retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.21 sec)
```

For our Sqoop import, we will use the tables “customers” and “categories”.
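Before importing, it can help to sanity-check the source tables in the same MySQL session; a minimal example (the column layout is whatever the stock retail_db schema ships with):

```
mysql> DESCRIBE customers;
mysql> DESCRIBE categories;
mysql> SELECT COUNT(*) FROM customers;
mysql> SELECT COUNT(*) FROM categories;
```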
In Hive, create a database named 'sqoop_db' to hold the imported data. This can be any database; 'sqoop_db' is just an example.

```sql
-- Creates database with name 'sqoop_db'
CREATE DATABASE IF NOT EXISTS sqoop_db
COMMENT "This is the database used for sqoop import"
LOCATION "/user/cloudera/.sqoop_db";
```
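To confirm the database exists before running any imports, a quick check from beeline (the same connection string is used again at the end of this post):

```
$ beeline -u jdbc:hive2://quickstart.cloudera:10000/default -n cloudera -p cloudera

beeline> SHOW DATABASES;
beeline> DESCRIBE DATABASE sqoop_db;   -- shows the HDFS location we set above
```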
We will import from the MySQL table “retail_db.customers” into the Hive table “sqoop_db.customers”:

```
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "root" --password "cloudera" \
  --mapreduce-job-name "This is our first sqoop import" \
  --query "SELECT * FROM customers WHERE \$CONDITIONS" \
  --hive-import \
  --create-hive-table \
  --fields-terminated-by ',' \
  --hive-table sqoop_db.customers \
  --split-by "customer_id" \
  --target-dir "/user/cloudera/customers"
```
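Once the job finishes, a quick sanity check (a sketch, not part of the original run; with --hive-import the files are moved out of the staging --target-dir into the table's location under the database directory, so the Hive-side count is the more reliable check):

```
# Data for the managed table should end up under the database location
$ hdfs dfs -ls /user/cloudera/.sqoop_db/customers

# Row count in the new Hive table
$ beeline -u jdbc:hive2://quickstart.cloudera:10000/default -n cloudera -p cloudera \
    -e "SELECT COUNT(*) FROM sqoop_db.customers;"
```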
We will import from the MySQL table “retail_db.categories” into the Hive table “sqoop_db.categories”, this time specifying a boundary condition on the split range:

```
# Specify a limit/boundary condition on how many rows of data to import
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "root" --password "cloudera" \
  --mapreduce-job-name "This is our second sqoop import - with boundary condition" \
  --query "SELECT * FROM categories WHERE \$CONDITIONS" \
  --boundary-query "SELECT 10, 1000 FROM categories" \
  --hive-import \
  --create-hive-table \
  --fields-terminated-by ',' \
  --hive-table sqoop_db.categories \
  --split-by "category_id" \
  --target-dir "/user/cloudera/categories"
```

Note: In a production setup, if you do not constrain this, the import's parallel connections can exceed the database's connection limit.
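The boundary query above hard-codes the bounds (10 and 1000); a more typical variant, sketched here rather than taken from the original run, derives them from the data itself and also caps the number of parallel mappers:

```
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "root" --password "cloudera" \
  --query "SELECT * FROM categories WHERE \$CONDITIONS" \
  --boundary-query "SELECT MIN(category_id), MAX(category_id) FROM categories" \
  --split-by "category_id" \
  --num-mappers 2 \
  --target-dir "/user/cloudera/categories_minmax"
```

The path /user/cloudera/categories_minmax is a placeholder so this run does not collide with the earlier target directory.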
-m, --num-mappers <n> : Use n map tasks to import in parallel.

Check whether all tables were imported into the Hive database:
```
## Log in to Hive
$ beeline -u jdbc:hive2://quickstart.cloudera:10000/default -n cloudera -p cloudera

beeline> USE sqoop_db;
beeline> SHOW TABLES;
```
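Beyond listing the tables, you can also inspect them; for example (assuming both imports above completed):

```
beeline> DESCRIBE FORMATTED customers;
beeline> SELECT * FROM customers LIMIT 5;
beeline> SELECT COUNT(*) FROM categories;
```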
Can you tell me the limitations of using --query in Sqoop?