Import Data From Other Relational Database Management Systems (RDBMS)

You can use Apache Sqoop to automate importing data from an RDBMS other than MySQL, Oracle, PostgreSQL, or SQL Server.

Apache Sqoop uses JDBC (Java Database Connectivity) to connect to databases. Make sure you have the JDBC driver JAR file for your database.
  1. Check that your job can download the .jar file from HDFS:

    hdfs dfs -get /path/folder/my_JDBC_file.jar (1) (2)

    Where:

    1 /path/folder/ is the path to your folder.
    2 my_JDBC_file.jar is the name of your JAR file.
  2. Add the .jar file to the HADOOP_CLASSPATH environment variable:

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:./my_JDBC_file.jar

    This step is required: it lets Hadoop find and load the classes in the JAR file at runtime.

  3. Run the following Sqoop script:

    # Specify the JDBC driver name of your RDBMS, as it appears in the connection URL (jdbc:<name>://...).
    driver=<rdms> (1)
    # Specify the IP or DNS address of the database server.
    ip=<database-ip> (2)
    # Specify the port number on which the database server is listening.
    port=<port> (3)
    # Specify the username used to log in to the database.
    username=myuser (4)
    # Specify the password used to log in to the database.
    password=mypwd (4)
    # Specify the name of the database from which the data will be imported.
    database=mydb (4)
    # Specify the name of the table in the database from which the data will be imported.
    table=mytable (4)
    # Specify the destination folder in HDFS where data will be stored.
    hdfsdest=/user/hdfs/$table (4)
    
    # Import data from the specified table in the RDBMS into HDFS.
    sqoop import --connect "jdbc:$driver://$ip:$port/$database" --username "$username" --password "$password" \
    --target-dir "$hdfsdest" \
    --num-mappers 1 \
    --table "$table"

    Where:

    1 <rdms> must be replaced with the JDBC driver name for the RDBMS you want to use, as it appears in the connection URL.
    2 <database-ip> must be replaced with the IP address or DNS name of your database server.
    3 <port> must be replaced with the port number on which the database server is listening.
    4 The remaining values must also be replaced with the values that match your database and HDFS setup.
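
The classpath and connection-string steps above can be sketched as two small shell helpers. This is a minimal illustration, not part of Sqoop: the function names `build_jdbc_url` and `append_to_classpath` are hypothetical, and the `db2` driver name, host, and port are example values chosen only to show the shape of the URL.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and example values are hypothetical.

# Build a JDBC connection URL of the form jdbc:<driver>://<host>:<port>/<database>,
# matching the --connect argument used in the Sqoop script.
build_jdbc_url() {
  local driver="$1" ip="$2" port="$3" database="$4"
  printf 'jdbc:%s://%s:%s/%s\n' "$driver" "$ip" "$port" "$database"
}

# Append a JAR to a classpath string, leaving the string unchanged
# if the JAR is already listed and avoiding a leading ':' when it is empty.
append_to_classpath() {
  local classpath="$1" jar="$2"
  case ":$classpath:" in
    *":$jar:"*) printf '%s\n' "$classpath" ;;
    "::")       printf '%s\n' "$jar" ;;
    *)          printf '%s\n' "$classpath:$jar" ;;
  esac
}

# Example usage with placeholder values.
export HADOOP_CLASSPATH="$(append_to_classpath "${HADOOP_CLASSPATH:-}" ./my_JDBC_file.jar)"
url="$(build_jdbc_url db2 192.0.2.10 50000 mydb)"
echo "$url"
```

The resulting value of `url` can be passed to `sqoop import --connect "$url"`. Depending on the database, Sqoop may also need the JDBC driver class name, which it accepts through its `--driver` option; check your driver's documentation for the class name.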