Import Data From a MySQL, Oracle, PostgreSQL, or SQL Server Database

You can use Apache Sqoop to automate the import of data into HDFS from various databases, such as MySQL, Oracle, PostgreSQL, and SQL Server.

Make sure you have sufficient privileges to access the database.
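
You can verify connectivity and privileges before importing by running the sqoop list-tables tool with the same connection values used below. A minimal sketch, assuming the MySQL example values from step 1:

    # List the tables visible to the account (adjust the driver, address,
    # port, credentials, and database to match your setup).
    sqoop list-tables --connect jdbc:mysql://127.0.0.1:3306/mydb \
    --username myuser --password mypwd
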
  1. Import the table data. Use the example that matches your database:

    MySQL

    # Specify the JDBC driver to connect to the MySQL database.
    driver=mysql
    # Specify the IP or DNS address of the MySQL database server.
    ip=127.0.0.1
    # Specify the port number on which the MySQL database server is listening.
    port=3306
    # Specify the username used to log in to the MySQL database.
    username=myuser (1)
    # Specify the password used to log in to the MySQL database.
    password=mypwd (1)
    # Specify the name of the MySQL database from which the data will be imported.
    database=mydb (1)
    # Specify the name of the table in the MySQL database from which the data will be imported.
    table=mytable (1)
    # Specify the destination folder in HDFS where data will be stored.
    hdfsdest=/user/hdfs/$table (1)
    
    # Import data from the specified table in the MySQL database into HDFS.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table

    Where:

    1 These fields must be replaced with the correct values according to your database and HDFS setup.
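
    Once the import completes, you can verify that the files landed in HDFS. A minimal check, assuming the $hdfsdest value set above (with --num-mappers 1, Sqoop typically writes a single part-m-00000 file):

    # List the imported files and preview the first rows.
    hdfs dfs -ls $hdfsdest
    hdfs dfs -cat $hdfsdest/part-m-00000 | head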

    Oracle

    # Specify the JDBC driver to connect to the Oracle database.
    driver=oracle:thin
    # Specify the IP or DNS address of the Oracle database server.
    ip=127.0.0.1
    # Specify the port number on which the Oracle database server is listening.
    port=1521
    # Specify the username used to log in to the Oracle database.
    username=myuser (1)
    # Specify the password used to log in to the Oracle database.
    password=mypwd (1)
    # Specify the name of the Oracle database from which the data will be imported.
    database=mydb (1)
    # Specify the name of the table in the Oracle database from which the data will be imported.
    table=mytable (1)
    # Specify the destination folder in HDFS where data will be stored.
    hdfsdest=/user/hdfs/$table (1)
    
    # Import data from the specified table in the Oracle database into HDFS.
    sqoop import --connect jdbc:$driver:@//$ip:$port/$database --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table

    Where:

    1 These fields must be replaced with the correct values according to your database and HDFS setup.
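
    The connect string above uses the Oracle service-name form. If your instance is addressed by a SID instead, the thin driver expects a colon-separated URL; a sketch, assuming a hypothetical SID named mysid:

    # Alternative connect string for a SID-addressed Oracle instance
    # (replace mysid with your SID).
    sqoop import --connect jdbc:$driver:@$ip:$port:mysid --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table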

    PostgreSQL

    # Specify the JDBC driver to connect to the PostgreSQL database.
    driver=postgresql
    # Specify the IP or DNS address of the PostgreSQL database server.
    ip=127.0.0.1
    # Specify the port number on which the PostgreSQL database server is listening.
    port=5432
    # Specify the username used to log in to the PostgreSQL database.
    username=myuser (1)
    # Specify the password used to log in to the PostgreSQL database.
    password=mypwd (1)
    # Specify the name of the PostgreSQL database from which the data will be imported.
    database=mydb (1)
    # Specify the name of the table in the PostgreSQL database from which the data will be imported.
    table=mytable (1)
    # Specify the destination folder in HDFS where data will be stored.
    hdfsdest=/user/hdfs/$table (1)
    
    # Import data from the specified table in the PostgreSQL database into HDFS.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table

    Where:

    1 These fields must be replaced with the correct values according to your database and HDFS setup.
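
    By default, the import reads from the public schema. Sqoop's PostgreSQL connector also accepts a connector-specific --schema argument placed after a standalone -- separator; a sketch, assuming a hypothetical schema named myschema:

    # Import the table from a non-default schema
    # (replace myschema with your schema name).
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table \
    -- --schema myschema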

    SQL Server

    # Specify the JDBC driver to connect to the SQL Server database.
    driver=sqlserver
    # Specify the IP or DNS address of the SQL Server database server.
    ip=127.0.0.1
    # Specify the port number on which the SQL Server database server is listening.
    port=1433
    # Specify the username used to log in to the SQL Server database.
    username=myuser (1)
    # Specify the password used to log in to the SQL Server database.
    password=mypwd (1)
    # Specify the name of the SQL Server database from which the data will be imported.
    database=mydb (1)
    # Specify the name of the table in the SQL Server database from which the data will be imported.
    table=mytable (1)
    # Specify the destination folder in HDFS where data will be stored.
    hdfsdest=/user/hdfs/$table (1)
    
    # Import data from the specified table in the SQL Server database into HDFS.
    sqoop import --connect "jdbc:$driver://$ip:$port;DatabaseName=$database" --username $username --password $password \
    --target-dir $hdfsdest \
    --num-mappers 1 \
    --table $table

    Where:

    1 These fields must be replaced with the correct values according to your database and HDFS setup.
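
    Note that passing --password on the command line exposes the password in the process list and in your shell history. Sqoop also accepts -P, which prompts for the password interactively, or --password-file, which reads it from a permission-restricted file; a sketch, assuming a hypothetical HDFS path /user/hdfs/.db.pwd:

    # Prompt for the password instead of passing it inline.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username -P \
    --target-dir $hdfsdest --num-mappers 1 --table $table

    # Or read it from a protected password file.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username \
    --password-file /user/hdfs/.db.pwd \
    --target-dir $hdfsdest --num-mappers 1 --table $table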
  2. OPTIONAL: You can add the following options to the sqoop import command. Two sketches combining several of them follow the list:

    --verbose
      Prints additional log output.

    --delete-target-dir
      Deletes the HDFS target directory before the import.

    --query "SELECT * FROM $table WHERE \$CONDITIONS"
      Imports the result of a query instead of a table. The --table option must be removed from the command line.

    --num-mappers n
      Splits the import into n parallel data flows, where n is typically the number of data nodes and must be replaced with a number. This option must be used with the --split-by 'column' option, which specifies the column to split on.

    --as-textfile
      Imports the data as plain text.

    --null-non-string '\\N'
      The string to be written for a null value in non-string columns. Used for Impala and Hive.

    --null-string '\\N'
      The string to be written for a null value in string columns. Used for Impala and Hive.

    --hive-import
      Creates a Hive and Impala table with the imported data.

    --hive-table $table
      Specifies the name of the Hive and Impala table.

    --hive-overwrite
      Overwrites the existing data of the table. If this option is not present, the imported data is appended to the existing data.

    --hive-database db
      Specifies the name of the Hive and Impala database, where db must be replaced with the name of your database.

    --hive-drop-import-delims
      Removes \n, \r, and \01 from string fields when importing into Hive.

    -D oracle.sessionTimeZone=Europe/Paris
      For Oracle databases only. Selects the time zone used for your date format.
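
    Two sketches showing how these options combine, reusing the placeholder variables from step 1. The first imports the result of a free-form query in parallel (id is a hypothetical split column):

    # Parallel import of a query: --table is removed, \$CONDITIONS is
    # required in the WHERE clause, and --split-by names the split column.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username --password $password \
    --query "SELECT * FROM $table WHERE \$CONDITIONS" \
    --split-by 'id' \
    --num-mappers 4 \
    --target-dir $hdfsdest

    The second loads the table straight into Hive, overwriting any previous load and normalizing null values:

    # Import into Hive, overwriting existing data and writing \N for nulls.
    sqoop import --connect jdbc:$driver://$ip:$port/$database --username $username --password $password \
    --table $table \
    --num-mappers 1 \
    --delete-target-dir --target-dir $hdfsdest \
    --null-string '\\N' --null-non-string '\\N' \
    --hive-import --hive-database mydb --hive-table $table --hive-overwrite \
    --hive-drop-import-delims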