Read and Write Tables From Impala With Talend

How to read and write tables from Impala without and with a Kerberized cluster with Talend.

Without Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing the information of the Saagie platform. To this end, create a context group and define the contexts and variables required.

In this case, define the following context variables:

Name Type Value

IP_Impala

String

Your Impala IP address

Port_Impala

String

Your Impala port

Table_Impala

String

The Impala source table

BDD_Impala

String

The Impala working database

User_Impala

String

The Impala user authentication name

New_Table_Impala

String

The Impala target table

Password

Password

The Impala user account password

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, read the whole section on Using contexts and variables on Talend Studio User Guide.

Read Tables

This tutorial is an example of how to count lines.

  1. Create a new job in Talend.

  2. Add the following components:

    • tImpalaConnection, to establish a connection to an Impala database to be reused by other Impala components in your job.

    • tImpalaInput, to execute the select queries to extract data from Impala and send the data to the following component.

    • tLogRow, to display the result.

  3. Link these components as follows:

    hive read table with talend
    The above example is with Hive components.
  4. Double-click each component and configure their settings as follows:

    • tImpalaConnection

    • tImpalaInput

    • tLogRow

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the Host field, enter the database server IP address.

    5. In the Port field, enter the listening port number of the database server.

    6. In the Database field, enter the name of the database.

    7. In the Username and Password fields, enter the authentication information of the user.

    8. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    For more information, you can refer to Talend’s documentation on the tImpalaConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tImpalaConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type field, select Built-In to fill in manually the query statement or build it graphically using SQLBuilder.

    7. In the Query field, enter your database query as follows: "SELECT COUNT(*) FROM "+context.Table_Impala+"".

    For more information, you can refer to Talend’s documentation on the tImpalaInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.

Write Tables

This tutorial is an example of how to create a new table and insert ten rows.

  1. Create a new job in Talend.

  2. Add the following components:

    • tImpalaConnection, to establish a connection to an Impala database to be reused by other Impala components in your job.

    • tImpalaRow, to execute a SQL query at each iteration of the Talend flow.

  3. Link these components with the OnSubjobOk connection.

    hive write table with talend
    The above example is with Hive components.
  4. Double-click each component and configure their settings as follows:

    • tImpalaConnection

    • tImpalaRow

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the Host field, enter the database server IP address.

    5. In the Port field, enter the listening port number of the database server.

    6. In the Database field, enter the name of the database.

    7. In the Username and Password fields, enter the authentication information of the user.

    8. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    For more information, you can refer to Talend’s documentation on the tImpalaConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tImpalaConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type field, select Built-In to fill in manually the query statement or build it graphically using SQLBuilder.

    7. In the Query field, enter your database query as follows: "CREATE TABLE "+context.New_Table_Impala+" AS SELECT * FROM "+context.Table_Impala+" LIMIT 10".

    For more information, you can refer to Talend’s documentation on the tImpalaRow component.
  5. Run the job.

With Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing the information of the Saagie platform. To this end, create a context group and define the contexts and variables required.

In this case, define the following context variables:

Name Type Value

Kerberos_Login

String

The Kerberos user authentication name

Kerberos_Pwd

String

The Kerberos user account password

Kerberos_Principal_Name

String

The Kerberos principal name

IP_Impala

String

Your Impala IP address

Port_Impala

int | Integer

Your Impala port

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, read the whole section on Using contexts and variables on Talend Studio User Guide.

Read Tables With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tImpalaConnection, to establish a connection to an Impala database to be reused by other Impala components in your job.

    • tImpalaInput, to execute the select queries to extract data from Impala and send the data to the following component.

    • tLogRow, to display the result.

  3. Link these components as follows:

    impala read table with kerberos with talend

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tImpalaConnection

    • tImpalaInput

    • tLogRow

    In the Basic settings tab:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command in array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.kerberos_pwd+"' | kinit "+context.kerberos_login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the Host field, enter the database server IP address.

    5. In the Port field, enter the listening port number of the database server.

    6. In the Username and Password fields, enter the authentication information of the user.

    7. Select the Use Kerberos authentication option to access your Impala system using Kerberos security, and enter the Kerberos principal of your Impala system.

    For more information, you can refer to Talend’s documentation on the tImpalaConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tImpalaConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type field, select Built-In to fill in manually the query statement or build it graphically using SQLBuilder.

    7. In the Query field, enter your database query.

    For more information, you can refer to Talend’s documentation on the tImpalaInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.

Write Tables With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tImpalaConnection, to establish a connection to an Impala database to be reused by other Impala components in your job.

    • tRowGenerator, to generate data.

    • tImpalaOutput, to connect to an Impala database and write data in an Impala table.

  3. Link these components as follows:

    impala write table with kerberos with talend

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tImpalaConnection

    • tRowGenerator

    • tImpalaOutput

    In the Basic settings tab:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command in array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.kerberos_pwd+"' | kinit "+context.kerberos_login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the Host field, enter the database server IP address.

    5. In the Port field, enter the listening port number of the database server.

    6. In the Username and Password fields, enter the authentication information of the user.

    7. Select the Use Kerberos authentication option to access your Impala system using Kerberos security, and enter the Kerberos principal of your Impala system.

    For more information, you can refer to Talend’s documentation on the tImpalaConnection component.
    For more information, you can refer to Talend’s documentation on the tRowGenerator component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tImpalaConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Action list, select whether you want to OVERWRITE the old data already existing in the destination or only APPEND the new data to the existing one.

    7. Select the Extended insert option to combine multiple rows of data into one single INSERT action. This can speed up the insert operation.

    For more information, you can refer to Talend’s documentation on the tImpalaOutput component.
  5. Run the job.