Read and Write Tables From Hive With Talend

This page explains how to read and write Hive tables with Talend, both without and with a Kerberized cluster.

Without Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing the connection information of your Saagie platform. To this end, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name             Type      Value
IP_Hive          String    Your Hive IP address
Port_Hive        String    Your Hive port
Table_Hive       String    The Hive source table
BDD_Hive         String    The Hive working database
User_Hive        String    The Hive user authentication name
New_Table_Hive   String    The Hive target table
Password         Password  The Hive user account password

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.
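
In component fields, these Repository variables are referenced as Java expressions of the form context.<Name>. As a minimal sketch (the field-to-variable mapping is an assumption based on the variables above), a tHiveConnection could be filled in as follows:

  Host:     context.IP_Hive
  Port:     context.Port_Hive
  Database: context.BDD_Hive
  Username: context.User_Hive
  Password: context.Password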

Read Tables

This tutorial shows how to count the rows of a Hive table.

  1. Create a new job in Talend.

  2. Add the following components:

    • tHiveConnection, to establish a connection to a Hive database to be reused by other Hive components in your job.

    • tHiveInput, to extract data from Hive and send the data to the following component.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: hive read table with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHiveConnection

    • tHiveInput

    • tLogRow

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. From the Connection mode list, select a connection mode.

    5. From the Hive server list, select the Hive server through which you want the job using this component to execute queries on Hive.

    6. In the Host field, enter the database server IP address.

    7. In the Port field, enter the listening port number of the database server.

    8. In the Database field, enter the name of the database.

    9. In the Username and Password fields, enter the authentication information of the user.

    10. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    For more information, you can refer to Talend’s documentation on the tHiveConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List drop-down list, select the tHiveConnection component to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type list, select Built-In to fill in the query statement manually or to build it graphically using SQLBuilder.

    7. In the Query field, enter your database query as follows: "SELECT COUNT(*) FROM "+context.Table_Hive.
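
      At run time, Talend concatenates this Java expression into plain HiveQL. For example, if Table_Hive were set to the hypothetical value forecast, Hive would receive:

      SELECT COUNT(*) FROM forecast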

    For more information, you can refer to Talend’s documentation on the tHiveInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.

Write Tables

This tutorial shows how to create a new table and fill it with ten rows from an existing table.

  1. Create a new job in Talend.

  2. Add the following components:

    • tHiveConnection, to establish a connection to a Hive database to be reused by other Hive components in your job.

    • tHiveRow, to execute a SQL query at each iteration of the Talend flow.

  3. Link these components with the OnSubjobOk connection.

    [Image: hive write table with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHiveConnection

    • tHiveRow

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. From the Connection mode list, select a connection mode.

    5. From the Hive server list, select the Hive server through which you want the job using this component to execute queries on Hive.

    6. In the Host field, enter the database server IP address.

    7. In the Port field, enter the listening port number of the database server.

    8. In the Database field, enter the name of the database.

    9. In the Username and Password fields, enter the authentication information of the user.

    10. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    For more information, you can refer to Talend’s documentation on the tHiveConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List drop-down list, select the tHiveConnection component to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type list, select Built-In to fill in the query statement manually or to build it graphically using SQLBuilder.

    7. In the Query field, enter your database query as follows: "CREATE TABLE "+context.New_Table_Hive+" AS SELECT * FROM "+context.Table_Hive+" LIMIT 10".
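
      At run time, this expression resolves to a CREATE TABLE ... AS SELECT statement. For example, with the hypothetical values forecast and new_forecast, Hive would receive the statement below, which creates a ten-row copy of the source table:

      CREATE TABLE new_forecast AS SELECT * FROM forecast LIMIT 10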

    For more information, you can refer to Talend’s documentation on the tHiveRow component.
  5. Run the job.

With Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing the connection information of your Saagie platform. To this end, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name                     Type           Value
Kerberos_Login           String         The Kerberos user authentication name
Kerberos_Pwd             String         The Kerberos user account password
Kerberos_Principal_Name  String         The Kerberos principal name
Kerberos_Principal_Hive  String         The Kerberos principal of the Hive service
IP_Hive                  String         Your Hive IP address
Port_Hive                int | Integer  Your Hive port
Uri_Path                 String         The URI where the file is stored in HDFS

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.
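
As with the first context group, these variables are referenced as context.<Name> Java expressions in component fields. The mapping below is a plausible sketch only; the exact Kerberos field names depend on your Talend Studio and cluster versions:

  Host:            context.IP_Hive
  Port:            context.Port_Hive
  Hive principal:  context.Kerberos_Principal_Hive
  Principal name:  context.Kerberos_Principal_Name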

Read Tables With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to obtain a Kerberos ticket by running the kinit command.

    • tHiveConnection, to establish a connection to a Hive database to be reused by other Hive components in your job.

    • tHiveInput, to extract data from Hive and send the data to the following component.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: hive read table with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHiveConnection

    • tHiveInput

    • tLogRow

    In the Basic settings tab:

    1. Select the Use Array Command option.
      This activates the Command field.

    2. In the Command field, enter the following system command as an array, one parameter per line:

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. From the Connection mode list, select a connection mode.

    5. From the Hive server list, select the Hive server through which you want the job using this component to execute queries on Hive.

    6. In the Host field, enter the database server IP address.

    7. In the Port field, enter the listening port number of the database server.

    8. In the Database field, enter the name of the database.

    9. In the Username and Password fields, enter the authentication information of the user.

    10. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    11. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    For more information, you can refer to Talend’s documentation on the tHiveConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List drop-down list, select the tHiveConnection component to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Query type list, select Built-In to fill in the query statement manually or to build it graphically using SQLBuilder.

    7. In the Query field, enter your database query, for example "SELECT COUNT(*) FROM my_table", where my_table is a placeholder for your source table name.

    For more information, you can refer to Talend’s documentation on the tHiveInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.

Write Tables With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to obtain a Kerberos ticket by running the kinit command.

    • tHiveConnection, to establish a connection to a Hive database to be reused by other Hive components in your job.

    • tHDFSConnection, to establish an HDFS connection.

    • tHiveCreateTable, to create Hive tables that fit a wide range of Hive data formats.

    • tRowGenerator, to generate data.

    • tHDFSOutput, to write the data flows it receives into a given HDFS file system.

    • tHiveLoad, to write data of different formats into a given Hive table or to export data from a Hive table to a directory.

  3. Link these components as follows:

    [Image: hive write table with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHiveConnection

    • tHDFSConnection

    • tHiveCreateTable

    • tRowGenerator

    • tHDFSOutput

    • tHiveLoad

    In the Basic settings tab:

    1. Select the Use Array Command option.
      This activates the Command field.

    2. In the Command field, enter the following system command as an array, one parameter per line:

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. From the Connection mode list, select a connection mode.

    5. From the Hive server list, select the Hive server through which you want the job using this component to execute queries on Hive.

    6. In the Host field, enter the database server IP address.

    7. In the Port field, enter the listening port number of the database server.

    8. In the Database field, enter the name of the database.

    9. In the Username and Password fields, enter the authentication information of the user.

    10. In the Additional JDBC Settings field, you can specify additional connection properties for the database connection you are creating.

    11. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    For more information, you can refer to Talend’s documentation on the tHiveConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system.

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List drop-down list, select the tHiveConnection component to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the Table name field, enter the name of the table.

    6. From the Action on table list, select the action to be carried out for creating a table.

    7. From the Format list, select the data format to which the table to be created is dedicated.

    8. Select the Set Delimited row format option.

    For more information, you can refer to Talend’s documentation on the tHiveCreateTable component.
    For more information, you can refer to Talend’s documentation on the tRowGenerator component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Select the Use an existing connection option.

    4. From the Component List drop-down list, select the tHDFSConnection component to reuse the connection details already defined.

    5. From the Type list, select the type of the file to be processed.

    6. From the Action list, select the action that you want to perform on the file.

    7. In the Row separator field, identify the end of a row.

    8. In the Field separator field, enter a character, a string, or a regular expression to separate fields for the transferred data.
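
    The separators declared here must be consistent with the Set Delimited row format settings of tHiveCreateTable; otherwise, Hive will misread the file written to HDFS. Typical values (assumptions; adapt them to your table definition):

      Row separator:   "\n"
      Field separator: ";"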

    For more information, you can refer to Talend’s documentation on the tHDFSOutput component.

    In the Basic settings tab:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List drop-down list, select the tHiveConnection component to reuse the connection details already defined.

    4. From the Load action list, select the action you need to carry out for writing data into the specified destination.

    5. In the File path field, enter the directory you need to read data from or write data in, depending on the action you have selected from the Load action list.

    6. In the Table name field, enter the name of the Hive table you need to write data in.

    7. From the Action on file list, select the action to be carried out for writing data.
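
    The File path would typically reference the Uri_Path context variable defined earlier, so that tHiveLoad reads from the same HDFS location that tHDFSOutput writes to (an assumption based on the context variables above):

      File path: context.Uri_Path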

    For more information, you can refer to Talend’s documentation on the tHiveLoad component.
  5. Run the job.