Read, Write, and List Files From HDFS With Talend

How to read, write, and list files in HDFS with Talend, both with and without a Kerberized cluster.

Before you begin:

To follow this procedure, you must have a Repository containing your Saagie platform information. To do so, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name              Type        Value
IP_HDFS           String      Your HDFS IP address
Port_HDFS         String      Your HDFS port
Folder_HDFS       String      The folder to be selected in HDFS
File_HDFS         String      The name of the file to be selected from HDFS
Folder_Local      Directory   The local directory in which to store the files retrieved from HDFS
Input_File        File        The local file
Name_File         String      The name of the retrieved file
Name_File_Input   String      The name of the local file
User_HDFS         String      The HDFS user authentication name

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.
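
These context variables are exposed to your job as fields of the generated context object and are combined with plain Java string concatenation in component fields. A minimal sketch, assuming the variable names from the table above:

    // NameNode URI built from context variables (used in tHDFSConnection below)
    "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/"

    // Full HDFS path of the file to read or write
    context.Folder_HDFS + "/" + context.File_HDFS

    // Local destination of a file retrieved from HDFS
    context.Folder_Local + "/" + context.Name_File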

Without Kerberos

Read Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tHDFSInput, to extract data from an HDFS file so that other components can process it.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to read a file from HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tHDFSInput

    • tLogRow

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties (these map onto the HDFS client configuration used in the sketch after this procedure):

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSInput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, and add a flow variable.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File Name field, browse to, or enter the path pointing to the data to be used in the file system.

    7. From the Type list, select the type of the file to be processed.

    8. In the Row separator field, enter the separator used to identify the end of a row.

    9. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    10. In the Header field, you can set values to ignore the header of the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
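
Under the hood, this job boils down to reading from HDFS through the Hadoop client API. The following sketch is not the code Talend generates; it only illustrates how the tHDFSConnection settings (NameNode URI, username, Hadoop properties) and the tHDFSInput file path map onto org.apache.hadoop.fs.FileSystem. The nameservice id, NameNode hostnames, user, and file path are placeholder values.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Equivalent of the Hadoop properties entered in tHDFSConnection;
            // replace "nameservice1", "nn1", and "nn2" with your cluster values.
            conf.set("dfs.nameservices", "nameservice1");
            conf.set("dfs.ha.namenodes.nameservice1", "nn1,nn2");
            conf.set("dfs.client.failover.proxy.provider.nameservice1",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            conf.set("dfs.namenode.rpc-address.nameservice1.nn1", "nn1:8020");
            conf.set("dfs.namenode.rpc-address.nameservice1.nn2", "nn2:8020");
            conf.setBoolean("dfs.client.use.datanode.hostname", false); // Use datanode hostname disabled

            // NameNode URI and username, as in tHDFSConnection.
            URI nameNode = URI.create("hdfs://10.0.0.1:8020/");          // "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
            FileSystem fs = FileSystem.get(nameNode, conf, "hdfs_user"); // context.User_HDFS

            // File to read, as in the tHDFSInput File Name field.
            Path file = new Path("/data/example/input.csv");             // context.Folder_HDFS + "/" + context.File_HDFS
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // tLogRow simply prints each row
                }
            }
            fs.close();
        }
    }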

Write Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tFileInputDelimited, to read a delimited file row by row, split each row into fields, and send the fields, as defined in the schema, to the next component.

    • tHDFSOutput, to write data into a given HDFS.

  3. Link these components as follows:

    [Image: linking the components to write a file to HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tFileInputDelimited

    • tHDFSOutput

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties:

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tFileInputDelimited:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. In the File Name/Stream field, enter the name and path of the file to be processed.

    3. In the Row separator field, enter the separator used to identify the end of a row.

    4. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    5. You can select the CSV options option and define your options.

    6. In the Header field, enter the number of rows to be skipped at the beginning of the file.

    7. In the Footer field, enter the number of rows to be skipped at the end of the file.

    8. In the Limit field, enter the maximum number of rows to be processed.

    9. From the Schema list, select Built-In to create and store the schema locally for this component only.

    10. Select the Skip empty rows option.

    For more information, you can refer to Talend’s documentation on the tFileInputDelimited component.

    In the Basic settings tab of tHDFSOutput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, and add a flow variable as Input and Output.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File name field, browse to, or enter the location of the file to write data to; a rough equivalent of this write is sketched after this procedure.

      This file is created automatically if it does not exist.
    7. From the Type list, select the type of the file to be processed.

    8. From the Action list, select the action that you want to perform on the file.

    9. In the Row separator field, enter the separator used to identify the end of a row.

    10. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSOutput component.
  5. Run the job.
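
The write job follows the same pattern, this time creating a file in HDFS. A minimal sketch, assuming the same Configuration and FileSystem setup as in the read sketch above; the target path and sample rows are placeholders, and the "Create" action corresponds to fs.create:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        // 'fs' is assumed to be obtained exactly as in the read sketch above.
        static void writeRows(FileSystem fs) throws Exception {
            // Target file, as in the tHDFSOutput File name field; it is created if missing.
            Path target = new Path("/data/example/output.csv");
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(target, true), StandardCharsets.UTF_8))) {
                // One row per line, fields joined with the configured field separator (";" here).
                writer.write(String.join(";", "id", "name"));
                writer.newLine();
                writer.write(String.join(";", "1", "example"));
                writer.newLine();
            }
        }
    }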

List Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection reusable by other HDFS components.

    • tHDFSList, to retrieve a list of files or folders from an HDFS directory.

    • tHDFSProperties, to create a single row flow that displays the properties of a file processed in HDFS.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to list files in HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tHDFSList

    • tHDFSProperties

    • tLogRow

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties:

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Select the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSList:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. In the HDFS directory field, browse to, or enter the path pointing to the data to be used in the file system.

    5. From the FileList Type list, select the type of input you want to iterate on.

    6. Select the Use Glob Expressions as Filemask option.

    7. In the Files field, add as many file masks as required (see the sketch after this procedure for an equivalent glob pattern).

    8. You can use the Order by and Order action features to sort your data.

    For more information, you can refer to Talend’s documentation on the tHDFSList component.

    In the Basic settings tab of tHDFSProperties:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the File field, browse to, or enter the path pointing to the data to be used in the file system.

    For more information, you can refer to Talend’s documentation on the tHDFSProperties component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
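
Listing relies on the same connection, plus a glob pattern. The sketch below is a rough equivalent of tHDFSList iterating on a filemask and tHDFSProperties reporting each file's properties; it assumes the FileSystem from the read sketch above, and the directory and pattern are placeholders:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListSketch {
        // 'fs' is assumed to be obtained as in the read sketch above.
        static void listFiles(FileSystem fs) throws Exception {
            // Glob filemask, as in the tHDFSList Files table.
            FileStatus[] matches = fs.globStatus(new Path("/data/example/*.csv"));
            if (matches == null) {
                return; // nothing matched the pattern
            }
            for (FileStatus status : matches) {
                // Roughly the kind of properties tHDFSProperties exposes for each file.
                System.out.println(status.getPath()
                        + " size=" + status.getLen()
                        + " modified=" + status.getModificationTime()
                        + " isDirectory=" + status.isDirectory());
            }
        }
    }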

With Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing your Saagie platform information. To do so, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name                      Type     Value
Kerberos_Login            String   The Kerberos user authentication name
Kerberos_Pwd              String   The Kerberos user account password
Kerberos_Principal_Name   String   The Kerberos principal name
Uri_Path                  String   The URI used to store files in HDFS

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.

Read Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tHDFSInput, to extract data from an HDFS file so that other components can process it.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to read a file from HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tHDFSInput

    • tLogRow

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line. A rough Java equivalent of this command is sketched after this procedure.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSInput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Select the Use an existing connection option.

    4. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    5. In the File Name field, browse to, or enter the path pointing to the data to be used in the file system.

    6. From the Type list, select the type of the file to be processed.

    7. In the Row separator field, enter the separator used to identify the end of a row.

    8. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    9. In the Header field, you can set values to ignore the header of the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
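
For context, the tSystem array command above simply pipes the password to kinit so that a Kerberos ticket is written to the default ticket cache before the HDFS components run. A rough Java equivalent of that command, with placeholder values standing in for context.Kerberos_Login and context.Kerberos_Pwd:

    public class KinitSketch {
        public static void main(String[] args) throws Exception {
            String login = "user@REALM.LOCAL"; // placeholder for context.Kerberos_Login
            String password = "secret";        // placeholder for context.Kerberos_Pwd

            // Same three-element array as in the tSystem Command field.
            Process kinit = new ProcessBuilder(
                    "/bin/bash",
                    "-c",
                    "echo '" + password + "' | kinit " + login)
                    .inheritIO()
                    .start();

            if (kinit.waitFor() != 0) {
                throw new IllegalStateException("kinit failed");
            }
        }
    }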

Write Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tRowGenerator, to generate the data to be written.

    • tHDFSOutput, to write data into a given HDFS.

  3. Link these components as follows:

    [Image: linking the components to write a file to HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tRowGenerator

    • tHDFSOutput

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear. A rough client-side equivalent of this option is sketched after this procedure.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.
    For more information, you can refer to Talend’s documentation on the tRowGenerator component.

    In the Basic settings tab of tHDFSOutput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, if needed.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File name field, browse to, or enter the location of the file to write data to.

      This file is created automatically if it does not exist.
    7. From the Type list, select the type of the file to be processed.

    8. From the Action list, select the action that you want to perform on the file.

    9. In the Row separator field, enter the separator used to identify the end of a row.

    10. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSOutput component.
  5. Run the job.
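
Once kinit has populated the ticket cache, the Use kerberos authentication option roughly corresponds to the client-side configuration below. This is a sketch only, not the code Talend generates; the NameNode URI and principal are placeholder values:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedHdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Enable Kerberos authentication on the client side.
            conf.set("hadoop.security.authentication", "kerberos");
            // NameNode principal, typically of the form "hdfs/_HOST@YOUR.REALM".
            conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@REALM.LOCAL");

            // With a valid ticket cache (created by kinit), UserGroupInformation
            // resolves the current Kerberos identity automatically.
            UserGroupInformation.setConfiguration(conf);

            FileSystem fs = FileSystem.get(URI.create("hdfs://10.0.0.1:8020/"), conf); // placeholder NameNode URI
            System.out.println("Connected as " + UserGroupInformation.getCurrentUser());
            fs.close();
        }
    }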

List Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection reusable by other HDFS components.

    • tHDFSList, to retrieve a list of files or folders from an HDFS directory.

    • tRowGenerator, to generate data.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to list files in HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tHDFSList

    • tRowGenerator

    • tLogRow

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSList:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. In the HDFS directory field, browse to, or enter the path pointing to the data to be used in the file system.

    5. From the FileList Type list, select the type of input you want to iterate on.

    6. Select the Use Glob Expressions as Filemask option.

    7. In the Files field, add as many file masks as required.

    8. You can use the Order by and Order action features to sort your data.

    For more information, you can refer to Talend’s documentation on the tHDFSList component.

    In the tRowGenerator editor, define the structure of the data to be generated as follows:

    1. Add a column by clicking the plus (+) button.

    2. Define the nature of the data in the Type column by selecting a value from the list.

    3. Set the environment variable.

    4. In the Number of Rows for RowGenerator field, enter the number of rows to generate.

    For more information, you can refer to Talend’s documentation on the tRowGenerator component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
