Read, Write, and List Files From HDFS With Talend

How to read, write, and list files in HDFS with Talend, both with and without a Kerberized cluster.

Before you begin:

To follow this procedure, you must have a Repository containing your Saagie platform information. To do so, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name              Type        Value
IP_HDFS           String      Your HDFS IP address
Port_HDFS         String      Your HDFS port
Folder_HDFS       String      The folder to be selected in HDFS
File_HDFS         String      The name of the file to be selected from HDFS
Folder_Local      Directory   The local directory in which to store the files retrieved from HDFS
Input_File        File        The local file
Name_File         String      The name of the retrieved file
Name_File_Input   String      The name of the local file
User_HDFS         String      The HDFS user authentication name

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.
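
These context variables are exposed to your job as fields of the generated context object and are combined with plain Java string concatenation in component fields. A minimal sketch, assuming the variable names from the table above:

    // NameNode URI built from context variables (used in tHDFSConnection below)
    "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/"

    // Full HDFS path of the file to read or write
    context.Folder_HDFS + "/" + context.File_HDFS

    // Local destination of a file retrieved from HDFS
    context.Folder_Local + "/" + context.Name_File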

Without Kerberos

Read Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tHDFSInput, to extract data from an HDFS file so that other components can process it.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to read a file from HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tHDFSInput

    • tLogRow

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties (these map onto the HDFS client configuration used in the sketch after this procedure):

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSInput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, and add a flow variable.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File Name field, browse to, or enter the path pointing to the data to be used in the file system.

    7. From the Type list, select the type of the file to be processed.

    8. In the Row separator field, enter the separator used to identify the end of a row.

    9. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    10. In the Header field, you can set values to ignore the header of the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
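
Under the hood, this job boils down to reading from HDFS through the Hadoop client API. The following sketch is not the code Talend generates; it only illustrates how the tHDFSConnection settings (NameNode URI, username, Hadoop properties) and the tHDFSInput file path map onto org.apache.hadoop.fs.FileSystem. The nameservice id, NameNode hostnames, user, and file path are placeholder values.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Equivalent of the Hadoop properties entered in tHDFSConnection;
            // replace "nameservice1", "nn1", and "nn2" with your cluster values.
            conf.set("dfs.nameservices", "nameservice1");
            conf.set("dfs.ha.namenodes.nameservice1", "nn1,nn2");
            conf.set("dfs.client.failover.proxy.provider.nameservice1",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            conf.set("dfs.namenode.rpc-address.nameservice1.nn1", "nn1:8020");
            conf.set("dfs.namenode.rpc-address.nameservice1.nn2", "nn2:8020");
            conf.setBoolean("dfs.client.use.datanode.hostname", false); // Use datanode hostname disabled

            // NameNode URI and username, as in tHDFSConnection.
            URI nameNode = URI.create("hdfs://10.0.0.1:8020/");          // "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
            FileSystem fs = FileSystem.get(nameNode, conf, "hdfs_user"); // context.User_HDFS

            // File to read, as in the tHDFSInput File Name field.
            Path file = new Path("/data/example/input.csv");             // context.Folder_HDFS + "/" + context.File_HDFS
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // tLogRow simply prints each row
                }
            }
            fs.close();
        }
    }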

Write Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tFileInputDelimited, to read a delimited file row by row, split each row into fields, and send the fields, as defined in the schema, to the next component.

    • tHDFSOutput, to write data into a given HDFS.

  3. Link these components as follows:

    [Image: linking the components to write a file to HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tFileInputDelimited

    • tHDFSOutput

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties:

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tFileInputDelimited:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. In the File Name/Stream field, enter the name and path of the file to be processed.

    3. In the Row separator field, enter the separator used to identify the end of a row.

    4. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    5. You can select the CSV options option and define your options.

    6. In the Header field, enter the number of rows to be skipped at the beginning of the file.

    7. In the Footer field, enter the number of rows to be skipped at the end of the file.

    8. In the Limit field, enter the maximum number of rows to be processed.

    9. From the Schema list, select Built-In to create and store the schema locally for this component only.

    10. Select the Skip empty rows option.

    For more information, you can refer to Talend’s documentation on the tFileInputDelimited component.

    In the Basic settings tab of tHDFSOutput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, and add a flow variable as Input and Output.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File name field, browse to, or enter the location of the file to write data to; a rough equivalent of this write is sketched after this procedure.

      This file is created automatically if it does not exist.
    7. From the Type list, select the type of the file to be processed.

    8. From the Action list, select the action that you want to perform on the file.

    9. In the Row separator field, enter the separator used to identify the end of a row.

    10. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSOutput component.
  5. Run the job.
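
The write job follows the same pattern, this time creating a file in HDFS. A minimal sketch, assuming the same Configuration and FileSystem setup as in the read sketch above; the target path and sample rows are placeholders, and the "Create" action corresponds to fs.create:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        // 'fs' is assumed to be obtained exactly as in the read sketch above.
        static void writeRows(FileSystem fs) throws Exception {
            // Target file, as in the tHDFSOutput File name field; it is created if missing.
            Path target = new Path("/data/example/output.csv");
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(target, true), StandardCharsets.UTF_8))) {
                // One row per line, fields joined with the configured field separator (";" here).
                writer.write(String.join(";", "id", "name"));
                writer.newLine();
                writer.write(String.join(";", "1", "example"));
                writer.newLine();
            }
        }
    }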

List Files

  1. Create a new job in Talend.

  2. Add the following components:

    • tHDFSConnection, to establish an HDFS connection reusable by other HDFS components.

    • tHDFSList, to retrieve a list of files or folders from an HDFS directory.

    • tHDFSProperties, to create a single row flow that displays the properties of a file processed in HDFS.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to list files in HDFS with Talend]

  4. Double-click each component and configure their settings as follows:

    • tHDFSConnection

    • tHDFSList

    • tHDFSProperties

    • tLogRow

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. In the Username field, enter the HDFS user authentication name.

    6. Enter the following Hadoop properties:

      Property                                        Value
      "dfs.nameservices"                              "nameservice1"
      "dfs.ha.namenodes.cluster"                      "nn1,nn2"
      "dfs.client.failover.proxy.provider.cluster"    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      "dfs.namenode.rpc-address.cluster.nn1"          "nn1:8020"
      "dfs.namenode.rpc-address.cluster.nn2"          "nn2:8020"

    7. Select the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSList:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. In the HDFS directory field, browse to, or enter the path pointing to the data to be used in the file system.

    5. From the FileList Type list, select the type of input you want to iterate on.

    6. Select the Use Glob Expressions as Filemask option.

    7. In the Files field, add as many file masks as required (see the sketch after this procedure for an equivalent glob pattern).

    8. You can use the Order by and Order action features to sort your data.

    For more information, you can refer to Talend’s documentation on the tHDFSList component.

    In the Basic settings tab of tHDFSProperties:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    5. In the File field, browse to, or enter the path pointing to the data to be used in the file system.

    For more information, you can refer to Talend’s documentation on the tHDFSProperties component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
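
Listing relies on the same connection, plus a glob pattern. The sketch below is a rough equivalent of tHDFSList iterating on a filemask and tHDFSProperties reporting each file's properties; it assumes the FileSystem from the read sketch above, and the directory and pattern are placeholders:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListSketch {
        // 'fs' is assumed to be obtained as in the read sketch above.
        static void listFiles(FileSystem fs) throws Exception {
            // Glob filemask, as in the tHDFSList Files table.
            FileStatus[] matches = fs.globStatus(new Path("/data/example/*.csv"));
            if (matches == null) {
                return; // nothing matched the pattern
            }
            for (FileStatus status : matches) {
                // Roughly the kind of properties tHDFSProperties exposes for each file.
                System.out.println(status.getPath()
                        + " size=" + status.getLen()
                        + " modified=" + status.getModificationTime()
                        + " isDirectory=" + status.isDirectory());
            }
        }
    }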

With Kerberos

Before you begin:

To follow this procedure, you must have a Repository containing your Saagie platform information. To do so, create a context group and define the required contexts and variables.

In this case, define the following context variables:

Name                      Type     Value
Kerberos_Login            String   The Kerberos user authentication name
Kerberos_Pwd              String   The Kerberos user account password
Kerberos_Principal_Name   String   The Kerberos principal name
Uri_Path                  String   The URI used to store files in HDFS

Once your context group is created and stored in the Repository, you can apply the Repository context variables to your jobs.

For more information, see the section on using contexts and variables in the Talend Studio User Guide.

Read Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tHDFSInput, to extract data from an HDFS file so that other components can process it.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to read a file from HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tHDFSInput

    • tLogRow

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line. A rough Java equivalent of this command is sketched after this procedure.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSInput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Select the Use an existing connection option.

    4. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    5. In the File Name field, browse to, or enter the path pointing to the data to be used in the file system.

    6. From the Type list, select the type of the file to be processed.

    7. In the Row separator field, enter the separator used to identify the end of a row.

    8. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    9. In the Header field, you can set values to ignore the header of the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSInput component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
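
For context, the tSystem array command above simply pipes the password to kinit so that a Kerberos ticket is written to the default ticket cache before the HDFS components run. A rough Java equivalent of that command, with placeholder values standing in for context.Kerberos_Login and context.Kerberos_Pwd:

    public class KinitSketch {
        public static void main(String[] args) throws Exception {
            String login = "user@REALM.LOCAL"; // placeholder for context.Kerberos_Login
            String password = "secret";        // placeholder for context.Kerberos_Pwd

            // Same three-element array as in the tSystem Command field.
            Process kinit = new ProcessBuilder(
                    "/bin/bash",
                    "-c",
                    "echo '" + password + "' | kinit " + login)
                    .inheritIO()
                    .start();

            if (kinit.waitFor() != 0) {
                throw new IllegalStateException("kinit failed");
            }
        }
    }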

Write Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection to be reused by other HDFS components in your job.

    • tRowGenerator, to generate the data to be written.

    • tHDFSOutput, to write data into a given HDFS.

  3. Link these components as follows:

    [Image: linking the components to write a file to HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tRowGenerator

    • tHDFSOutput

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear. A rough client-side equivalent of this option is sketched after this procedure.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.
    For more information, you can refer to Talend’s documentation on the tRowGenerator component.

    In the Basic settings tab of tHDFSOutput:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Schema list, select Built-In to create and store the schema locally for this component only.

    3. Click Edit schema to make changes to the schema, if needed.

    4. Select the Use an existing connection option.

    5. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    6. In the File name field, browse to, or enter the location of the file to write data to.

      This file is created automatically if it does not exist.
    7. From the Type list, select the type of the file to be processed.

    8. From the Action list, select the action that you want to perform on the file.

    9. In the Row separator field, enter the separator used to identify the end of a row.

    10. In the Field separator field, you can enter a character, a string, or a regular expression to separate fields for the transferred data.

    For more information, you can refer to Talend’s documentation on the tHDFSOutput component.
  5. Run the job.
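
Once kinit has populated the ticket cache, the Use kerberos authentication option roughly corresponds to the client-side configuration below. This is a sketch only, not the code Talend generates; the NameNode URI and principal are placeholder values:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedHdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Enable Kerberos authentication on the client side.
            conf.set("hadoop.security.authentication", "kerberos");
            // NameNode principal, typically of the form "hdfs/_HOST@YOUR.REALM".
            conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@REALM.LOCAL");

            // With a valid ticket cache (created by kinit), UserGroupInformation
            // resolves the current Kerberos identity automatically.
            UserGroupInformation.setConfiguration(conf);

            FileSystem fs = FileSystem.get(URI.create("hdfs://10.0.0.1:8020/"), conf); // placeholder NameNode URI
            System.out.println("Connected as " + UserGroupInformation.getCurrentUser());
            fs.close();
        }
    }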

List Files With a Kerberized Cluster

  1. Create a new job in Talend.

  2. Add the following components:

    • tSystem, to establish the Kerberos connection.

    • tHDFSConnection, to establish an HDFS connection reusable by other HDFS components.

    • tHDFSList, to retrieve a list of files or folders from an HDFS directory.

    • tRowGenerator, to generate data.

    • tLogRow, to display the result.

  3. Link these components as follows:

    [Image: linking the components to list files in HDFS with Kerberos with Talend]

  4. Double-click each component and configure their settings as follows:

    • tSystem

    • tHDFSConnection

    • tHDFSList

    • tRowGenerator

    • tLogRow

    In the Basic settings tab of tSystem:

    1. Select the Use Array Command option.
      It will activate its Command field.

    2. In this Command field, enter the following system command as an array, one parameter per line.

      "/bin/bash"
      "-c"
      "echo '"+context.Kerberos_Pwd+"' | kinit "+context.Kerberos_Login
    3. From the Standard Output and Error Output list, select the type of output for the processed data to be transferred to.

    4. From the Schema list, select Built-In to create and store the schema locally for this component only.

    For more information, you can refer to Talend’s documentation on the tSystem component.

    In the Basic settings tab of tHDFSConnection:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. From the Distribution list, select the Cloudera cluster.

    3. From the Version list, select the latest version of Cloudera.

    4. In the NameNode URI field, enter the URI of the Hadoop NameNode, the master node of a Hadoop system. It must respect the following pattern: hdfs://ip_hdfs:port_hdfs. Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"

    5. Select the Use kerberos authentication option, and then enter the relevant parameters in the fields that appear.

    6. Disable the Use datanode hostname option.

    For more information, you can refer to Talend’s documentation on the tHDFSConnection component.

    In the Basic settings tab of tHDFSList:

    1. From the Property type list, select Built-in so that no property data is stored centrally.

    2. Select the Use an existing connection option.

    3. From the Component List list, select the connection component tHDFSConnection to reuse the connection details already defined.

    4. In the HDFS directory field, browse to, or enter the path pointing to the data to be used in the file system.

    5. From the FileList Type list, select the type of input you want to iterate on.

    6. Select the Use Glob Expressions as Filemask option.

    7. In the Files field, add as many file masks as required.

    8. You can use the Order by and Order action features to sort your data.

    For more information, you can refer to Talend’s documentation on the tHDFSList component.

    In the tRowGenerator editor, define the structure of the data to be generated as follows:

    1. Add a column by clicking the plus (+) button.

    2. Define the nature of the data in the Type column by selecting a value from the list.

    3. Set the environment variable.

    4. In the Number of Rows for RowGenerator field, enter the number of rows to generate.

    For more information, you can refer to Talend’s documentation on the tRowGenerator component.
    For more information, you can refer to Talend’s documentation on the tLogRow component.
  5. Run the job.
