Read and Write Files From HDFS With Python
Without a Kerberized Cluster
- Install the hdfs package and import the other packages by running the following lines of code:

```python
import pandas as pd
from hdfs import InsecureClient
import os
```
- Connect to HDFS by running the following lines of code:

```python
# Connect to WebHDFS by providing the IP of the HDFS host and the WebHDFS port.
client_hdfs = InsecureClient('http://hdfs_ip:50070', user='my_user')
```

The HDFS connection URL format must be http://hdfs_ip:hdfs_port, where:
- The IP address must be replaced with the HDFS_IP of your platform.
- The WebHDFS default port is 50070.

We recommend that you specify the user with user='my_user' when connecting to HDFS.
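Once the client is created, you can quickly check that the connection works by querying the NameNode, for example by listing a directory. The following snippet is a minimal sketch; the paths used are assumptions and should be adapted to your cluster.

```python
# Minimal connectivity check (paths are placeholders; adapt them to your cluster).
print(client_hdfs.status('/'))            # Metadata of the HDFS root directory.
print(client_hdfs.list('/user/my_user'))  # Names of the entries under an example user directory.
```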
- You can now read and write files from HDFS by running the following lines of code:

```python
# ==== To read a file.
with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)

# ==== To write a file.
# Create a simple pandas DataFrame.
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Write the DataFrame to HDFS.
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
```
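Besides streaming data through read and write, the client can also transfer whole files between the local file system and HDFS. The sketch below uses hypothetical local and HDFS paths that are not part of the original example; adapt them to your environment.

```python
# Copy a local file to HDFS, then fetch a file from HDFS to the local disk.
# The paths are placeholders, not part of the original example.
client_hdfs.upload('/user/hdfs/wiki/helloworld_upload.csv', 'helloworld_local.csv')
client_hdfs.download('/user/hdfs/wiki/helloworld.csv', 'helloworld_download.csv', overwrite=True)
```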
With a Kerberized Cluster
Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.
- To get a valid ticket with a kinit command, you can either:
  - Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

```bash
kinit myusername
```

  - Run the following bash command in your Python job on Saagie, directly in the command line:

```bash
echo $MY_USER_PASSWORD | kinit myusername
python {file} arg1 arg2
```

  - Add the following lines of code directly into your Python code:

```python
import os
import subprocess

# Pipe the password from the environment into kinit to obtain a Kerberos ticket.
password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
```
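Whichever option you choose, you can confirm that a ticket was actually obtained before opening the HDFS connection. The snippet below is a minimal sketch of such a check based on the standard klist command.

```python
import subprocess

# List the current Kerberos tickets; klist usually exits with a non-zero code
# when no valid ticket is found in the credential cache.
subprocess.call(['klist'])
```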
- Connect to your Kerberized cluster with HDFS by running the following lines of code:

```python
import os

import pandas as pd
import requests
from hdfs.ext.kerberos import KerberosClient

# Open a session that skips TLS certificate verification, then authenticate with Kerberos.
session = requests.Session()
session.verify = False
client_hdfs = KerberosClient('https://' + os.environ['HDFS_HOSTNAME'] + ':50470', mutual_auth="REQUIRED", session=session)
```
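Setting session.verify = False disables TLS certificate verification, which is convenient for testing but weakens security. If the cluster's CA certificate is available to you, you can point the session at it instead; the certificate path below is an assumption, not something provided by the platform.

```python
# Verify the NameNode's TLS certificate against a CA bundle instead of disabling verification.
# The path is a placeholder; replace it with the CA certificate of your cluster.
session.verify = '/path/to/cluster_ca_bundle.pem'
```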
- You can now read and write files from HDFS by running the following lines of code:

```python
# ==== To read a file.
with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)

# ==== To write a file.
# Create a simple pandas DataFrame.
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Write the DataFrame to HDFS.
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
```
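Note that writing to a path that already exists typically raises an error with the hdfs client's default settings. If you rerun the write example above, you will likely need to allow overwriting, as sketched below with the client's overwrite flag.

```python
# Overwrite the existing file instead of failing because the path already exists.
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer)
```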