Read and Write Files From HDFS With Python
Without a Kerberized Cluster
- Install the hdfs package (for example, with pip install hdfs), then import the required packages by running the following lines of code:

import pandas as pd
from hdfs import InsecureClient
import os
- Connect to HDFS by running the following lines of code:

# Connect to WebHDFS by providing the IP of the HDFS host and the WebHDFS port.
client_hdfs = InsecureClient('http://hdfs_ip:50070', user='my_user')
The HDFS connection URL format must be http://hdfs_ip:hdfs_port, where:
- hdfs_ip must be replaced with the HDFS_IP of your platform.
- hdfs_port is the WebHDFS port, which defaults to 50070.
We recommend that you specify the user with user='my_user' when connecting to HDFS.
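If your platform exposes the NameNode address as an environment variable (HDFS_IP is an assumed variable name here; adjust it to whatever your platform provides), the connection URL can be assembled in code. A minimal sketch:

import os
from hdfs import InsecureClient

# Assumes HDFS_IP is set in the environment; 50070 is the WebHDFS default port.
hdfs_url = 'http://{}:50070'.format(os.environ['HDFS_IP'])
client_hdfs = InsecureClient(hdfs_url, user='my_user')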
- You can now read and write files from HDFS by running the following lines of code:
# ==== Read a file ====
with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
# ==== Write a file ====
# Create a simple pandas DataFrame.
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Write the DataFrame to HDFS.
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
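Beyond reading and writing, the hdfs client also offers helpers for inspecting paths, which can be useful before attempting a read. A short sketch (the /user/hdfs/wiki path is carried over from the examples above):

# List the contents of an HDFS directory.
files = client_hdfs.list('/user/hdfs/wiki')
print(files)

# Get metadata for a file; strict=False returns None instead of raising if the path is missing.
status = client_hdfs.status('/user/hdfs/wiki/helloworld.csv', strict=False)
if status is not None:
    print('File size in bytes:', status['length'])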
With a Kerberized Cluster
Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.
- To get a valid ticket with a kinit command, you can either:
  - Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

    kinit myusername

  - Run the following bash command in your Python job on Saagie, directly in the command line:

    echo $MY_USER_PASSWORD | kinit myusername
    python {file} arg1 arg2
  - Add the following lines of code directly into your Python code:

    import os
    import subprocess

    # Pipe the password from the environment into kinit to obtain the ticket.
    password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
    subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
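Equivalently, on Python 3.7+ the same ticket can be obtained without spawning a separate echo process, by feeding the password to kinit's standard input with subprocess.run. A sketch, assuming the same MY_USER_PASSWORD and MY_USER_LOGIN environment variables:

import os
import subprocess

# Feed the password directly to kinit's stdin; check=True raises if kinit fails.
subprocess.run(
    ['kinit', os.environ['MY_USER_LOGIN']],
    input=os.environ['MY_USER_PASSWORD'],
    text=True,
    check=True,
)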
- Connect to your Kerberized cluster with HDFS by running the following lines of code:
import os
import pandas as pd
import requests
from hdfs.ext.kerberos import KerberosClient

# Disable TLS certificate verification for the session (only do this if your cluster uses self-signed certificates).
session = requests.Session()
session.verify = False

# 50470 is the default HTTPS port for WebHDFS.
client_hdfs = KerberosClient('https://' + os.environ['HDFS_HOSTNAME'] + ':50470', mutual_auth="REQUIRED", session=session)
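To confirm that the Kerberos handshake succeeds before going further, you can request the status of the HDFS root. This quick check is not part of the original procedure, just a convenient smoke test:

# Raises an error if authentication or the connection fails.
print(client_hdfs.status('/'))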
- You can now read and write files from HDFS by running the following lines of code:
# ==== Read a file ====
with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
# ==== Write a file ====
# Create a simple pandas DataFrame.
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Write the DataFrame to HDFS.
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
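For whole-file transfers, the hdfs library also provides download and upload helpers, which work with both InsecureClient and KerberosClient. A minimal sketch (the local paths are illustrative):

# Download an HDFS file to the local filesystem.
client_hdfs.download('/user/hdfs/wiki/helloworld.csv', '/tmp/helloworld.csv', overwrite=True)

# Upload a local file back to HDFS.
client_hdfs.upload('/user/hdfs/wiki/helloworld_copy.csv', '/tmp/helloworld.csv', overwrite=True)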