Read and Write Files From HDFS With Python

How to read and write files from HDFS with Python.

  1. Install the hdfs package (for example, with pip install hdfs), then import the required packages by running the following lines of code:

    import pandas as pd
    from hdfs import InsecureClient
    import os

  2. Connect to HDFS by running the following lines of code:

    # Connect to WebHDFS using the IP address of the HDFS host and the WebHDFS port.
    client_hdfs = InsecureClient('http://hdfs_ip:50070', user='my_user')

    The HDFS connection URL format must be http://hdfs_ip:hdfs_port where:

    • hdfs_ip is the IP address of the HDFS host on your platform.

    • hdfs_port is the WebHDFS port. The default is 50070 on Hadoop 2.x; Hadoop 3.x uses 9870 by default.

    We recommend that you specify the user with user='my_user' when connecting to HDFS.

  3. You can now read and write files from HDFS by running the following lines of code:

    • Read Files

      # Read a CSV file from HDFS into a pandas DataFrame.
      with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
          df = pd.read_csv(reader, index_col=0)

    • Write Files

      # Create a simple pandas DataFrame.
      liste_hello = ['hello1', 'hello2']
      liste_world = ['world1', 'world2']
      df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

      # Write the DataFrame to HDFS as a CSV file.
      # By default, write() fails if the file already exists; pass overwrite=True to replace it.
      with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
          df.to_csv(writer)
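
The three steps above can be condensed into a few small helpers. The following is only a sketch under stated assumptions: the helper names (webhdfs_url, save_text, load_text), the HDFS_IP environment variable, and the hello.txt path are hypothetical and do not come from the hdfs package or from this page. It relies on the read() and write() methods of the hdfs client, including the data and overwrite parameters of write().

```python
def webhdfs_url(host, port=50070, scheme="http"):
    """Build a WebHDFS URL of the form scheme://host:port."""
    return f"{scheme}://{host}:{port}"


def save_text(client, hdfs_path, text, overwrite=True):
    """Write a string to HDFS; overwrite=True replaces an existing file."""
    client.write(hdfs_path, data=text, overwrite=overwrite, encoding="utf-8")


def load_text(client, hdfs_path):
    """Read a whole HDFS file back as a string."""
    with client.read(hdfs_path, encoding="utf-8") as reader:
        return reader.read()


# Usage against a real cluster (hypothetical values, not executed here):
# import os
# from hdfs import InsecureClient
# client_hdfs = InsecureClient(webhdfs_url(os.environ["HDFS_IP"]), user="my_user")
# save_text(client_hdfs, "/user/hdfs/wiki/hello.txt", "hello world\n")
# print(load_text(client_hdfs, "/user/hdfs/wiki/hello.txt"))
```

Passing the client as an argument keeps the helpers independent of how the connection was created, so the same functions should also work with a Kerberos client.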

Read and Write Files From HDFS With a Kerberized Cluster

Before connecting to HDFS with a kerberized cluster, you must get a valid ticket by running a kinit command.

  1. To get a valid ticket with a kinit command, you can either:

    • Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

      kinit myusername

    • Run the following bash commands in the command line of your Python job on Saagie:

      echo "$MY_USER_PASSWORD" | kinit myusername
      python {file} arg1 arg2

    • Add the following lines of code directly into your Python code:

      import os
      import subprocess

      # Pipe the password to kinit on stdin, as `echo "$MY_USER_PASSWORD" | kinit` would.
      # check=True raises an exception if kinit fails.
      subprocess.run(
          ['kinit', os.environ['MY_USER_LOGIN']],
          input=(os.environ['MY_USER_PASSWORD'] + '\n').encode(),
          check=True,
      )
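
  2. Connect to HDFS on your kerberized cluster by running the following lines of code:

    import os
    import pandas as pd
    from hdfs.ext.kerberos import KerberosClient
    import requests

    session = requests.Session()
    # Disables TLS certificate verification. Only use this when the cluster certificate cannot be validated.
    session.verify = False
    # 50470 is the WebHDFS port over HTTPS used in this example.
    client = KerberosClient('https://' + os.environ['HDFS_HOSTNAME'] + ':50470', mutual_auth='REQUIRED', session=session)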

  3. You can now read and write files from HDFS with client, using the same read and write calls as in the previous section.
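
Because every option in step 1 depends on kinit having succeeded, it can help to confirm that a ticket is present before opening the connection. The following is a minimal sketch, assuming the Kerberos klist tool is on the PATH (its -s flag runs silently and reports the result through the exit status). has_valid_ticket is a hypothetical helper name and is not part of the hdfs package.

```python
import subprocess


def has_valid_ticket(klist_cmd="klist"):
    """Return True if `klist -s` exits with status 0 (readable, unexpired ticket cache)."""
    try:
        result = subprocess.run(
            [klist_cmd, "-s"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The Kerberos client tools are not installed or not on the PATH.
        return False
    return result.returncode == 0


# Usage with the client from step 2 (not executed here):
# if has_valid_ticket():
#     with client.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
#         df = pd.read_csv(reader, index_col=0)
# else:
#     raise RuntimeError('No valid Kerberos ticket; run kinit first.')
```

Checking the ticket first turns an authentication problem into an explicit error message instead of a failed HTTP request later on.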