Read and Write Files From HDFS With Python

How to read files from and write files to HDFS with Python.

Without a Kerberized Cluster

  1. Install the hdfs package, then import the required packages by running the following lines of code:

    import pandas as pd
    from hdfs import InsecureClient
    import os
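
    The hdfs package is available from PyPI. If it is not already installed in your environment, you can typically install it with pip; the second command adds the optional Kerberos support used later in this article:

    pip install hdfs
    pip install hdfs[kerberos]
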
  2. Connect to HDFS by running the following lines of code:

    # To connect to WebHDFS by providing the IP of the HDFS host and the WebHDFS port.
    client_hdfs = InsecureClient('http://hdfs_ip:50070', user='my_user')

    The HDFS connection URL format must be http://hdfs_ip:hdfs_port where:

    • Replace hdfs_ip with the HDFS IP address of your platform.

    • The WebHDFS default port is 50070.

    We recommend that you specify the user with user='my_user' when connecting to HDFS.
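
    Once connected, you can check that the client works by listing a directory. The list and status methods are part of the hdfs package's client API; the path below is only an example:

    # List the contents of a directory your user can read.
    print(client_hdfs.list('/user/my_user'))

    # Retrieve metadata (type, size, owner, and so on) for a path.
    print(client_hdfs.status('/user/my_user'))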

  3. You can now read and write files from HDFS by running the following lines of code:

    # ==== To read a file.
    with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
      df = pd.read_csv(reader, index_col=0)

    # ==== To write a file.
    # To create a simple pandas DataFrame.
    liste_hello = ['hello1', 'hello2']
    liste_world = ['world1', 'world2']
    df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

    # To write the DataFrame to HDFS.
    with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
      df.to_csv(writer)
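
    Besides streaming reads and writes, the hdfs package can also transfer whole files between the local file system and HDFS with its download and upload methods; the paths below are only examples:

    # Download an HDFS file to the local file system.
    client_hdfs.download('/user/hdfs/wiki/helloworld.csv', '/tmp/helloworld.csv', overwrite=True)

    # Upload a local file to HDFS (overwrite is forwarded to the underlying write call).
    client_hdfs.upload('/user/hdfs/wiki/helloworld_upload.csv', '/tmp/helloworld.csv', overwrite=True)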

With a Kerberized Cluster

Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.

  1. To get a valid ticket with a kinit command, use one of the following options:

    • Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

      kinit myusername
    • Run the following bash command in your Python job on Saagie, directly in the command line:

      echo $MY_USER_PASSWORD | kinit myusername
      python {file} arg1 arg2
    • Add the following lines of code directly into your Python code:

      import os
      import subprocess
      
      # Pipe the password from an environment variable to kinit to obtain a Kerberos ticket.
      password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
      subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
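
    Piping a password through echo makes it visible to other processes on the machine. If your administrator can provide a keytab file, kinit can authenticate non-interactively with its standard -kt option instead; the keytab path and principal below are placeholders:

      kinit -kt /path/to/my_user.keytab my_user@MY.REALM
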
  2. Connect to HDFS on your Kerberized cluster by running the following lines of code:

    import os
    import pandas as pd
    from hdfs.ext.kerberos import KerberosClient
    import requests

    # Disable TLS certificate verification, for example with self-signed certificates.
    session = requests.Session()
    session.verify = False

    # The default port for WebHDFS over HTTPS is 50470.
    client_hdfs = KerberosClient('https://' + os.environ['HDFS_HOSTNAME'] + ':50470', mutual_auth="REQUIRED", session=session)
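
    Setting session.verify to False disables TLS certificate verification and should be limited to testing or self-signed certificates. If your platform's CA certificate is available, you can point the session at it instead; the bundle path below is a placeholder:

    # Verify the cluster certificate against your platform's CA bundle (placeholder path).
    session = requests.Session()
    session.verify = '/path/to/ca_bundle.pem'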
  3. You can now read and write files from HDFS.

    # ==== To read a file.
    with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
      df = pd.read_csv(reader, index_col=0)

    # ==== To write a file.
    # To create a simple pandas DataFrame.
    liste_hello = ['hello1', 'hello2']
    liste_world = ['world1', 'world2']
    df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

    # To write the DataFrame to HDFS.
    with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
      df.to_csv(writer)
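
    Note that client_hdfs.write raises an error if the target file already exists, so running the example twice will fail. The write method of the hdfs package accepts an overwrite flag for this case:

    # Overwrite the target file if it already exists.
    with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8', overwrite=True) as writer:
      df.to_csv(writer)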