Read and Write Tables From Hive With Python

This article explains how to read and write tables from Hive with Python. There are two options for querying Hive with Python: Impyla and Ibis.

Impyla is a Python client for HiveServer2 implementations of distributed query engines, such as Impala and Hive.

Ibis provides higher-level functionality for Hive and Impala, including a pandas-like interface to distributed data sets.

If you cannot connect directly to HDFS via WebHDFS, Ibis will not allow you to write data to Hive (it becomes read-only). In this case, use Impyla, which does not require any HDFS connection to read and write data with Hive.

Read and Write Tables From Hive with Python Using Impyla

  1. Import the following packages:

    from impala.dbapi import connect
    from impala.util import as_pandas
    import pandas as pd
    import os
  2. Connect to Hive by running the following lines of code:

    # To connect to Hive by providing the IP and port of the Hive host.
    conn = connect(host=os.environ['IP_HIVE'], port=10000, user=os.environ['USER'],
     password=os.environ['PASSWORD'], auth_mechanism='PLAIN')
  3. You can now write and query Hive tables by running the following lines of code:

    • Write Tables in Hive

      # To create a table and insert a sample row.
      cursor = conn.cursor()
      cursor.execute('CREATE TABLE default.helloworld (hello STRING, world STRING)')
      cursor.execute("INSERT INTO default.helloworld VALUES ('hello1', 'world1')")

    • Query Tables From Hive

      # To select data with a SQL query and retrieve it in a pandas DataFrame.
      cursor = conn.cursor()
      cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
      df = as_pandas(cursor)
      print(df.head())
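
    When you are done, you can release the connection resources. A minimal sketch using the standard DB API methods exposed by Impyla:

      # Close the cursor and the connection once all queries have completed.
      cursor.close()
      conn.close()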

Read and Write Tables From Hive with Python Using Ibis

  1. Import the following packages:

    import ibis
    import pandas as pd
    import os
  2. Connect to Hive by running the following lines of code:

    # To connect to Hive by providing the IP and port of the Hive host
    # and a WebHDFS client
    
    hdfs = ibis.impala.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
    client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000, \
      hdfs_client=hdfs, user=os.environ['USER'], password=os.environ['PASSWORD'], \
      auth_mechanism='PLAIN')
    • Providing an HDFS connection is optional and is only required to write data to Hive (a read-only connection is sketched after these notes).

    • The default port is 10000.
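
    If WebHDFS is not reachable, you can still open a read-only connection by omitting the HDFS client. A minimal sketch reusing the same environment variables:

      # Read-only connection: without an HDFS client, Ibis cannot write data to Hive.
      client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000, \
        user=os.environ['USER'], password=os.environ['PASSWORD'], \
        auth_mechanism='PLAIN')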

  3. You can now write and query Hive tables by running the following lines of code:

    • Write Tables in Hive

      # To create a simple pandas DataFrame with two columns.
      liste_hello = ['hello1', 'hello2']
      liste_world = ['world1', 'world2']
      df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

      # To write the DataFrame to Hive if the table does not already exist.
      db = client_hive.database('default')
      if not client_hive.exists_table('default.helloworld'):
          db.create_table('helloworld', df)

    • Query Tables From Hive

      # To select data with a SQL query.
      # limit=None returns the whole table; otherwise only the first 10,000 rows are returned.
      requete = client_hive.sql('select * from helloworld')
      df = requete.execute(limit=None)
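
    Because Ibis exposes a pandas-like expression API, you can also query the table without writing SQL. A minimal sketch, assuming the helloworld table created above (the exact expression methods depend on your Ibis version):

      # Build a lazy table expression; rows are only fetched when execute() is called.
      table = client_hive.table('helloworld', database='default')
      df = table[table.hello == 'hello1'].execute()
      print(df.head())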

Read and Write Tables From Hive With a Kerberized Cluster

Before connecting to HDFS on a Kerberized cluster, you must obtain a valid ticket by running a kinit command.

  1. To get a valid ticket with a kinit command, you can either:

    • Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

      kinit myusername
    • Run the following bash command in your Python job on Saagie, directly in the command line:

      echo $MY_USER_PASSWORD | kinit myusername
      python {file} arg1 arg2
    • Add the following lines of code directly into your Python code:

      import os
      import subprocess
      
      # Pipe the password into kinit to obtain a Kerberos ticket non-interactively.
      password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
      subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
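
    To check that a ticket was actually obtained, you can list the Kerberos credential cache. A minimal sketch (klist typically returns a non-zero exit code when no valid ticket is found):

      import subprocess

      # Verify that kinit succeeded before opening any HDFS or Hive connection.
      if subprocess.call(('klist',)) != 0:
          raise RuntimeError('No valid Kerberos ticket found')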
  2. Connect to your kerberized cluster with HDFS by running the following lines of code:

    # Connect to HDFS over WebHDFS (HTTPS) and to Hive using Kerberos (GSSAPI) authentication.
    hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'], port=50470, use_https=True, \
      verify=False, auth_mechanism='GSSAPI')
    client_hive = ibis.impala.connect(host=os.environ['HIVE_HOSTNAME'], port=10000, hdfs_client=hdfs, \
      auth_mechanism='GSSAPI', use_ssl=False, kerberos_service_name='hive')
  3. You can now write and query Hive tables.
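
    For example, you can reuse the same read pattern as in the previous section. A minimal sketch, assuming the default.helloworld table created earlier exists on the cluster:

      # Query an existing table over the Kerberized connection.
      requete = client_hive.sql('select * from default.helloworld')
      df = requete.execute(limit=None)
      print(df.head())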