Read and Write Tables From Hive With Python

This article explains how to read and write tables from Hive with Python. There are two options to query Hive with Python: Impyla and Ibis.

Impyla is a Python client for HiveServer2 implementations of distributed query engines, such as Impala and Hive.

Ibis provides higher-level functionality for Hive and Impala, including a pandas-like interface to distributed data sets.

If you cannot connect directly to HDFS via WebHDFS, Ibis will not allow you to write data to Hive (read only). In this case, use Impyla, which does not require any HDFS connection to read and write data with Hive.

Read and Write Tables From Hive With Python Using Impyla

  1. Install and import the following packages:

    from impala.dbapi import connect
    from impala.util import as_pandas
    import pandas as pd
    import os
  2. Connect to Hive by running the following lines of code:

    # To connect to Hive by providing the IP and port of the Hive host.
    conn = connect(host=os.environ['IP_HIVE'], port=10000,
                   user=os.environ['USER'], password=os.environ['PASSWORD'],
                   auth_mechanism='PLAIN')
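    • To quickly check the connection, you can run a simple statement such as SHOW DATABASES. The following is a minimal sketch reusing the conn object created above:

      # To verify the connection by listing the available databases.
      cursor = conn.cursor()
      cursor.execute('SHOW DATABASES')
      print(cursor.fetchall())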
  3. You can now read and write tables from Hive by running the following lines of code:

    # ====== To Read Tables ====== #
    # To select data with a SQL query and retrieve it in a pandas DataFrame.
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
    df = as_pandas(cursor)
    print(df.head())
    # ====== To Write Tables ====== #
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE default.helloworld (hello STRING, world STRING)')
    cursor.execute("insert into default.helloworld values ('hello1','world1')")

Read and Write Tables From Hive With Python Using Ibis

  1. Install and import the following packages:

    import ibis
    import pandas as pd
    import os
  2. Connect to Hive by running the following lines of code:

    # To connect to Hive by providing the IP and port of the Hive host
    # and a WebHDFS client
    
    hdfs = ibis.impala.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
    client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000,
                                      hdfs_client=hdfs, user=os.environ['USER'],
                                      password=os.environ['PASSWORD'],
                                      auth_mechanism='PLAIN')
    • Providing an HDFS connection is optional and is only required to write data to Hive.

    • The default Hive port is 10000.
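
    • To check that the connection works, you can list the tables of a database. The following is a minimal sketch, assuming the ibis.impala client API used above and the default database:

      # To verify the connection by listing the tables of the default database.
      print(client_hive.list_tables(database='default'))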

  3. You can now read and write tables from Hive by running the following lines of code:

    # ====== To Read Tables ====== #
    # To select data with a SQL query.
    # limit=None: to retrieve the whole table; otherwise only the first 10,000 rows are returned.
    requete = client_hive.sql('select * from helloworld')
    df = requete.execute(limit=None)
    # ====== To Write Tables ====== #
    # To create a simple pandas DataFrame with two columns.
    liste_hello = ['hello1','hello2']
    liste_world = ['world1','world2']
    df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})
    
    # To write the DataFrame to Hive if the table does not exist.
    db = client_hive.database('default')
    if not client_hive.exists_table('default.helloworld'):
      db.create_table('helloworld', df)
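    • If the table already exists, you can append the DataFrame to it instead. The following is a minimal sketch, assuming that the ibis.impala table objects used above expose an insert() method (which appends rows unless overwrite=True is passed):

      # To append the DataFrame to the table if it already exists.
      if client_hive.exists_table('default.helloworld'):
          client_hive.table('helloworld', database='default').insert(df)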

Read and Write Tables From Hive With a Kerberized Cluster

Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.

  1. To get a valid ticket with a kinit command, use one of the following options:

    • Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

      kinit myusername
    • Run the following bash command in your Python job on Saagie, directly in the command line:

      echo $MY_USER_PASSWORD | kinit myusername
      python {file} arg1 arg2
    • Add the following lines of code directly into your Python code:

      import os
      import subprocess
      
      # To pipe the password to kinit without typing it interactively.
      password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
      subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
  2. Connect to your Kerberized cluster with HDFS by running the following lines of code:

    hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'], port=50470,
                             use_https=True, verify=False, auth_mechanism='GSSAPI')
    client_hive = ibis.impala.connect(host=os.environ['HIVE_HOSTNAME'], port=10000,
                                      hdfs_client=hdfs, auth_mechanism='GSSAPI',
                                      use_ssl=False, kerberos_service_name='hive')
  3. You can now read and write tables from Hive.
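
    • Reading then works exactly as in the non-Kerberized Ibis example above. The following is a minimal sketch, reusing the helloworld table from the previous sections:

      # To read a table over the Kerberized connection (same API as above).
      requete = client_hive.sql('select * from default.helloworld')
      df = requete.execute(limit=None)
      print(df.head())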