Read and Write Tables From Hive With Python
Impyla is a Python client for HiveServer2 implementations, such as Impala and Hive, which are distributed query engines. Ibis provides higher-level functionality for Hive and Impala, including a pandas-like interface to distributed data sets. If you cannot connect directly to HDFS through WebHDFS, Ibis only lets you read data from Hive, not write it. In that case, use Impyla, which reads and writes data with Hive without requiring any HDFS connection.
Read and Write Tables From Hive With Python Using Impyla
- Import the following packages:
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import os
- Connect to Hive by running the following lines of code:
# To connect to Hive by providing the IP and port of the Hive host.
conn = connect(host=os.environ['IP_HIVE'], port=10000,
               user=os.environ['USER'], password=os.environ['PASSWORD'],
               auth_mechanism='PLAIN')
- You can now read and write tables from Hive by running the following lines of code:
# ====== To Read Tables ====== #
# To select data with a SQL query and retrieve it in a pandas DataFrame.
cursor = conn.cursor()
cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
df = as_pandas(cursor)
print(df.head())
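Impyla also supports parameterized queries through DB API placeholders, which avoids building SQL strings by hand. The following is only a sketch against the same default.helloworld table; the filter value and limit are arbitrary examples:
# To filter with query parameters instead of hard-coded values.
cursor = conn.cursor()
cursor.execute('SELECT * FROM default.helloworld WHERE hello = %s LIMIT %s',
               ('hello1', 10))
df = as_pandas(cursor)
print(df.head())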
# ====== To Write Tables ====== #
cursor = conn.cursor()
cursor.execute('CREATE TABLE default.helloworld (hello STRING, world STRING)')
cursor.execute("INSERT INTO default.helloworld VALUES ('hello1', 'world1')")
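To insert several rows at once, for example the contents of a pandas DataFrame, you can rely on the DB API's executemany method. This is a minimal sketch; the df_out DataFrame is a hypothetical example with the same two columns as the table created above:
# ====== To Write Several Rows ====== #
# A small example DataFrame with the same two columns as the table.
df_out = pd.DataFrame(data={'hello': ['hello2', 'hello3'],
                            'world': ['world2', 'world3']})
cursor = conn.cursor()
# One parameterized INSERT per DataFrame row.
cursor.executemany("INSERT INTO default.helloworld VALUES (%s, %s)",
                   [tuple(row) for row in df_out.itertuples(index=False)])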
Read and Write Tables From Hive With Python Using Ibis
- Import the following packages:
import ibis
import pandas as pd
import os
- Connect to Hive by running the following lines of code:
# To connect to Hive by providing the IP and port of the Hive host
# and a WebHDFS client.
hdfs = ibis.impala.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000,
                                  hdfs_client=hdfs, user=os.environ['USER'],
                                  password=os.environ['PASSWORD'],
                                  auth_mechanism='PLAIN')
- Providing an HDFS connection is optional; it is only required to write data to Hive (see the read-only sketch after these notes).
- The default port is 10000.
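If WebHDFS is not reachable, you can still open a read-only connection by leaving out the hdfs_client argument. A minimal sketch, reusing the environment variables from the step above:
# Read-only connection: no WebHDFS client is provided.
client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000,
                                  user=os.environ['USER'],
                                  password=os.environ['PASSWORD'],
                                  auth_mechanism='PLAIN')
# Quick sanity check: list the tables of the default database.
print(client_hive.list_tables(database='default'))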
- You can now read and write tables from Hive by running the following lines of code:
# ====== To Read Tables ====== #
# To select data with a SQL query.
# limit=None: to get the whole table; otherwise only the first 10,000 rows are returned.
requete = client_hive.sql('select * from helloworld')
df = requete.execute(limit=None)
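Instead of raw SQL, the same kind of query can be written with Ibis table expressions, which Ibis compiles to SQL for you. A short sketch on the same helloworld table; the filter value is only an example:
# To build the query with a table expression instead of raw SQL.
t = client_hive.table('helloworld', database='default')
expr = t.filter(t.hello == 'hello1').limit(100)
df = expr.execute()
print(df.head())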
# ====== To Write Tables ====== #
# To create a simple pandas DataFrame with two columns.
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# To write the DataFrame to Hive if the table does not already exist.
db = client_hive.database('default')
if not client_hive.exists_table('default.helloworld'):
    db.create_table('helloworld', df)
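To append rows to a table that already exists, the Impala backend of Ibis exposes an insert method on its table objects; the sketch below assumes your Ibis version provides ImpalaTable.insert, so check your version's documentation before relying on it:
# To append the DataFrame to the existing table (assumes ImpalaTable.insert is available).
if client_hive.exists_table('default.helloworld'):
    table = client_hive.table('helloworld', database='default')
    table.insert(df, overwrite=False)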
Read and Write Tables From Hive With a Kerberized Cluster
Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.
- To get a valid ticket with a kinit command, you can either:
  - Run the following bash command in a terminal in Jupyter, which will prompt you for your password:
kinit myusername
  - Run the following bash command in your Python job on Saagie, directly in the command line:
echo $MY_USER_PASSWORD | kinit myusername
python {file} arg1 arg2
  - Add the following lines of code directly into your Python code:
import os
import subprocess

password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']),
                            stdout=subprocess.PIPE)
subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
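On Python 3.5 or later, a slightly more direct variant is to feed the password to kinit's standard input with subprocess.run; this is only a sketch using the same environment variables as above:
import os
import subprocess

# Feed the password to kinit on stdin, equivalent to the echo | kinit pipeline.
subprocess.run(['kinit', os.environ['MY_USER_LOGIN']],
               input=(os.environ['MY_USER_PASSWORD'] + '\n').encode(),
               check=True)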
- Connect to your Kerberized cluster with HDFS by running the following lines of code:
hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'], port=50470,
                         use_https=True, verify=False,
                         auth_mechanism='GSSAPI')
client_hive = ibis.impala.connect(host=os.environ['HIVE_HOSTNAME'], port=10000,
                                  hdfs_client=hdfs, auth_mechanism="GSSAPI",
                                  use_ssl=False,
                                  kerberos_service_name="hive")
- You can now read and write tables from Hive.
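For example, reading a table through the Kerberized connection works the same way as in the previous section; a minimal sketch, assuming the default.helloworld table used above exists on the cluster:
# To read a table through the Kerberized connection.
requete = client_hive.sql('select * from default.helloworld')
df = requete.execute(limit=None)
print(df.head())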