Read and Write Tables From Impala With Python
Impyla is a Python client for HiveServer2 implementations, such as Impala and Hive, used by distributed query engines. Ibis provides higher-level functionality for Hive and Impala, including a pandas-like interface to distributed data sets. If you cannot connect directly to HDFS via WebHDFS, Ibis will not allow you to write data to Impala (read only). In that case, use Impyla, which does not require any HDFS connection to read and write data with Impala.
Read and Write Tables From Impala with Python Using Impyla
- Install and import the following packages:

    from impala.dbapi import connect
    from impala.util import as_pandas
    import pandas as pd
    import os
- Connect to Impala by running the following lines of code:

    # To connect to Impala by providing the IP and port of the Impala host.
    conn = connect(host=os.environ['IP_IMPALA'],
                   port=21050,
                   user=os.environ['USER'],
                   password=os.environ['PASSWORD'],
                   auth_mechanism='PLAIN')
  The default port is 21050.

- You can now read and write tables from Impala by running the following lines of code:

    # ====== To Read Tables ====== #
    # To select data with a SQL query and retrieve it in a pandas DataFrame.
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
    df = as_pandas(cursor)
    print(df.head())

    # ====== To Write Tables ====== #
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE default.helloworld (hello STRING, world STRING)')
    cursor.execute("INSERT INTO default.helloworld VALUES ('hello1', 'world1')")
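  Because Impyla implements the Python DB-API, a small pandas DataFrame can also be
  written row by row through the same cursor. The example below is a minimal sketch,
  not part of the original snippet: it assumes the default.helloworld table created
  above and relies on DB-API-style executemany() with positional %s placeholders,
  which Impyla substitutes client-side.

    # Minimal sketch: insert a small DataFrame into the table created above.
    df_out = pd.DataFrame({'hello': ['hello2', 'hello3'],
                           'world': ['world2', 'world3']})
    cursor = conn.cursor()
    cursor.executemany("INSERT INTO default.helloworld VALUES (%s, %s)",
                       [tuple(row) for row in df_out.itertuples(index=False)])
    # Release resources once you are done.
    cursor.close()
    conn.close()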
Read and Write Tables From Impala with Python Using Ibis
- Install and import the following packages:

    import ibis
    import pandas as pd
    import os
- Connect to Impala by running the following lines of code:

    # To connect to Impala by providing the IP and port of the Impala host
    # and a WebHDFS client.
    hdfs = ibis.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
    client_impala = ibis.impala.connect(host=os.environ['IP_IMPALA'],
                                        port=21050,
                                        hdfs_client=hdfs,
                                        user=os.environ['USER'],
                                        password=os.environ['PASSWORD'],
                                        auth_mechanism='PLAIN')
  If your Impala server is secured with SSL, add the following parameters to the
  ibis.impala.connect() call (see the sketch after these notes):

  - use_ssl=True: This parameter is mandatory and enables communication with the
    server over SSL.
  - ca_cert=None: This parameter is optional. If it is not set, the certificate is
    not validated, which can be a security issue.
  Note that providing an HDFS connection is optional and is only required to write
  data to Impala. The default port is 21050.
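  As a minimal sketch (not part of the original snippet), an SSL-secured connection
  would combine these parameters with the ones shown above; the certificate path is
  a hypothetical placeholder for your environment:

    # Hypothetical SSL-secured connection; adjust or omit ca_cert as needed
    # (omitting it skips certificate validation, which is less secure).
    client_impala = ibis.impala.connect(host=os.environ['IP_IMPALA'],
                                        port=21050,
                                        hdfs_client=hdfs,
                                        user=os.environ['USER'],
                                        password=os.environ['PASSWORD'],
                                        auth_mechanism='PLAIN',
                                        use_ssl=True,
                                        ca_cert='/path/to/ca_cert.pem')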
- You can now read and write tables from Impala by running the following lines of code:
    # ====== To Read Tables ====== #
    # To select data with a SQL query.
    # limit=None: retrieves the whole table; otherwise only the first 10,000 rows
    # are returned.
    requete = client_impala.sql('SELECT * FROM helloworld')
    df = requete.execute(limit=None)

    # ====== To Write Tables ====== #
    # To create a simple pandas DataFrame with two columns.
    liste_hello = ['hello1', 'hello2']
    liste_world = ['world1', 'world2']
    df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

    # To write the DataFrame to Impala if the table does not already exist.
    db = client_impala.database('default')
    if not client_impala.exists_table('default.helloworld'):
        db.create_table('helloworld', df)

    # To write into table c the join between tables a and b.
    # No data comes back into Python; the query runs entirely in Impala.
    client_impala.raw_sql('CREATE TABLE c STORED AS PARQUET AS '
                          'SELECT a.col1, b.col2 FROM a INNER JOIN b ON (a.id = b.id)')
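  Ibis also provides the pandas-like expression interface mentioned in the
  introduction. The sketch below is not part of the original snippet; it assumes the
  default.helloworld table created above, and the filter and count are compiled to
  SQL and executed in Impala rather than in Python:

    # Sketch of the Ibis expression API (assumes the helloworld table above).
    table = client_impala.table('helloworld', database='default')
    expr = table[table.hello == 'hello1'].count()
    print(expr.execute())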
Read and Write Tables From Impala With a Kerberized Cluster
Before connecting to HDFS with a Kerberized cluster, you must get a valid ticket by running a kinit command.
- To get a valid ticket with a kinit command, you can either:

  - Run the following bash command in a terminal in Jupyter, which will prompt you
    for your password:

      kinit myusername

  - Run the following bash command in your Python job on Saagie, directly in the
    command line:

      echo $MY_USER_PASSWORD | kinit myusername
      python {file} arg1 arg2

  - Add the following lines of code directly into your Python code:

      import os
      import subprocess

      # Pipe the password from an environment variable into kinit.
      password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']),
                                  stdout=subprocess.PIPE)
      subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
- Connect to your Kerberized cluster with HDFS by running the following lines of code:

    hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'],
                             port=50470,
                             use_https=True,
                             verify=False,
                             auth_mechanism='GSSAPI')
    client_impala = ibis.impala.connect(host=os.environ['IMPALA_HOSTNAME'],
                                        port=21050,
                                        hdfs_client=hdfs,
                                        auth_mechanism='GSSAPI',
                                        use_ssl=False,
                                        kerberos_service_name='impala')
- You can now read and write tables from Impala.
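  If you use Impyla on a Kerberized cluster instead, a similar connection can be
  sketched with the same Kerberos settings. This example is not part of the original
  snippet; it assumes the kinit ticket obtained above is still valid and that the
  optional SASL dependencies Impyla's Kerberos support relies on are installed:

    # Hypothetical Impyla equivalent: reuse the kinit ticket via GSSAPI.
    from impala.dbapi import connect
    from impala.util import as_pandas

    conn = connect(host=os.environ['IMPALA_HOSTNAME'],
                   port=21050,
                   auth_mechanism='GSSAPI',
                   kerberos_service_name='impala')
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM default.helloworld LIMIT 10')
    print(as_pandas(cursor).head())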