Read and Write Tables From Impala With Python

How to read and write tables from Impala with Python. There are two option to query Impala with Python, namely Impyla and Ibis.

Impyla is a Python client for HiveServer2 implementations, like Impala and Hive, for distributed query engines.

Ibis provides higher-level functionalities for Hive and Impala, including a pandas-like interface for distributed data sets.

If you cannot connect directly to HDFS via WebHDFS, Ibis will not allow you to write data in Impala (read only). In this case, use Impyla which does not require any HDFS connection to read and write data with Impala.

Read and Write Tables From Impala with Python Using Impyla

  1. Install the following packages:

    from impala.dbapi import connect
    from impala.util import as_pandas
    import pandas as pd
    import os
  2. Connect to Impala by running the following lines of code:

    # To connect to Impala by providing the IP and port of the Impala host.
    conn = connect(host=os.environ['IP_IMPALA'], port=21050, user=os.environ['USER'], password=os.environ['PASSWORD'], auth_mechanism='PLAIN')
    The default port is 21050.
  3. You can now write and query Impala tables by running the following lines of code:

    • Write Tables in Impala

    • Query Tables From Impala

    cursor = conn.cursor()
    cursor.execute('CREATE TABLE default.helloworld (hello STRING,world STRING)')
    cursor.execute("insert into default.helloworld values ('hello1','world1')")
    # ====== To Read Tables ====== #
    # To select data with a SQL query and retrieve it in a pandas DataFrame.
    
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
    df = as_pandas(cursor)
    print(df.head())

Read and Write Tables From Impala with Python Using Ibis

  1. Install the following packages:

    import ibis
    import pandas as pd
    import os
  2. Connect to Impala by running the following lines of code:

    # To Connect to Impala by providing the IP and port of the Impala host
    # and a WebHDFS client
    
    hdfs = ibis.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
    client_impala = ibis.impala.connect(host=os.environ['IP_IMPALA'], port=21050, \
      hdfs_client=hdfs, user=os.environ['USER'], password=os.environ['PASSWORD'], \
      auth_mechanism='PLAIN')

    If your Impala is SSL secured, add the following parameters to the ibis.impala.connect() command:

    • use_ssl=True: This parameter is mandatory and allows you to communicate with the server via SSL.

    • ca_cert=None: This parameter is optional. If it is not defined, the certificate will not be validated, which can be a potential security issue.

    • Providing an HDFS connection is optional and is only required to write data to Impala.

    • The default port is 21050.

  3. You can now write and query Impala tables by running the following lines of code:

    • Write Tables in Impala

    • Write Tables in Impala With Impala Table Sources

    • Query Tables From Impala

    # To create a simple pandas DataFrame with two columns.
    liste_hello = ['hello1','hello2']
    liste_world = ['world1','world2']
    df = pd.DataFrame(data = {'hello' : liste_hello, 'world': liste_world})
    
    # To write Dataframe to Impala if the table name does not exist.
    db =client_impala.database('default')
    if not client_impala.exists_table('default.helloworld'):
     db.create_table('helloworld', df)
    # To write in table C the join between tables A and B.
    client_impala.raw_sql('CREATE TABLE c STORED AS PARQUET AS SELECT a.col1, b.col2 FROM a INNER JOIN b ON (a.id=b.id)')
    # There is no incoming data in Python.
    # ====== To Read Tables ====== #
    # To select data with a SQL query.
    # limit=None: To get the whole table, otherwise there will be only the 10000 first lines.
    requete =client_impala.sql('select * from helloworld')
    df = requete.execute(limit=None)

Read and Write Files From Impala With a Kerberized Cluster

Before connecting to HDFS with a kerberized cluster, you must get a valid ticket by running a kinit command.

  1. To get a valid ticket with a kinit command, you can either:

    • Run the following bash command in a terminal in Jupyter, which will prompt you for your password:

      kinit myusername
    • Run the following bash command in your Python job on Saagie, directly in the command line:

      echo $MY_USER_PASSWORD | kinit myusername
      python {file} arg1 arg2
    • Add the following lines of code directly into your Python code:

      import os
      import subprocess
      
      password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
      subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
  2. Connect to your kerberized cluster with HDFS by running the following lines of code:

    hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'], port=50470, use_https=True, verify=False, auth_mechanism='GSSAPI')
    client_impala = ibis.impala.connect(host=os.environ['IMPALA_HOSTNAME'], port=21050, hdfs_client=hdfs, auth_mechanism="GSSAPI", use_ssl=False, kerberos_service_name="impala")
  3. You can now write and query Impala tables.