Read and Write Tables From Impala With R

How to read and write tables from Impala with R via ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity).

Connect with ODBC (Open Database Connectivity)

Before you begin

All dependencies and drivers should already be installed on the Saagie platform. If this is not the case, or if you want to connect from outside the platform, read the installation details at the end of the ODBC section.

  1. Authenticate with or without Kerberos.

    • Without Kerberos

    • With Kerberos

    library(DBI)
    
    con <- dbConnect(odbc::odbc(),
                          Driver   = "Cloudera ODBC Driver for Impala 64-bit",
                          Host     = "dn1.pX.company.prod.saagie.io",
                          Port     = 21050,
                          Schema   = "default",
                          AuthMech = 3,
                          UseSASL  = 1,
                          UID      = "user",
                          PWD      = "password")

    Where:

    • Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Impala 64-bit on the platform.

      The default name on Windows is usually Cloudera ODBC Driver for Impala, and Cloudera ODBC Driver for Impala 64-bit on Linux.
    • Host is the IP or host name of the Impala database which can be found under Impala  External Connections  Host with the format dn1.pX.company.prod.saagie.io.

    • Port is 21050 by default.

    • Schema is the schema in which the queries are executed.

    • AuthMech is the authentication mechanism. Use value 3 for authentication with user and password.

    • UseSASL is for Simple Authentication and Security Layer. Use value 1 for authentication with user and password.

    • UID is your Saagie username on the platform.

    • PWD is your Saagie password on the platform.

    If your Impala is SSL secured, add the following parameters to your command:

    • SSL=1: This parameter is mandatory and allows you to communicate with the server via SSL.

    • AllowSelfSignedServerCert=1: This parameter is optional and allows authentication using self-signed certificates that have not been added to the list of trusted certificates.

      The certificate chain file is not yet available in Saagie containers.
    • AllowHostNameCNMismatch=1: This parameter is optional and allows the common name of an SSL certificate issued by a certificate authority to not match the host name of the Impala server.

    library(DBI)
    library(getPass)
    
    # Method 1 (interactive) : Use in RStudio. Enter your password in an interactive pop-up window.
    system('kinit user',input=getPass('Enter your password: '))
    
    # Method 2 (scripts) : Use outside of RStudio.
    # Write the password in the command line or store it in an environment variable.
    # Uncomment the following line to use it.
    # system('echo password | kinit user')
    
    con <- dbConnect(odbc::odbc(),
                     Driver = "Cloudera ODBC Driver for Impala 64-bit",
                     Host = "dn1.pX.company.prod.saagie.io",
                     Port = 21050,
                     Schema = "default",
                     AuthMech = 1)

    Where:

    • Driver is the name of the driver to use. By default, it is `Cloudera ODBC Driver for Impala 64-bit ` on the platform.

      The default name on Windows is usually Cloudera ODBC Driver for Impala `, and `Cloudera ODBC Driver for Impala 64-bit on Linux.
    • Host is the IP or host name of the Impala database with the format dn1.pX.company.prod.saagie.io.

    • Port is 21050 by default.

    • Schema is the schema in which the queries are executed.

    • AuthMech is the authentication mechanism. Use value 1 for authentication with Kerberos ticket.

    • UID is your Saagie username on the platform.

    • PWD is your Saagie password on the platform.

  2. You can now interact with Hive tables by running the following lines of code:

    • List Impala Tables

    • Get All Element of an Impala Table

    • Execute Queries in dplyr Syntax

    • Write Impala Tables

    # Troubleshooting: If you only see the first letter of each table, see "Fix Table Display Names in RStudio and dbListTables" below.
    # Another workaround is to use "show schemas" and "show tables" in plain SQL.
    
    # To list all tables of all schemas.
    dbListTables(con)
    dbGetQuery(con, 'show schemas')
    dbGetQuery(con, 'show tables')
    Fix Table Display Names in RStudio and dbListTables
    1. To fix the table display names issue in RStudio and dbListTables, run the following command line with an administrator account:

      sudo sed -i 's/DriverManagerEncoding=UTF-32/DriverManagerEncoding=UTF-16/g' /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
    2. Disconnect from Impala and restart your R session.
      When you connect to Impala again, you can see the full table names displayed.

    sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table
    # To create a lazy "tbl" from an Impala table.
    forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly'))
    forecast_db # Print columns and column types
    
    # The query is written in dplyr syntax, but is executed on a remote SQL database.
    query <- forecast_db %>%
        summarise(mean_forecast = forecast %>% as.numeric() %>% mean)
    
    show_query(query) # To show the query that will be executed.
    query # To execute the lazy query when the result is printed.
    
    # Example of verbs used by dplyr.
    forecast_db %>%
        filter(prediction_date == max(prediction_date)) %>%
        group_by(reference) %>%
        summarise(forecast_mean = mean(forecast),
                  forecast_max = max(forecast),
                  nb_forecast_total = n())
    
    sales_db <- tbl(con, in_schema('forecast', 'sales_weekly'))
    sales_db
    
    forecast_db <- forecast_db %>%
        mutate(reference = as.character(reference))
    
    diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week'))
    
    diff_by_ref <- diff_db %>%
        group_by(reference, year, week) %>%
        summarise(diff_by_week = abs(forecast - quantity)) %>% # To differentiate between forecast and reality for each prediction.
        group_by(reference) %>%
        summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference.
    
    diff_db # To execute all the lazy queries above.

    The default function for writing from R to Impala only works for small amounts of data. Since HDFS is used by default on Saagie platforms, let’s use it here to speed up the data transfer from R to Impala.

    The example function below works in four steps:

    1. It uploads the data from R to HDFS in /tmp/.

    2. It creates a temporary external table on the uploaded CSV file.

    3. It copies the contents of the temporary table to the final Impala table in parquet format.

    4. It removes the temporary table and the CSV file used.

    Specify the platform-specific HDFS web URL in the function. If an error occurs, verify the permissions of your Impala user and change the HDFS user used.
    #' To write data from R to Impala using HDFS. (1)
    #'
    #' @param con : The Impala connexion.
    #' @param name : The name of the table created in Impala.
    #' @param obj : The R DataFrame.
    #' @param database : The Impala database to use.
    #' @param overwrite : Overwrites the current table if it already exists.
    #'
    fastWriteTable <- function(con, name, obj, database = 'default', overwrite = F) {
      library(DBI)
      library(stringi)
      library(httr)
      library(data.table)
    
      if(!overwrite) {
        showRequest <- paste0("show tables in `", database, "` like '", name, "'")
        exists <- dbGetQuery(con, showRequest)
        if(nrow(exists) > 0) {
          print(paste0('Table ', database, '.', name, ' already exists. Use overwrite = T to overwrite table'))
          return('Table already exists')
        }
      }
    
      # HDFS information.
      hdfsUrl <- "https://nn1.pX.companyName.saagie.io:50470/webhdfs/v1"
      hdfsUser <- '&user.name=impala'
    
      # Informations about the CSV table.
      tableCsv <- paste0('tmp__', name, '__tmp') # Name of the table in Impala
      location <- paste0('/tmp/', tableCsv, '/') # Folder location of the csv file
      separator <- ';' # Separator in the csv file
      nameCsv <- paste0(name, '.csv')
    
      # To select the database.
      dbGetQuery(con, paste0('use ', database))
    
      # Column names and classes.
      cl <- sapply(obj, class)
      n <- names(cl)
    
      # To match R and Impala column types.
      r_types <- c('numeric', 'integer', 'character', 'factor', 'c(\"POSIXct\", \"POSIXt\")')
      impala_types <- c('DOUBLE', 'INTEGER', 'STRING', 'STRING', 'TIMESTAMP')
    
      # To replace R types by Impala types.
      cl <- stringi::stri_replace_all_fixed(cl, r_types, impala_types, vectorize_all=F)
    
      # To create each column as correct type.
      namesTypes <- paste0(dbQuoteIdentifier(con, make.names(n)), ' ', cl, collapse = ', ')
    
      dropRequest <- paste0("DROP TABLE IF EXISTS `", tableCsv, "`")
      dbGetQuery(con, dropRequest)
    
      # To write the CSV to disk and copy it to Impala.
      fwrite(obj, nameCsv, col.names = F, sep = ';')
      set_config( config( ssl_verifypeer = 0L ) )
      uri <- paste0(hdfsUrl, location, nameCsv, '?op=CREATE&overwrite=true', hdfsUser)
      response <- PUT(uri)
      stop_for_status(response)
      uriWrite <- response$url
      response <- PUT(uriWrite, body = upload_file(nameCsv))
      stop_for_status(response)
    
      # To create a table on the CSV file. (2)
      createRequest <- paste0("CREATE TABLE `", tableCsv, "` (", gsub('.', '_', namesTypes, fixed = T), ") row format delimited fields terminated by '", separator, "' stored as textfile location '", location, "'")
      dbGetQuery(con, createRequest)
    
      dropRequest2 <- paste0("DROP TABLE IF EXISTS `", name, "`")
      dbGetQuery(con, dropRequest2)
    
      # To copy the CSV table into a "parquet" table. (3)
      parquetRequest <- paste0("CREATE TABLE `", name, "` STORED AS PARQUET AS SELECT * FROM `", tableCsv, "`")
      dbGetQuery(con, parquetRequest)
    
      # To optimize the table.
      computeRequest <- paste0("COMPUTE STATS `", name, "`")
      dbGetQuery(con, computeRequest)
    
      # To remove the temporary CSV table. (4)
      dropTmpRequest <- paste0("DROP TABLE `", tableCsv, "`")
      dbGetQuery(con, dropTmpRequest)
    
      return('Table correctly uploaded')
    }
OPTIONAL: Installation Details
  1. To install the required dependencies, run the following lines of code:

    pcks <- c('DBI', 'odbc', 'dplyr', 'dbplyr', 'implyr', 'getPass') # List of packages dependencies
    for (pck in pcks[pcks %in% installed.packages()]) install.packages(pck) # Install if necessary

    Where:

    • DBI is a standard database interface.

    • odbc allows to connect to ODBC databases using DBI.

    • dplyr is a data manipulation library.

    • dbplyr allows to convert data manipulation written in R to SQL.

    • implyr allows to convert data manipulation written in R to Impala.

    • getPass is a library to hide password typed in RStudio app.

  2. To connect from outside the platform, download the Impala ODBC driver from Cloudera.

  3. Depending on whether you are using Windows or Linux, do as follows:

    • Under Windows

    • Under Linux

    1. Install the Windows installer (MSI).

    1. Install the odbcinst command line tool:

      sudo apt-get install odbcinst
    2. Install the following Debian (.deb) package:

      sudo dpkg -i clouderaimpalaodbc_*.deb
      You can install other extensions depending on your distribution.
    3. Register the driver as it is installed by running the following command line:

      sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini
    4. Install the following packages:

      if (!length(find.package("odbc", quiet = TRUE))) {
      
       # To install OS package for ODBC.
       system("sudo apt-get install -y unixodbc unixodbc-dev", intern = TRUE)
       # To install R package for ODBC.
       install.packages("odbc", repos = "https://cloud.r-project.org")
      
      }
      
      # If Impala ODBC driver is missing:
      if (!file.exists("/etc/odbcinst.ini") ||
       !length(grep("Cloudera ODBC Driver for Impala 64-bit", readLines("/etc/odbcinst.ini"), ignore.case = TRUE))) {
      
       # To retrieve Impala drivers for Debianoids from Cloudera.
       if (!dir.exists("driver-impala")) dir.create("driver-impala")
       if (!length(Sys.glob("driver-impala/clouderahiveodbc_*.deb")))
       download.file(paste0("https://downloads.cloudera.com/connectors/impala_odbc_2.5.40.1025/",
       "Debian/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb"),
       destfile = "driver-impala/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb")
      
       # To install the driver.
       system("sudo dpkg -i driver-impala/clouderaimpalaodbc_*.deb", intern = TRUE)
      
       # To configure the driver and the "/etc/odbcinst.ini" file.
       system("sudo apt-get install -y odbcinst", intern = TRUE)
       system("sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini", intern = TRUE)
      
      }
    5. TROUBLESHOOTING: Fix Table Display Names in RStudio and dbListTables.

      1. To fix the table display names issue in RStudio and dbListTables, run the following command line with an administrator account:

        sudo sed -i 's/DriverManagerEncoding=UTF-32/DriverManagerEncoding=UTF-16/g' /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
      2. Disconnect from Impala and restart your R session.
        When you connect to Impala again, you can see the full table names displayed.

Connect with JDBC (Java Database Connectivity)

  1. Install the RJDBC package to allow R to connect to any DBMS (Database Management System) that has a JDBC driver.

  2. Download the Impala JDBC driver by running the following lines of code:

    download.file('https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip?raw=true', 'jdbc.zip',
                  method ='curl', extra ='-L')
    unzip('jdbc.zip')
  3. Load the driver and the JDBC connection by running the following lines of code:

    drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
     classPath = list.files("impala-jdbc-cdh5", pattern = "jar$", full.names = TRUE),
     identifier.quote = "`")
    
    impalaConnectionUrl <- "jdbc:hive2://datanode1dns:port/"
    conn <- dbConnect(drv, impalaConnectionUrl,user="xxxx",password="xxxx")

    Where impalaConnectionUrl allows communication with Impala via a data node without authentication.

    The Impala connection URL format must be jdbc:>hive2://datanode1dns:port/;auth=noSasl where 21050 is the default port.
  4. You can now interact with Impala tables by running the following lines of code:

    • List Impala Tables

    • Get All Element of a Impala Table

    • Create a Table in parquet Format

    • Insert Data Into Impala Tables

    • Refresh Tables

    # To list all databases.
    dbListTables(conn)
    # To list the "default" database.
    dbGetQuery(conn,"show tables")
    # To get all the elements of the "default" database.
    d <- dbReadTable(conn,"table_name")
    # OR
    d <- dbGetQuery(conn,"select * from table_name")
    # To get all the other elements that are not in the "default" database.
    d <- dbReadTable(conn, nameBDD.table_name)
    dbSendUpdate(conn, "CREATE TABLE table_name (attribute1 string, attribute2 int) STORED AS PARQUET")
    dbGetQuery(conn, "INSERT INTO table_name VALUES ('test', 1)")
    dbGetQuery(conn, "INVALIDATE METADATA")