Read and Write Tables From Hive With R

How to read and write tables from Hive with R via ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity).

Connect with ODBC (Open Database Connectivity)

Before you begin

All dependencies and drivers should already be installed on the Saagie platform. If this is not the case, or if you want to connect from outside the platform, read the installation details at the end of the ODBC section.

  1. Authenticate with or without Kerberos.

    • Without Kerberos

    • With Kerberos

    library(DBI)
    
    con <- dbConnect(odbc::odbc(),
                          Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                          Host     = "nn1.pX.company.prod.saagie.io",
                          Port     = 10000,
                          Schema   = "default",
                          AuthMech = 3,
                          UseSASL  = 1,
                          UID      = "user",
                          PWD      = "password")

    Where:

    • Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Apache Hive 64-bit on the platform.

      The default name on Windows is usually Cloudera ODBC Driver for Apache Hive, and Cloudera ODBC Driver for Apache Hive 64-bit on Linux.
    • Host is the IP address or host name of the Hive server, which can be found under Hive > External Connections > Host, with the format nn1.pX.company.prod.saagie.io.

    • Port is 10000 by default.

    • Schema is the schema in which the queries are executed.

    • AuthMech is the authentication mechanism. Use value 3 for authentication with user and password.

    • UseSASL is for Simple Authentication and Security Layer. Use value 1 for authentication with user and password.

    • UID is your Saagie username on the platform.

    • PWD is your Saagie password on the platform.

    If your Hive server is secured with SSL, add the following parameters to your command:

    • SSL=1: This parameter is mandatory and allows you to communicate with the server via SSL.

    • AllowSelfSignedServerCert=1: This parameter is optional and allows authentication using self-signed certificates that have not been added to the list of trusted certificates.

      The certificate chain file is not yet available in Saagie containers.
    • AllowHostNameCNMismatch=1: This parameter is optional and allows the common name of an SSL certificate issued by a certificate authority to not match the host name of the Hive server.
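
    For example, a secured connection call combining these parameters might look like the following sketch (host and credentials are placeholders):

    con <- dbConnect(odbc::odbc(),
                     Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                     Host     = "nn1.pX.company.prod.saagie.io",
                     Port     = 10000,
                     Schema   = "default",
                     AuthMech = 3,
                     UseSASL  = 1,
                     UID      = "user",
                     PWD      = "password",
                     SSL      = 1,                  # Mandatory for an SSL-secured server.
                     AllowSelfSignedServerCert = 1, # Optional, see above.
                     AllowHostNameCNMismatch   = 1) # Optional, see above.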

    library(DBI)
    library(getPass)
    
    # Method 1 (interactive): Use in RStudio. Enter your password in an interactive pop-up window.
    system('kinit user', input = getPass('Enter your password: '))
    
    # Method 2 (scripts): Use outside of RStudio.
    # Write the password in the command line or store it in an environment variable.
    # Uncomment the following line to use it.
    # system('echo password | kinit user')
    
    con <- dbConnect(odbc::odbc(),
                     Driver = "Cloudera ODBC Driver for Apache Hive 64-bit",
                     Host = "nn1.pX.company.prod.saagie.io",
                     Port = 10000,
                     Schema = "default",
                     AuthMech = 1)

    Where:

    • Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Apache Hive 64-bit on the platform.

      The default name on Windows is usually Cloudera ODBC Driver for Apache Hive, and Cloudera ODBC Driver for Apache Hive 64-bit on Linux.
    • Host is the IP address or host name of the Hive server, with the format nn1.pX.company.prod.saagie.io.

    • Port is 10000 by default.

    • Schema is the schema in which the queries are executed.

    • AuthMech is the authentication mechanism. Use value 1 for authentication with a Kerberos ticket. With this mechanism, no UID or PWD parameters are needed: your identity comes from the Kerberos ticket obtained with kinit.
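
    To check that a ticket was obtained before connecting, you can list the Kerberos ticket cache (the klist utility normally ships alongside kinit):

    system('klist', intern = TRUE) # Print the cached Kerberos tickets for the current user.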

  2. You can now interact with Hive tables by running the following lines of code:

    • List Hive Tables

    • Get All Elements of a Hive Table

    • Execute Queries in dplyr Syntax

    library(dplyr)
    library(dbplyr) # Provides in_schema() and the dplyr-to-SQL translation.
    
    # With the Linux Hive ODBC driver used in this example, the dbListTables method returns only the first letter of each schema and table name.
    # A workaround is to use "show schemas" and "show tables" in plain SQL.
    
    # To list all tables of all schemas.
    dbListTables(con)
    dbGetQuery(con, 'show schemas')
    dbGetQuery(con, 'show tables')
    sample <- DBI::dbReadTable(con, "sample") # Retrieve all data from a table
    # To create a lazy "tbl" from an Impala table.
    forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly'))
    forecast_db # Print columns and column types
    
    # The query is written in dplyr syntax, but is executed on the remote SQL database.
    query <- forecast_db %>%
        summarise(mean_forecast = mean(as.numeric(forecast)))
    
    show_query(query) # To show the query that will be executed.
    query # To execute the lazy query when the result is printed.
    
    # Example of verbs used by dplyr.
    forecast_db %>%
        filter(prediction_date == max(prediction_date)) %>%
        group_by(reference) %>%
        summarise(forecast_mean = mean(forecast),
                  forecast_max = max(forecast),
                  nb_forecast_total = n())
    
    sales_db <- tbl(con, in_schema('forecast', 'sales_weekly'))
    sales_db
    
    forecast_db <- forecast_db %>%
        mutate(reference = as.character(reference))
    
    diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week'))
    
    diff_by_ref <- diff_db %>%
        group_by(reference, year, week) %>%
        summarise(diff_by_week = abs(forecast - quantity)) %>% # Absolute difference between forecast and actual sales for each prediction.
        group_by(reference) %>%
        summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference.
    
    diff_by_ref # To execute all the lazy queries above.
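    
    All the queries above stay lazy until their result is printed. To bring a final result into R as a regular data frame, and to close the connection when you are done, you can use collect() from dplyr together with dbDisconnect():
    
    diff_local <- diff_by_ref %>% collect() # Execute the query and fetch the result as a local data frame.
    dbDisconnect(con) # Close the ODBC connection.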

OPTIONAL: Installation Details

  1. To install the required dependencies, run the following lines of code:

    pcks <- c('DBI', 'odbc', 'dplyr', 'dbplyr', 'getPass') # List of package dependencies
    for (pck in pcks[!pcks %in% installed.packages()]) install.packages(pck) # Install the missing ones

    Where:

    • DBI is a standard database interface.

    • odbc connects to ODBC databases through DBI.

    • dplyr is a data manipulation library.

    • dbplyr translates data manipulation code written in R into SQL.

    • getPass is a library for hiding passwords typed in RStudio.
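
    Once the packages are installed, load them at the top of your script:

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)
    library(getPass)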

  2. To connect from outside the platform, run the following lines of code:

    # To retrieve Hive drivers for Debianoids from Cloudera
    if (!dir.exists("driver-hive")) dir.create("driver-hive")
    if (!length(Sys.glob("driver-hive/clouderahiveodbc_*.deb")))
      download.file(paste0("https://downloads.cloudera.com/connectors/hive-2.5.25.1020/",
                           "Debian/clouderahiveodbc_2.5.25.1020-2_amd64.deb"),
                    destfile = "driver-hive/clouderahiveodbc_2.5.25.1020-2_amd64.deb")
    
    # To install the driver.
    system("sudo dpkg -i driver-hive/clouderahiveodbc_*.deb", intern = TRUE)
    
    # To configure the driver and the /etc/odbcinst.ini file.
    system("sudo apt-get install -y odbcinst", intern = TRUE)
    system("sudo odbcinst -i -d -f /opt/cloudera/hiveodbc/Setup/odbcinst.ini", intern = TRUE)
    
    # To install ODBC.
    system("sudo apt-get install -y unixodbc-dev", intern = TRUE)
    
    # To install the R package for ODBC.
    pkg <- "odbc"
    if (!length(find.package(pkg, quiet = TRUE))) install.packages(pkg, repos = "https://cloud.r-project.org")
    
    con <- dbConnect(odbc::odbc(),
                          Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                          Host     = "nn1.pX.company.prod.saagie.io",
                          Port     = 10000,
                          Schema   = "default",
                          AuthMech = 3,
                          UseSASL  = 1,
                          UID      = "user",
                          PWD      = "password")
    
    # To check the connection (list of tables in the schema).
    DBI::dbGetQuery(con, "show tables")
    
    # Example of DataFrame.
    dat <- head(iris)
    
    # To make the names syntactically valid (removing dots).
    names(dat) <- make.names(gsub("[._]", replacement = "", names(dat)))
    
    # To write in Hive (and possibly overwrite).
    DBI::dbExecute(con, "drop table if exists iris")
    DBI::dbWriteTable(con, "iris", dat)
    
    # To read again
    dat2 <- DBI::dbReadTable(con, "iris")
    # Or:
    dat2 <- DBI::dbGetQuery(con, "select * from iris")
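    
    # Note: instead of dropping the table by hand, dbWriteTable() also accepts
    # "overwrite" and "append" arguments; support can vary by driver, so treat
    # this as a sketch to test against your own setup.
    DBI::dbWriteTable(con, "iris", dat, overwrite = TRUE) # Replace the table if it already exists.
    DBI::dbWriteTable(con, "iris", dat, append = TRUE)    # Or append rows to the existing table.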

Connect with JDBC (Java Database Connectivity)

  1. Install the RJDBC package to allow R to connect to any DBMS (Database Management System) that has a JDBC driver.
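
    For example (a quick sketch; RJDBC depends on rJava, which itself requires a Java installation):

    install.packages("RJDBC") # Installs RJDBC and, if missing, its rJava and DBI dependencies.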

  2. Download the Hive JDBC driver.

  3. Load the driver and the JDBC connection by running the following lines of code:

    library(DBI)
    library(RJDBC)
    
    drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
                classPath = list.files("Hive_Jdbc_Folder_Path", pattern = "jar$", full.names = TRUE),
                identifier.quote = "`")
    hiveConnectionUrl <- "jdbc:hive2://IP:Port/;ssl=false"
    conn <- dbConnect(drv, hiveConnectionUrl, "User_HDFS", "Password_HDFS")
  4. You can now interact with Hive tables by running the following lines of code:

    • List Hive Tables

    • Get All Elements of a Hive Table

    • Create a Table in parquet Format

    • Insert Data Into Hive Tables

    # To list all tables.
    dbListTables(conn)
    # To list the tables of the "default" database.
    dbGetQuery(conn, "show tables")
    # To get all the elements of a table in the "default" database.
    d <- dbReadTable(conn, "table_name")
    # OR
    d <- dbGetQuery(conn, "select * from table_name")
    # To get the elements of a table outside the "default" database, qualify it with the database name.
    d <- dbReadTable(conn, "database_name.table_name")
    dbSendUpdate(conn, "CREATE TABLE table_name (attribute1 string, attribute2 int) STORED AS PARQUET")
    dbGetQuery(conn, "INSERT INTO table_name VALUES ('test', 1)")