Read and Write Tables From Impala With R
Connect with ODBC (Open Database Connectivity)
All dependencies and drivers should already be installed on the Saagie platform. If this is not the case, or if you want to connect from outside the platform, read the installation details at the end of the ODBC section.
-
Authenticate with or without Kerberos.
library(DBI) con <- dbConnect(odbc::odbc(), Driver = "Cloudera ODBC Driver for Impala 64-bit", Host = "dn1.pX.company.prod.saagie.io", Port = 21050, Schema = "default", AuthMech = 3, UseSASL = 1, UID = "user", PWD = "password")
Where:
-
Driver
is the name of the driver to use. By default, it isCloudera ODBC Driver for Impala 64-bit
on the platform.The default name on Windows is usually Cloudera ODBC Driver for Impala
, andCloudera ODBC Driver for Impala 64-bit
on Linux. -
Host
is the IP or host name of the Impala database which can be found under with the formatdn1.pX.company.prod.saagie.io
. -
Port
is21050
by default. -
Schema
is the schema in which the queries are executed. -
AuthMech
is the authentication mechanism. Use value3
for authentication with user and password. -
UseSASL
is for Simple Authentication and Security Layer. Use value1
for authentication with user and password. -
UID
is your Saagie username on the platform. -
PWD
is your Saagie password on the platform.
If your Impala is SSL secured, add the following parameters to your command:
-
SSL=1
: This parameter is mandatory and allows you to communicate with the server via SSL. -
AllowSelfSignedServerCert=1
: This parameter is optional and allows authentication using self-signed certificates that have not been added to the list of trusted certificates.The certificate chain file is not yet available in Saagie containers. -
AllowHostNameCNMismatch=1
: This parameter is optional and allows the common name of an SSL certificate issued by a certificate authority to not match the host name of the Impala server.
library(DBI) library(getPass) # Method 1 (interactive) : Use in RStudio. Enter your password in an interactive pop-up window. system('kinit user',input=getPass('Enter your password: ')) # Method 2 (scripts) : Use outside of RStudio. # Write the password in the command line or store it in an environment variable. # Uncomment the following line to use it. # system('echo password | kinit user') con <- dbConnect(odbc::odbc(), Driver = "Cloudera ODBC Driver for Impala 64-bit", Host = "dn1.pX.company.prod.saagie.io", Port = 21050, Schema = "default", AuthMech = 1)
Where:
-
Driver
is the name of the driver to use. By default, it isCloudera ODBC Driver for Impala 64-bit
on the platform.The default name on Windows is usually Cloudera ODBC Driver for Impala
, andCloudera ODBC Driver for Impala 64-bit
on Linux. -
Host
is the IP or host name of the Impala database with the formatdn1.pX.company.prod.saagie.io
. -
Port
is21050
by default. -
Schema
is the schema in which the queries are executed. -
AuthMech
is the authentication mechanism. Use value1
for authentication with Kerberos ticket. -
UID
is your Saagie username on the platform. -
PWD
is your Saagie password on the platform.
-
-
You can now interact with Hive tables by running the following lines of code:
# Troubleshooting: If you only see the first letter of each table, see "Fix Table Display Names in RStudio and dbListTables" below. # Another workaround is to use "show schemas" and "show tables" in plain SQL. # To list all tables of all schemas. dbListTables(con) dbGetQuery(con, 'show schemas') dbGetQuery(con, 'show tables')
Fix Table Display Names in RStudio and dbListTables
-
To fix the table display names issue in RStudio and dbListTables, run the following command line with an administrator account:
sudo sed -i 's/DriverManagerEncoding=UTF-32/DriverManagerEncoding=UTF-16/g' /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
-
Disconnect from Impala and restart your R session.
When you connect to Impala again, you can see the full table names displayed.
sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table
# To create a lazy "tbl" from an Impala table. forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly')) forecast_db # Print columns and column types # The query is written in dplyr syntax, but is executed on a remote SQL database. query <- forecast_db %>% summarise(mean_forecast = forecast %>% as.numeric() %>% mean) show_query(query) # To show the query that will be executed. query # To execute the lazy query when the result is printed. # Example of verbs used by dplyr. forecast_db %>% filter(prediction_date == max(prediction_date)) %>% group_by(reference) %>% summarise(forecast_mean = mean(forecast), forecast_max = max(forecast), nb_forecast_total = n()) sales_db <- tbl(con, in_schema('forecast', 'sales_weekly')) sales_db forecast_db <- forecast_db %>% mutate(reference = as.character(reference)) diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week')) diff_by_ref <- diff_db %>% group_by(reference, year, week) %>% summarise(diff_by_week = abs(forecast - quantity)) %>% # To differentiate between forecast and reality for each prediction. group_by(reference) %>% summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference. diff_db # To execute all the lazy queries above.
The default function for writing from R to Impala only works for small amounts of data. Since HDFS is used by default on Saagie platforms, let’s use it here to speed up the data transfer from R to Impala.
The example function below works in four steps:
-
It uploads the data from R to HDFS in
/tmp/
. -
It creates a temporary external table on the uploaded CSV file.
-
It copies the contents of the temporary table to the final Impala table in
parquet
format. -
It removes the temporary table and the CSV file used.
Specify the platform-specific HDFS web URL in the function. If an error occurs, verify the permissions of your Impala user and change the HDFS user used. #' To write data from R to Impala using HDFS. (1) #' #' @param con : The Impala connexion. #' @param name : The name of the table created in Impala. #' @param obj : The R DataFrame. #' @param database : The Impala database to use. #' @param overwrite : Overwrites the current table if it already exists. #' fastWriteTable <- function(con, name, obj, database = 'default', overwrite = F) { library(DBI) library(stringi) library(httr) library(data.table) if(!overwrite) { showRequest <- paste0("show tables in `", database, "` like '", name, "'") exists <- dbGetQuery(con, showRequest) if(nrow(exists) > 0) { print(paste0('Table ', database, '.', name, ' already exists. Use overwrite = T to overwrite table')) return('Table already exists') } } # HDFS information. hdfsUrl <- "https://nn1.pX.companyName.saagie.io:50470/webhdfs/v1" hdfsUser <- '&user.name=impala' # Informations about the CSV table. tableCsv <- paste0('tmp__', name, '__tmp') # Name of the table in Impala location <- paste0('/tmp/', tableCsv, '/') # Folder location of the csv file separator <- ';' # Separator in the csv file nameCsv <- paste0(name, '.csv') # To select the database. dbGetQuery(con, paste0('use ', database)) # Column names and classes. cl <- sapply(obj, class) n <- names(cl) # To match R and Impala column types. r_types <- c('numeric', 'integer', 'character', 'factor', 'c(\"POSIXct\", \"POSIXt\")') impala_types <- c('DOUBLE', 'INTEGER', 'STRING', 'STRING', 'TIMESTAMP') # To replace R types by Impala types. cl <- stringi::stri_replace_all_fixed(cl, r_types, impala_types, vectorize_all=F) # To create each column as correct type. namesTypes <- paste0(dbQuoteIdentifier(con, make.names(n)), ' ', cl, collapse = ', ') dropRequest <- paste0("DROP TABLE IF EXISTS `", tableCsv, "`") dbGetQuery(con, dropRequest) # To write the CSV to disk and copy it to Impala. fwrite(obj, nameCsv, col.names = F, sep = ';') set_config( config( ssl_verifypeer = 0L ) ) uri <- paste0(hdfsUrl, location, nameCsv, '?op=CREATE&overwrite=true', hdfsUser) response <- PUT(uri) stop_for_status(response) uriWrite <- response$url response <- PUT(uriWrite, body = upload_file(nameCsv)) stop_for_status(response) # To create a table on the CSV file. (2) createRequest <- paste0("CREATE TABLE `", tableCsv, "` (", gsub('.', '_', namesTypes, fixed = T), ") row format delimited fields terminated by '", separator, "' stored as textfile location '", location, "'") dbGetQuery(con, createRequest) dropRequest2 <- paste0("DROP TABLE IF EXISTS `", name, "`") dbGetQuery(con, dropRequest2) # To copy the CSV table into a "parquet" table. (3) parquetRequest <- paste0("CREATE TABLE `", name, "` STORED AS PARQUET AS SELECT * FROM `", tableCsv, "`") dbGetQuery(con, parquetRequest) # To optimize the table. computeRequest <- paste0("COMPUTE STATS `", name, "`") dbGetQuery(con, computeRequest) # To remove the temporary CSV table. (4) dropTmpRequest <- paste0("DROP TABLE `", tableCsv, "`") dbGetQuery(con, dropTmpRequest) return('Table correctly uploaded') }
-
OPTIONAL: Installation Details
-
To install the required dependencies, run the following lines of code:
pcks <- c('DBI', 'odbc', 'dplyr', 'dbplyr', 'implyr', 'getPass') # List of packages dependencies for (pck in pcks[pcks %in% installed.packages()]) install.packages(pck) # Install if necessary
Where:
-
DBI
is a standard database interface. -
odbc
allows to connect to ODBC databases using DBI. -
dplyr
is a data manipulation library. -
dbplyr
allows to convert data manipulation written in R to SQL. -
implyr
allows to convert data manipulation written in R to Impala. -
getPass
is a library to hide password typed in RStudio app.
-
-
To connect from outside the platform, download the Impala ODBC driver from Cloudera.
-
Depending on whether you are using Windows or Linux, do as follows:
-
Install the Windows installer (MSI).
-
Install the
odbcinst
command line tool:sudo apt-get install odbcinst
-
Install the following Debian (
.deb
) package:sudo dpkg -i clouderaimpalaodbc_*.deb
You can install other extensions depending on your distribution. -
Register the driver as it is installed by running the following command line:
sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini
-
Install the following packages:
if (!length(find.package("odbc", quiet = TRUE))) { # To install OS package for ODBC. system("sudo apt-get install -y unixodbc unixodbc-dev", intern = TRUE) # To install R package for ODBC. install.packages("odbc", repos = "https://cloud.r-project.org") } # If Impala ODBC driver is missing: if (!file.exists("/etc/odbcinst.ini") || !length(grep("Cloudera ODBC Driver for Impala 64-bit", readLines("/etc/odbcinst.ini"), ignore.case = TRUE))) { # To retrieve Impala drivers for Debianoids from Cloudera. if (!dir.exists("driver-impala")) dir.create("driver-impala") if (!length(Sys.glob("driver-impala/clouderahiveodbc_*.deb"))) download.file(paste0("https://downloads.cloudera.com/connectors/impala_odbc_2.5.40.1025/", "Debian/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb"), destfile = "driver-impala/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb") # To install the driver. system("sudo dpkg -i driver-impala/clouderaimpalaodbc_*.deb", intern = TRUE) # To configure the driver and the "/etc/odbcinst.ini" file. system("sudo apt-get install -y odbcinst", intern = TRUE) system("sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini", intern = TRUE) }
TROUBLESHOOTING: Fix Table Display Names in RStudio and dbListTables.
-
To fix the table display names issue in RStudio and dbListTables, run the following command line with an administrator account:
sudo sed -i 's/DriverManagerEncoding=UTF-32/DriverManagerEncoding=UTF-16/g' /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
-
Disconnect from Impala and restart your R session.
When you connect to Impala again, you can see the full table names displayed.
-
Connect with JDBC (Java Database Connectivity)
-
Install the RJDBC package to allow R to connect to any DBMS (Database Management System) that has a JDBC driver.
-
Download the Impala JDBC driver by running the following lines of code:
download.file('https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip?raw=true', 'jdbc.zip', method ='curl', extra ='-L') unzip('jdbc.zip')
-
Load the driver and the JDBC connection by running the following lines of code:
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver", classPath = list.files("impala-jdbc-cdh5", pattern = "jar$", full.names = TRUE), identifier.quote = "`") impalaConnectionUrl <- "jdbc:hive2://datanode1dns:port/" conn <- dbConnect(drv, impalaConnectionUrl,user="xxxx",password="xxxx")
Where
impalaConnectionUrl
allows communication with Impala via a data node without authentication.The Impala connection URL format must be jdbc:>hive2://datanode1dns:port/;auth=noSasl
where21050
is the default port. -
You can now interact with Impala tables by running the following lines of code:
# To list all databases. dbListTables(conn) # To list the "default" database. dbGetQuery(conn,"show tables")
# To get all the elements of the "default" database. d <- dbReadTable(conn,"table_name") # OR d <- dbGetQuery(conn,"select * from table_name") # To get all the other elements that are not in the "default" database. d <- dbReadTable(conn, nameBDD.table_name)
dbSendUpdate(conn, "CREATE TABLE table_name (attribute1 string, attribute2 int) STORED AS PARQUET")
dbGetQuery(conn, "INSERT INTO table_name VALUES ('test', 1)")
dbGetQuery(conn, "INVALIDATE METADATA")
-
Code example to read and write tables from Impala with R (GitHub page)