Read and Write Tables From Hive With R
Connect with ODBC (Open Database Connectivity)
All dependencies and drivers should already be installed on the Saagie platform. If this is not the case, or if you want to connect from outside the platform, read the installation details at the end of the ODBC section.
Authenticate with or without Kerberos.

Without Kerberos:

library(DBI)

con <- dbConnect(odbc::odbc(),
                 Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host     = "nn1.pX.company.prod.saagie.io",
                 Port     = 10000,
                 Schema   = "default",
                 AuthMech = 3,
                 UseSASL  = 1,
                 UID      = "user",
                 PWD      = "password")
Where:

- Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Apache Hive 64-bit on the platform. The default name is usually Cloudera ODBC Driver for Apache Hive on Windows, and Cloudera ODBC Driver for Apache Hive 64-bit on Linux.
- Host is the IP or host name of the Hive server, with the format nn1.pX.company.prod.saagie.io.
- Port is 10000 by default.
- Schema is the schema in which the queries are executed.
- AuthMech is the authentication mechanism. Use value 3 for authentication with user and password.
- UseSASL enables the Simple Authentication and Security Layer. Use value 1 for authentication with user and password.
- UID is your Saagie username on the platform.
- PWD is your Saagie password on the platform.
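Once the connection succeeds, it is good practice to verify it and to release it when done. A minimal sketch using standard DBI calls (assumes the `con` object created above):

```r
library(DBI)

# Assumes `con` is the connection object created with dbConnect() above.
# Listing the tables of the current schema is a cheap way to confirm
# that the connection actually works.
print(dbGetQuery(con, "show tables"))

# Release the connection when you are finished with it.
dbDisconnect(con)
```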
If your Hive server is SSL secured, add the following parameters to your command:

- SSL=1: This parameter is mandatory and allows you to communicate with the server via SSL.
- AllowSelfSignedServerCert=1: This parameter is optional and allows authentication using self-signed certificates that have not been added to the list of trusted certificates. The certificate chain file is not yet available in Saagie containers.
- AllowHostNameCNMismatch=1: This parameter is optional and allows the common name of an SSL certificate issued by a certificate authority to differ from the host name of the Hive server.
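Put together, a connection call with these SSL parameters might look like the following sketch (host, user, and password are placeholders, as above):

```r
library(DBI)

# Same connection as before, with the SSL parameters added.
con <- dbConnect(odbc::odbc(),
                 Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host     = "nn1.pX.company.prod.saagie.io",
                 Port     = 10000,
                 Schema   = "default",
                 AuthMech = 3,
                 UseSASL  = 1,
                 UID      = "user",
                 PWD      = "password",
                 SSL      = 1,                   # Mandatory for SSL.
                 AllowSelfSignedServerCert = 1,  # Optional.
                 AllowHostNameCNMismatch   = 1)  # Optional.
```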
With Kerberos:

library(DBI)
library(getPass)

# Method 1 (interactive): use in RStudio. Enter your password in an
# interactive pop-up window.
system('kinit user', input = getPass('Enter your password: '))

# Method 2 (scripts): use outside of RStudio.
# Write the password in the command line or store it in an environment variable.
# Uncomment the following line to use it.
# system('echo password | kinit user')

con <- dbConnect(odbc::odbc(),
                 Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host     = "nn1.pX.company.prod.saagie.io",
                 Port     = 10000,
                 Schema   = "default",
                 AuthMech = 1)
Where:

- Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Apache Hive 64-bit on the platform. The default name is usually Cloudera ODBC Driver for Apache Hive on Windows, and Cloudera ODBC Driver for Apache Hive 64-bit on Linux.
- Host is the IP or host name of the Hive server, with the format nn1.pX.company.prod.saagie.io.
- Port is 10000 by default.
- Schema is the schema in which the queries are executed.
- AuthMech is the authentication mechanism. Use value 1 for authentication with a Kerberos ticket.
- The user and password passed to kinit are your Saagie username and password on the platform; with AuthMech = 1, they are not passed to dbConnect.
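Before connecting with AuthMech = 1, you can check from R that a valid Kerberos ticket exists. A small sketch (assumes the Kerberos client tools, i.e. klist, are installed in the container):

```r
# klist prints the current ticket cache; a non-zero exit status usually
# means there is no valid ticket and kinit must be run first.
status <- system("klist")
if (status != 0) {
  message("No valid Kerberos ticket found; run kinit before connecting.")
}
```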
You can now interact with Hive tables by running the following lines of code:
# With the Linux Hive ODBC driver, the dbListTables method returns only
# the first letter of each schema and table name.
# A workaround is to use "show schemas" and "show tables" in plain SQL.
dbListTables(con)  # To list all tables of all schemas.
dbGetQuery(con, 'show schemas')
dbGetQuery(con, 'show tables')
sample <- DBI::dbReadTable(con, "sample") # Retrieve all data from a table
library(dplyr)
library(dbplyr)  # Provides in_schema() and the SQL translation backend.

# To create a lazy "tbl" from a Hive table.
forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly'))
forecast_db  # Print columns and column types.

# The query is written in dplyr syntax, but is executed on the remote SQL database.
query <- forecast_db %>%
  summarise(mean_forecast = forecast %>% as.numeric() %>% mean())
show_query(query)  # To show the query that will be executed.
query  # To execute the lazy query when the result is printed.

# Example of verbs used by dplyr.
forecast_db %>%
  filter(prediction_date == max(prediction_date)) %>%
  group_by(reference) %>%
  summarise(forecast_mean = mean(forecast),
            forecast_max = max(forecast),
            nb_forecast_total = n())

sales_db <- tbl(con, in_schema('forecast', 'sales_weekly'))
sales_db

forecast_db <- forecast_db %>% mutate(reference = as.character(reference))
diff_db <- inner_join(forecast_db, sales_db, by = c('reference', 'year', 'week'))

diff_by_ref <- diff_db %>%
  group_by(reference, year, week) %>%
  summarise(diff_by_week = abs(forecast - quantity)) %>%  # Difference between forecast and reality for each prediction.
  group_by(reference) %>%
  summarise(diff_by_ref = sum(diff_by_week))  # Sum of all differences for each reference.

diff_db  # To execute all the lazy queries above.
OPTIONAL: Installation Details
To install the required dependencies, run the following lines of code:
pcks <- c('DBI', 'odbc', 'dplyr', 'dbplyr', 'getPass')  # List of package dependencies.
for (pck in pcks[!pcks %in% installed.packages()]) install.packages(pck)  # Install if missing.
Where:

- DBI is a standard database interface.
- odbc connects to ODBC databases using DBI.
- dplyr is a data manipulation library.
- dbplyr translates data manipulation code written in R into SQL.
- getPass hides the password typed in the RStudio app.
To connect from outside the platform, run the following lines of code:
# To retrieve the Hive ODBC driver for Debianoids from Cloudera.
if (!dir.exists("driver-hive")) dir.create("driver-hive")
if (!length(Sys.glob("driver-hive/clouderahiveodbc_*.deb")))
  download.file(paste0("https://downloads.cloudera.com/connectors/hive-2.5.25.1020/",
                       "Debian/clouderahiveodbc_2.5.25.1020-2_amd64.deb"),
                destfile = "driver-hive/clouderahiveodbc_2.5.25.1020-2_amd64.deb")

# To install the driver.
system("sudo dpkg -i driver-hive/clouderahiveodbc_*.deb", intern = TRUE)

# To configure the driver and the /etc/odbcinst.ini file.
system("sudo apt-get install -y odbcinst", intern = TRUE)
system("sudo odbcinst -i -d -f /opt/cloudera/hiveodbc/Setup/odbcinst.ini", intern = TRUE)

# To install ODBC.
system("sudo apt-get install -y unixodbc-dev", intern = TRUE)

# To install the R package for ODBC.
pkg <- "odbc"
if (!length(find.package(pkg, quiet = TRUE)))
  install.packages(pkg, repos = "https://cloud.r-project.org")

con <- dbConnect(odbc::odbc(),
                 Driver   = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host     = "nn1.pX.company.prod.saagie.io",
                 Port     = 10000,
                 Schema   = "default",
                 AuthMech = 3,
                 UseSASL  = 1,
                 UID      = "user",
                 PWD      = "password")

# To check the connection (list of tables in the schema).
DBI::dbGetQuery(con, "show tables")

# Example DataFrame.
dat <- head(iris)
# To make the names syntactically valid (removing dots).
names(dat) <- make.names(gsub("[._]", replacement = "", names(dat)))

# To write to Hive (and possibly overwrite).
DBI::dbSendQuery(con, "drop table if exists iris")
DBI::dbWriteTable(con, "iris", dat)

# To read it back.
dat2 <- DBI::dbReadTable(con, "iris")
# Or:
dat2 <- DBI::dbGetQuery(con, "select * from iris")
Connect with JDBC (Java Database Connectivity)
- Install the RJDBC package to allow R to connect to any DBMS (Database Management System) that has a JDBC driver.
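For example, a minimal sketch (RJDBC depends on rJava, which itself needs a working Java installation):

```r
# Install RJDBC if it is not already available, then load it.
if (!requireNamespace("RJDBC", quietly = TRUE)) install.packages("RJDBC")
library(RJDBC)
```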
- Download the Hive JDBC driver.
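The driver can be downloaded by hand from the Cloudera website, or scripted along the same lines as the ODBC installation above. The URL below is a hypothetical placeholder, not a real download link; substitute the actual Cloudera Hive JDBC driver URL for your Hive version:

```r
# Hypothetical URL: replace with the real Cloudera Hive JDBC driver link.
driver_url <- "https://example.com/hive-jdbc-driver.zip"

# The folder name matches the classPath folder used in the connection code.
if (!dir.exists("Hive_Jdbc_Folder_Path")) dir.create("Hive_Jdbc_Folder_Path")
download.file(driver_url, destfile = "Hive_Jdbc_Folder_Path/hive-jdbc-driver.zip")
unzip("Hive_Jdbc_Folder_Path/hive-jdbc-driver.zip", exdir = "Hive_Jdbc_Folder_Path")
```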
- Load the driver and open the JDBC connection by running the following lines of code:
library(RJDBC)

drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("Hive_Jdbc_Folder_Path", pattern = "jar$", full.names = TRUE),
            identifier.quote = "`")
hiveConnectionUrl <- "jdbc:hive2://IP:Port/;ssl=false"
conn <- dbConnect(drv, hiveConnectionUrl, "User_HDFS", "Password_HDFS")
- You can now interact with Hive tables by running the following lines of code:
dbListTables(conn)  # To list all tables across databases.
dbGetQuery(conn, "show tables")  # To list the tables of the "default" database.
# To get all the elements of a table in the "default" database.
d <- dbReadTable(conn, "table_name")
# OR
d <- dbGetQuery(conn, "select * from table_name")

# To get the elements of a table that is not in the "default" database.
d <- dbReadTable(conn, "nameBDD.table_name")
dbSendUpdate(conn, "CREATE TABLE table_name (attribute1 string, attribute2 int) STORED AS PARQUET")
dbSendUpdate(conn, "INSERT INTO table_name VALUES ('test', 1)")
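As with ODBC, a whole data frame can also be written in a single call through the same connection. A sketch mirroring the iris example from the ODBC section (the table name is illustrative):

```r
# Example data frame with syntactically valid column names (no dots).
dat <- head(iris)
names(dat) <- make.names(gsub("[._]", replacement = "", names(dat)))

# RJDBC's dbWriteTable can replace an existing table with overwrite = TRUE.
dbWriteTable(conn, "iris", dat, overwrite = TRUE)

# Read it back to check.
dat2 <- dbReadTable(conn, "iris")
```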
Code example to read and write tables from Hive with R (GitHub page)