Use R on Impala With High Availability
-
Install the following package:
# To load the ODBC package. library(odbc)
-
Declare the
check_datanodes
function to return a named vector (true
orfalse
) by running the following lines of code:# WARNING: In this example, we have 9 DataNodes. You need to change this depending on your situation. DATANODES <- 'dn1;dn2;dn3;dn4;dn5;dn6;dn7;dn8;dn9' DATANODES <- unlist(strsplit(DATANODES,";")) # To test the working node in the provided list and return a named vector (true/false). check_datanodes<- function(host, port, schema, user, password, timeout){ tryCatch( expr = { before <- getTaskCallbackNames() con <-DBI::dbConnect(odbc::odbc(), Driver = ifelse(.Platform$OS.type == "windows", "Cloudera ODBC Driver for Impala", "Cloudera ODBC Driver for Impala 64-bit"), Host = host, Port = port, Schema = schema, AuthMech = 3, UseSASL = 1, UID = user, PWD = password, timeout = timeout) after <- getTaskCallbackNames() # To avoid warnings due to the RStudio Connections Pane. # before, after, and removeTaskCallback can be removed if they are used outside of RStudio. removeTaskCallback(which(!after %in% before)) return(TRUE) }, error = function(e){ return(FALSE) }) }
-
Declare the
random_node_connect
function to establish a connection to an available random DataNode by running the following lines of code:random_node_connect <- function(nodelist, port, schema,user, password, timeout = 0.5){ if(missing(nodelist)){ stop("nodelist is mandatory, please provide it.", call. = FALSE) } if(missing(user) | missing(password)){ stop("user or passsword is missing, please provide it.", call. = FALSE) } # To get a TRUE or FALSE vector for the responding nodes. answered <- sapply(nodelist, check_datanodes, port = port, schema= schema, user = user, password = password, timeout = timeout) # To get the names of the responding nodes. nodes_names <- names(answered[answered == TRUE]) # To choose a random node. rand_node <- nodes_names[sample(1:length(nodes_names), 1)] # To message with the choosen DataNode. message(paste0("Connection to : ", rand_node)) # To return the randomly selected connection object from the list of available working nodes. return(DBI::dbConnect(odbc::odbc(), Driver = ifelse(.Platform$OS.type == "windows", "Cloudera ODBC Driver for Impala", "Cloudera ODBC Driver for Impala 64-bit"), Host =rand_node, Port = port, Schema = schema, AuthMech = 3, UseSASL = 1, UID = user, PWD = password, timeout = timeout) ) }
You can now connect to an available random DataNode.
-
Run the following lines of code to get the list of available active DataNodes:
# To return a list of available DataNodes. available_dn <- sapply(DATANODES, check_datanodes, port = Sys.getenv("PORT_IMPALA"), schema = "default", user= Sys.getenv("MY_USER"), password = Sys.getenv("MY_PWD"), timeout = 0.4)
-
Run the following lines of code to establish the connection to an active random DataNode:
# To establish a connection to an available random DataNode. con <- random_node_connect(nodelist = DATANODES, port = Sys.getenv("PORT_IMPALA"), schema = "default", user = Sys.getenv("MY_USER"), password = Sys.getenv("PW_PWD"), timeout = 0.2)
Where:
-
Driver
is the name of the driver to use. By default, it isCloudera ODBC Driver for Impala 64-bit
on the platform.The default name on Windows is usually Cloudera ODBC Driver for Impala
, andCloudera ODBC Driver for Impala 64-bit
on Linux. -
Host
is the IP or host name of the Impala database which can be found under with the formatdn1.pX.company.prod.saagie.io
. -
Port
is21050
by default. -
Schema
is the schema in which the queries are executed. -
AuthMech
is the authentication mechanism. Use value3
for authentication with user and password. -
UseSASL
is for Simple Authentication and Security Layer. Use value1
for authentication with user and password. -
UID
is your Saagie username on the platform. -
PWD
is your Saagie password on the platform. -
timeout
is to end the connection attempt after x seconds.
-
-
Connect to Impala with high availability (GitHub page)
-
R package used to connect to Impala with a random active DataNode (GitHub page)