Use R on Impala With High Availability

You can use R on Impala with the high availability option. The high availability option allows you to connect to Impala with a random active DataNode to prevent your job from failing. It can also be used to distribute the workload across all DataNodes.

  1. Install the following package:

    # To load the ODBC package.
    library(odbc)
  2. Declare the check_datanodes function to return a named vector (true or false) by running the following lines of code:

    # WARNING: In this example, we have 9 DataNodes. You need to change this depending on your situation.
    DATANODES <- 'dn1;dn2;dn3;dn4;dn5;dn6;dn7;dn8;dn9'
    DATANODES <- unlist(strsplit(DATANODES,";"))
    
    # To test the working node in the provided list and return a named vector (true/false).
    check_datanodes<- function(host, port, schema, user, password, timeout){
      tryCatch(
        expr = {
        before <- getTaskCallbackNames()
        con <-DBI::dbConnect(odbc::odbc(),
                             Driver = ifelse(.Platform$OS.type == "windows",
                                             "Cloudera ODBC Driver for Impala",
                                             "Cloudera ODBC Driver for Impala 64-bit"),
                             Host = host,
                             Port = port,
                             Schema = schema,
                             AuthMech = 3,
                             UseSASL = 1,
                             UID = user,
                             PWD = password,
                             timeout = timeout)
        after <- getTaskCallbackNames()
        # To avoid warnings due to the RStudio Connections Pane.
        # before, after, and removeTaskCallback can be removed if they are used outside of RStudio.
        removeTaskCallback(which(!after %in% before))
        return(TRUE)
      },
      error = function(e){
        return(FALSE)
      })
    }
  3. Declare the random_node_connect function to establish a connection to an available random DataNode by running the following lines of code:

    random_node_connect <- function(nodelist, port, schema,user, password, timeout = 0.5){
      if(missing(nodelist)){
        stop("nodelist is mandatory, please provide it.", call. = FALSE)
      }
      if(missing(user) | missing(password)){
        stop("user or passsword is missing, please provide it.", call. = FALSE)
      }
    
      # To get a TRUE or FALSE vector for the responding nodes.
      answered <- sapply(nodelist, check_datanodes, port = port, schema= schema, user = user, password = password, timeout = timeout)
    
      # To get the names of the responding nodes.
      nodes_names <- names(answered[answered == TRUE])
    
      # To choose a random node.
      rand_node <- nodes_names[sample(1:length(nodes_names), 1)]
    
      # To message with the choosen DataNode.
      message(paste0("Connection to : ", rand_node))
    
      # To return the randomly selected connection object from the list of available working nodes.
      return(DBI::dbConnect(odbc::odbc(),
                            Driver = ifelse(.Platform$OS.type == "windows",
                                            "Cloudera ODBC Driver for Impala",
                                            "Cloudera ODBC Driver for Impala 64-bit"),
                            Host =rand_node,
                            Port = port,
                            Schema = schema,
                            AuthMech = 3,
                            UseSASL = 1,
                            UID = user,
                            PWD = password,
                            timeout = timeout)
      )
    }

    You can now connect to an available random DataNode.

  4. Run the following lines of code to get the list of available active DataNodes:

    # To return a list of available DataNodes.
    available_dn <- sapply(DATANODES,
                           check_datanodes,
                           port = Sys.getenv("PORT_IMPALA"),
                           schema = "default",
                           user= Sys.getenv("MY_USER"),
                           password = Sys.getenv("MY_PWD"),
                           timeout = 0.4)
  5. Run the following lines of code to establish the connection to an active random DataNode:

    # To establish a connection to an available random DataNode.
    con <- random_node_connect(nodelist = DATANODES,
                               port = Sys.getenv("PORT_IMPALA"),
                               schema = "default",
                               user = Sys.getenv("MY_USER"),
                               password = Sys.getenv("PW_PWD"),
                               timeout = 0.2)

    Where:

    • Driver is the name of the driver to use. By default, it is Cloudera ODBC Driver for Impala 64-bit on the platform.

      The default name on Windows is usually Cloudera ODBC Driver for Impala, and Cloudera ODBC Driver for Impala 64-bit on Linux.
    • Host is the IP or host name of the Impala database which can be found under Impala  External Connections  Host with the format dn1.pX.company.prod.saagie.io.

    • Port is 21050 by default.

    • Schema is the schema in which the queries are executed.

    • AuthMech is the authentication mechanism. Use value 3 for authentication with user and password.

    • UseSASL is for Simple Authentication and Security Layer. Use value 1 for authentication with user and password.

    • UID is your Saagie username on the platform.

    • PWD is your Saagie password on the platform.

    • timeout is to end the connection attempt after x seconds.