Note: I’ve updated the code from this blog here: spark-table-fields-search
Background
You may have a huge volume of data to analyse, including:
- Hundreds of tables in Hive
- Hundreds of fields in a table
It’s very common to correlate data by joining different tables during analysis, where the join condition is a tuple of fields. To finish the analysis as quickly as possible, you need a fast way to check which tables contain a given set of fields.
Solution
The short program below performs the search.
import org.apache.spark.sql.SparkSession

object Test {
  val db = "test"

  def main(args: Array[String]): Unit = {
    // Fields are passed as a single comma-separated argument, e.g. "user_id,order_id"
    search(args(0).split(",").toSet)
  }

  def getSparkSession(name: String): SparkSession =
    SparkSession.builder().appName(name).enableHiveSupport().getOrCreate()

  def search(fields: Set[String]): Unit = {
    val spark = getSparkSession("Search-Table-Catalog")
    // List every table in the database, then inspect its columns
    val tables = spark.catalog.listTables(db).collect()
    for (t <- tables) {
      val colNames = spark.catalog.listColumns(db, t.name).collect().map(_.name).toSet
      // A table matches when it contains every requested field
      if (fields.subsetOf(colNames)) {
        println("Found matching table: " + t)
      }
    }
  }
}
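The core of the search is the `subsetOf` check: a table matches only when its column set contains every requested field. A minimal standalone sketch of that logic, using hypothetical table/column names (no Spark required):

```scala
// Hypothetical column sets for two tables (names are for illustration only).
val ordersCols = Set("order_id", "user_id", "amount", "ts")
val usersCols  = Set("user_id", "name", "email")

// The fields we want every matching table to contain.
val wanted = Set("user_id", "order_id")

// A table qualifies when the wanted fields are a subset of its columns.
def containsAll(cols: Set[String], wanted: Set[String]): Boolean =
  wanted.subsetOf(cols)

println(containsAll(ordersCols, wanted)) // true: has both user_id and order_id
println(containsAll(usersCols, wanted))  // false: missing order_id
```

In the full program, `cols` comes from `spark.catalog.listColumns` instead of a hard-coded set.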
You can execute it in spark-shell or submit it with spark-submit.
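For spark-submit, an invocation might look like the sketch below. The jar name and master are assumptions; the last argument is the comma-separated field list read by `main`:

```shell
# Hypothetical invocation; adjust the jar name, class, and master for your cluster.
spark-submit \
  --class Test \
  --master yarn \
  search-table-catalog.jar \
  user_id,order_id
```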