Maybe I Should Give Up Using Gobblin

Gobblin is a great tool for ETL. Its abstractions are well designed, and it has helped me a lot over the past two years.

But the following reasons make me want to give it up:

  1. I ran into exceptions that are difficult to fix
    1. OutOfMemory errors
    2. Task processes that hang
  2. Difficult to run Gobblin in cluster mode
    1. MapReduce mode is hard to use
    2. YARN mode needs Helix, which is not as common as HDFS and YARN
  3. The components are well abstracted, but it’s not easy to change some of the basic classes
  4. A lot of accumulated unanswered questions

These problems delayed my work several times. In those cases I turned to Spark, which solved the problems elegantly. Spark’s operators sit at a lower level than Gobblin’s components, but they are still high-level enough and flexible. Combined with a workflow scheduling tool, I can schedule many Spark applications across the cluster. Most importantly, Spark is robust, and I see little risk that a job will turn out to be infeasible with it.