Gobblin is a great tool for ETL. It has good abstractions, and it has helped me a lot over the past two years.
However, the following problems made me want to give it up:
- I ran into exceptions that are difficult to fix
  - OutOfMemory errors
  - Task processes that hang
- It is difficult to run Gobblin in cluster mode
  - MapReduce mode is hard to use
  - YARN mode needs Helix, which is not as common as HDFS and YARN
- The components are well abstracted, but it is not easy to change some of the basic classes
- A lot of accumulated questions
These problems delayed my work several times. In those cases I turned to Spark, which solved them elegantly. Spark’s operators are lower-level than Gobblin’s components, but they are still high-level enough and flexible. Combined with a workflow scheduling tool, Spark makes it possible to schedule many applications across the cluster. Most importantly, Spark is robust and poses no feasibility risk.