How to Do Big Data Integrity Validation

Background


For one input record, any of the following operations may be applied:

  1. map -> convert to another record
  2. flatMap -> convert to zero or more other records
  3. filter -> keep or drop
  4. aggregate -> group by some keys and aggregate into another record

And in a data flow, one record may pass through many such steps, as in the sketch below.
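Below is a minimal PySpark RDD sketch of the four operation types, assuming a Spark environment is available. The record fields (user, country, amount) and the processing logic are hypothetical; they only illustrate how one input record can fan out across steps.

```python
# Minimal PySpark sketch (assumed setup) of the four per-record operations.
# Field names and logic are hypothetical, only to show how records fan out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-ops-sketch").getOrCreate()
sc = spark.sparkContext

records = sc.parallelize([
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 0},
    {"user": "a", "country": "US", "amount": 5},
])

converted = records.map(lambda r: {**r, "amount_cents": r["amount"] * 100})        # 1. map
expanded = converted.flatMap(lambda r: [r] * (2 if r["country"] == "US" else 1))   # 2. flatMap
kept = expanded.filter(lambda r: r["amount_cents"] > 0)                            # 3. filter
totals = kept.map(lambda r: (r["country"], r["amount_cents"])) \
             .reduceByKey(lambda a, b: a + b)                                      # 4. aggregate

print(totals.collect())  # [('US', 3000)] -- the DE record is dropped by the filter
spark.stop()
```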

We want to validate data in such a situation, in a Big Data environment, to check:

  1. whether there is data loss
  2. whether there is any calculation error

Ideas


As the picture above shows, the options range from checking 0% to 100% of output records:

  • None (0%)

    Checking nothing carries a very high risk. It is unacceptable.

  • Every (100%)

    Checking everything has a very high cost and is infeasible. For n input records, several processing steps may produce n*n*... output records, and in a Big Data environment we cannot validate each step for every record.

  • Statistic

    This is what most teams do. We can aggregate on selected features/dimensions and check the statistical values, and by correlating related metrics we can improve our confidence in the conclusion. However, in many situations statistics give only a rough view; we cannot get a precise result. For example, if the count of one metric was 100 yesterday, we will suspect a problem if it drops to 50 today. But what about 75? Or 85? (A sketch of this check follows this list.)

  • Sampling

    By generating predefined inputs with special identifiers, the output of each deterministic processing step can be validated against known expected values. The sampled records run continuously alongside the online data flow to detect possible exceptions. (A sketch of this check follows the conclusion below.)
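Here is a minimal sketch of the statistical check, in plain Python so it stays self-contained: aggregate today's output by one dimension and compare each count against yesterday's stored value. The 10% tolerance, the `country` dimension, and the stored counts are hypothetical choices, not recommendations.

```python
# Sketch of the statistical check: aggregate today's output by a dimension,
# then compare each count against yesterday's within a tolerance.
# The 10% threshold and the record shape are hypothetical illustrations.
from collections import Counter

def dimension_counts(records, dimension):
    """Count output records per value of one dimension (e.g. country)."""
    return Counter(r[dimension] for r in records)

def check_counts(today_counts, yesterday_counts, tolerance=0.10):
    """Flag dimensions whose count moved more than `tolerance` day over day."""
    alerts = []
    for key, yesterday in yesterday_counts.items():
        today = today_counts.get(key, 0)
        if yesterday and abs(today - yesterday) / yesterday > tolerance:
            alerts.append((key, yesterday, today))
    return alerts

today = dimension_counts(
    [{"country": "US"}, {"country": "US"}, {"country": "DE"}], "country")
yesterday = Counter({"US": 2, "DE": 6})
print(check_counts(today, yesterday))  # [('DE', 6, 1)] -> suspicious drop
```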

For Big Data, there is no perfect solution for data integrity validation. By combining the statistical and sampling approaches, we can reach an acceptable one.
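And here is a minimal sketch of the sampling check under the same hypothetical setup: canary records carry a reserved identifier, flow through the normal pipeline, and their outputs are compared against precomputed expected values, which covers both data loss and calculation errors. `run_pipeline`, the canary prefix, and the expected values are all illustrative stand-ins, not a real API.

```python
# Sketch of the sampling check: predefined canary records with a reserved
# identifier are mixed into the normal flow; because their expected outputs
# are known in advance, any mismatch signals loss or a calculation error.
# `run_pipeline`, the canary prefix, and the expected values are hypothetical.
CANARY_PREFIX = "__canary__"

canaries = [
    {"id": f"{CANARY_PREFIX}1", "amount": 10},
    {"id": f"{CANARY_PREFIX}2", "amount": 20},
]
expected = {f"{CANARY_PREFIX}1": 1000, f"{CANARY_PREFIX}2": 2000}

def run_pipeline(records):
    # Stand-in for the real data flow: here it just converts amounts to cents.
    return [{"id": r["id"], "amount_cents": r["amount"] * 100} for r in records]

def check_canaries(output):
    seen = {r["id"]: r["amount_cents"]
            for r in output if r["id"].startswith(CANARY_PREFIX)}
    lost = expected.keys() - seen.keys()                      # data loss
    wrong = {k for k, v in seen.items() if expected[k] != v}  # calculation error
    return lost, wrong

lost, wrong = check_canaries(run_pipeline(canaries + [{"id": "real-1", "amount": 7}]))
print(lost, wrong)  # set() set() -> canaries survived and were computed correctly
```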