Considerations for analytical business data transformation pipelines

There are a number of pitfalls involved in writing business data transformation pipelines for analytical datasets. The following are some of the most important considerations when designing and implementing such pipelines.

Resource isolation

A common way for business data transformation pipelines to produce value is to join multiple datasets together.

Joining datasets together is valuable, but it is also resource-intensive. It can require many CPU cycles and large amounts of memory. If your primary service is both serving customers and performing data transformation, the result can be resource contention. Your latency to serve customers may increase due to a lack of available resources.
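To make the join concrete, here is a minimal sketch in plain Python, enriching orders with customer records. The dataset names and fields are hypothetical; a real pipeline would perform this join in a data processing framework rather than in application code.

```python
# Hypothetical stand-in datasets for an analytical join.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
]
customers = {10: {"name": "Alice"}, 11: {"name": "Bob"}}

def join_orders_with_customers(orders, customers):
    """Enrich each order with its matching customer record (inner join)."""
    return [
        {**order, "customer": customers[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in customers
    ]

enriched = join_orders_with_customers(orders, customers)
```

Even this toy version hints at the cost: one side of the join must be held in memory and every record on the other side must be matched against it, which is why joins at scale consume so much CPU and memory.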

It is not acceptable to degrade user experience as a result of data transformation logic. Keep the resources used to serve customers separate from the resources used to transform data.

Computational distribution

Joining multiple datasets together can take a long time.

It is important to build the data processing logic so that the work can be distributed across multiple machines. Flexible distribution of work across machines keeps the execution time of the pipeline consistent as input data volumes grow.

Data processing frameworks like Spark and Hive, and managed platforms like AWS EMR, support distributing data processing across multiple machines automatically and dynamically.
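The core idea can be sketched in plain Python: partition the input, then process the partitions in parallel. In a framework like Spark, each partition would be scheduled as a task on a separate machine; the record shape and worker count here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(records):
    # Stand-in transformation: aggregate the amount field within a partition.
    return sum(r["amount"] for r in records)

def run_distributed(records, workers=4):
    # Round-robin the input into one partition per worker. A framework
    # like Spark performs this partitioning, and the scheduling of
    # partitions onto machines, automatically.
    partitions = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(process_partition, partitions)
    return sum(partial_results)
```

Because each partition is processed independently, more input data can be handled by adding more workers rather than by waiting longer.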

Backfill

In the context of business data transformation pipelines, backfill means rerunning the pipeline on previously processed data.

The most common reason to perform a backfill is after a bug in the data processing logic has been found. Fixing the bug will make future data correct, but past data may also need to be corrected by running a backfill.

Most distributed batch data processing pipelines can be rerun on a previous date to perform a backfill. It is important that the input data to the pipeline is retained for long enough; otherwise, a backfill may not be possible due to missing input data.
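A backfill usually amounts to rerunning a date-parameterized pipeline over a range of past dates. The sketch below assumes a hypothetical daily pipeline; in reality `run_pipeline` would read the retained raw input for that date and rewrite that day's output.

```python
from datetime import date, timedelta

def run_pipeline(run_date):
    # Hypothetical daily pipeline run. A real implementation would read
    # the retained input data for run_date and overwrite that day's output.
    return f"processed {run_date.isoformat()}"

def backfill(start, end):
    """Rerun the pipeline for every day in the inclusive range [start, end]."""
    results = []
    day = start
    while day <= end:
        results.append(run_pipeline(day))
        day += timedelta(days=1)
    return results
```

Note that this only works if the raw input for every day in the range is still available, which is why input retention directly limits how far back a backfill can reach.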

Data latency

Acceptable data latency must be defined for all datasets.

Latency requirements have a large impact on technology choices and dataset use cases. Engineers cannot design high-quality data processing systems without clear data latency requirements.

I do not recommend setting tight data latency requirements for analytical datasets. Refreshing data every hour should be enough for most analytical use cases. When very fresh data is required, it is often better to use raw data without performing any data transformations.
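Once a latency requirement is defined, it can be checked mechanically. A minimal sketch, assuming the dataset's last-refresh timestamp is available and using a hypothetical one-hour staleness budget:

```python
from datetime import datetime, timedelta, timezone

def meets_latency_slo(last_refreshed, max_staleness=timedelta(hours=1)):
    """Return True if the dataset was refreshed within the allowed window."""
    return datetime.now(timezone.utc) - last_refreshed <= max_staleness
```

A check like this can run alongside the pipeline and feed into the same alerting path as pipeline failures.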

Monitoring, alerting, and playbooks

Data transformation pipelines require monitoring, alerting, and playbooks like any other service.

Monitoring and alerting are best performed based on pipeline success or failure events. If a pipeline takes too long to succeed or fails multiple times in a row then an alert should be fired.

The standard action when an alert fires is to roll back the pipeline code to the previously deployed version. Rollbacks should be relatively straightforward with distributed batch pipelines, but if failures have persisted for too long, a backfill may be needed to eliminate any gaps in the data.
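The alerting rule described above, fire on repeated failures or on a run that succeeds too slowly, can be sketched as a simple predicate over recent run history. The thresholds and record shape here are hypothetical:

```python
def should_alert(run_history, max_consecutive_failures=3, max_runtime_s=3600):
    """run_history: most-recent-last list of (succeeded, runtime_seconds)."""
    # Fire if the last N runs all failed.
    recent = run_history[-max_consecutive_failures:]
    all_failed = (
        len(recent) == max_consecutive_failures
        and not any(ok for ok, _ in recent)
    )
    # Fire if the latest run succeeded but took too long.
    too_slow = (
        bool(run_history)
        and run_history[-1][0]
        and run_history[-1][1] > max_runtime_s
    )
    return all_failed or too_slow
```

Evaluating success and failure events this way keeps the monitoring independent of the pipeline's internal logic: the alert only needs the outcome and duration of each run.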
