[ Case study ]
Data Renewal: From Chaos to Efficiency with Airflow ETL, a 6x Query Boost, and 98.8% Resource Savings
Client request
BPMobile was grappling with an increasingly cluttered data storage system.
Data extraction had become so slow that a single day's data could no longer be loaded within a day, and frequent database errors disrupted the work.
Project goal
Our primary objectives were to streamline BPMobile's data storage system, speed up data retrieval, and substantially reduce the number of errors during data collection.
Key challenges we faced:
- Transitioning all ETL pipelines from conventional Python scripts to Apache Airflow.
- Navigating the cluttered, inefficient state of the client's previous data storage system.
Solutions and technologies
We proposed our own code base and migrated all ETL pipelines from Python scripts to Airflow, gaining Airflow features such as scheduling, automatic retries, and monitoring.
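To illustrate what moving a script into Airflow can look like, here is a minimal sketch, assuming Airflow 2.4 or later. The DAG id, the function names, and the sample rows are hypothetical and are not taken from BPMobile's code base. The ETL logic stays in plain Python functions, and Airflow adds the daily schedule and retries around them.

```python
from datetime import datetime, timedelta


def extract(run_date: str) -> list:
    """Hypothetical extract step: return raw rows for one day."""
    return [
        {"event_date": run_date, "installs": "12"},
        {"event_date": run_date, "installs": "7"},
    ]


def transform(rows: list) -> list:
    """Cast string counters to integers."""
    return [
        {"event_date": r["event_date"], "installs": int(r["installs"])}
        for r in rows
    ]


def load(rows: list) -> int:
    """Stand-in for a warehouse write; returns the number of rows handled."""
    return len(rows)


def run_daily_etl(run_date: str) -> int:
    """The body of the former standalone script, now callable as a task."""
    return load(transform(extract(run_date)))


try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    DAG = None  # Airflow is not installed; the functions above still work

if DAG is not None:
    with DAG(
        dag_id="daily_etl_sketch",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        # Failed runs are retried automatically instead of failing silently.
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        PythonOperator(
            task_id="run_daily_etl",
            python_callable=run_daily_etl,
            # "{{ ds }}" is rendered by Airflow to the run's logical date.
            op_kwargs={"run_date": "{{ ds }}"},
        )
```

Keeping the task body as an ordinary function means it can still be run and tested outside Airflow, for example with `run_daily_etl("2024-01-01")`.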
We migrated the existing DWH by changing the Redshift cluster and moving the heaviest data sources to Redshift Spectrum.
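Redshift Spectrum queries data that stays in S3 through external tables, so a heavy source no longer has to occupy cluster storage. The sketch below builds the kind of DDL involved. The schema, table, columns, and bucket are hypothetical, and the Parquet format is an assumption rather than a detail from the project.

```python
def external_table_ddl(schema, table, columns, partition, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for Redshift Spectrum.

    columns is a dict of column name -> SQL type; partition is a
    (name, type) pair. Inputs are trusted constants, not user input,
    so plain string formatting is acceptable here.
    """
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns.items())
    part_name, part_type = partition
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({part_name} {part_type})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    print(
        external_table_ddl(
            schema="spectrum",            # hypothetical external schema
            table="raw_events",           # hypothetical heavy source
            columns={"event_id": "varchar(64)", "payload": "varchar(65535)"},
            partition=("event_date", "date"),
            s3_location="s3://example-bucket/raw_events/",
        )
    )
```

The external schema itself has to exist first (created with `CREATE EXTERNAL SCHEMA`), and each new partition must be registered with `ALTER TABLE ... ADD PARTITION` before it can be queried.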
We implemented a data pipeline that runs the ML model calculations on AWS Batch in Docker containers.
Together, these changes streamlined the data processes and improved efficiency.
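As a rough illustration of the AWS Batch step, the sketch below submits a containerized ML job with boto3. The queue name, job definition name, script name, and environment variable are hypothetical placeholders. The job definition is assumed to already point at a Docker image that contains the model code.

```python
def build_batch_job_request(run_date, job_queue="ml-queue",
                            job_definition="ml-model-job"):
    """Build the keyword arguments for the Batch SubmitJob call.

    All names are hypothetical; the job definition is assumed to
    reference a Docker image with the model code inside.
    """
    return {
        "jobName": f"ml-model-{run_date}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Override the container command to process one specific day.
            "command": ["python", "run_model.py", "--date", run_date],
            "environment": [{"name": "RUN_DATE", "value": run_date}],
        },
    }


def submit_ml_job(run_date):
    """Submit the job and return its id. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("batch")
    response = client.submit_job(**build_batch_job_request(run_date))
    return response["jobId"]
```

With this approach, compute is provisioned by Batch only while a job runs, which is one plausible source of the kind of resource savings described in this case study.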
Result
We carried out a comprehensive redesign of BPMobile's Data Warehouse, cutting storage costs by 50% and making queries six times faster. Our revamped ETL processes made raw data collection 12 times faster, and the earlier pipeline instability issues were fully resolved.
The introduction of a specialized data pipeline for ML model calculations led to resource savings of 98.8%.