Understanding data pipelines
Ahmad Alawami
Aug 16, 2024 · Updated: Mar 26
One of the most fundamental concepts in data is 𝙙𝙖𝙩𝙖 𝙥𝙞𝙥𝙚𝙡𝙞𝙣𝙚𝙨.
A weak grasp of this concept is a key cause of unsuccessful data-driven decision-making and operations.
The term "pipeline" is derived from an analogy with water. Just as utilizing water requires building conduits that move it from its sources to where it is consumed, organizations build processes (i.e., sets of actions) that move data from its raw sources to user-friendly outlets.
Analogies, however, are not meant to be exact. Data is less tangible than water, rendering its 𝘴𝘱𝘢𝘯 𝘰𝘧 𝘮𝘰𝘷𝘦𝘮𝘦𝘯𝘵 much wider, ranging from very simple, almost purely formal, movements like copying the contents of one ".txt" file into another, to very complex ones like real-time data aggregations with ML scoring.
Generally, data in organizations moves through 5 phases, each characterized by an input (source), a transformation (an activity effecting change/movement in the data), and an output (destination).
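In code, every phase shares that same shape. Here is a minimal sketch in Python; the function names are placeholders for illustration, not a specific library:

    def run_stage(source, transform, sink):
        """Read records from a source, apply a transformation, write them to a destination."""
        for record in source():
            sink(transform(record))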
Here are the 5 phases in a nutshell:
𝟭. 𝗖𝗼𝗹𝗹𝗲𝗰𝘁
The first phase is gathering data from raw sources, such as databases, APIs, IoT devices, applications, and business systems. All relevant data that could provide value should be captured (the more, the merrier).
→ Example: A retailer collects sales data from its POS systems, customer feedback from its website, and inventory data from its supply chain management system.
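Collecting from a REST source can be a simple scripted pull. A minimal sketch in Python, assuming a hypothetical POS endpoint:

    import requests

    # Hypothetical POS endpoint; any REST data source works the same way.
    POS_API = "https://pos.example.com/api/v1/sales"

    def collect_sales(since):
        """Pull raw sales records from the POS system's API."""
        resp = requests.get(POS_API, params={"since": since}, timeout=30)
        resp.raise_for_status()
        return resp.json()  # a list of raw sale records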
𝟮. 𝗜𝗻𝗴𝗲𝘀𝘁
This is an interim phase where data is loaded into object stores or staging areas, often organized as event queues.
→ Example: Using tools like Apache Kafka or AWS Kinesis, the retail business ingests real-time sales data into its central data repository.
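A minimal sketch of that ingestion step with the kafka-python client; the topic name and broker address are assumptions:

    import json
    from kafka import KafkaProducer

    # Serialize each sale record as JSON before publishing.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for sale in collect_sales(since="2024-08-16"):
        producer.send("pos-sales", value=sale)  # assumed topic name

    producer.flush()  # block until buffered records are delivered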
𝟯. 𝗦𝘁𝗼𝗿𝗲
Data then needs to be transferred to a central repository, such as a data warehouse or data lake. Central data repositories enhance data integration, accessibility, and security.
→ Example: The retail business pools its data in a cloud data warehouse like Amazon Redshift or Google BigQuery for integration and later retrieval.
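Loading staged data into a warehouse can be a short batch job. A sketch with the google-cloud-bigquery client; the table and file names are illustrative assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Assumed dataset/table; autodetect infers the schema from the JSON rows.
    table_id = "retail_project.sales.raw_pos_sales"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    with open("sales_2024-08-16.jsonl", "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish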
𝟰. 𝗖𝗼𝗺𝗽𝘂𝘁𝗲
This phase involves processing data to make it suitable for consumption. It includes cleaning, aggregating, and enriching the data (e.g., correcting formats, removing redundancies, and creating partitions).
→ Example: Using tools like Apache Spark or Talend, the retail business cleans the data, removes duplicates, and aggregates it into analysis-ready datasets.
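In PySpark, that cleanup can look like this; the column and table names are assumptions for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sales-cleanup").getOrCreate()

    sales = spark.read.table("sales.raw_pos_sales")  # assumed warehouse table

    daily_sales = (
        sales
        .dropDuplicates(["transaction_id"])             # remove duplicate events
        .withColumn("sale_date", F.to_date("sold_at"))  # normalize the timestamp
        .groupBy("store_id", "sale_date")
        .agg(F.sum("amount").alias("total_sales"))      # aggregate to daily totals
    )

    # Partitioning by date keeps later reads cheap.
    daily_sales.write.mode("overwrite").partitionBy("sale_date").saveAsTable("sales.daily_sales")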
𝟱. 𝗖𝗼𝗻𝘀𝘂𝗺𝗲
Finally, the processed data is made available for analysis, reporting, model-building, and the like. The data is fed to specific tools, such as BI or AI tools, or made accessible to data professionals for specific use cases or custom analyses.
→ Example: The retail business uses a BI tool like Power BI to create interactive dashboards that provide insights into sales performance, customer behavior, and inventory levels.
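The same curated table a dashboard reads can also be pulled into a notebook for custom analysis. A sketch with the BigQuery client, reusing the table name from the earlier examples:

    from google.cloud import bigquery

    client = bigquery.Client()

    # BI tools like Power BI connect to this table directly;
    # analysts can query it ad hoc the same way.
    df = client.query(
        "SELECT store_id, sale_date, total_sales FROM sales.daily_sales"
    ).to_dataframe()

    print(df.groupby("store_id")["total_sales"].describe())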