Ethiopian Medical Data Warehouse

A data engineering pipeline that scrapes, cleans, and warehouses Ethiopian medical and pharmaceutical data from Telegram channels, enabling downstream analytics and object detection for medical product images.

Python, PostgreSQL, dbt, Apache Airflow, YOLOv5, Pandas, SQLAlchemy

Problem

Ethiopian medical supply chains lack centralized, structured data. Pharmaceutical pricing, availability, and product information are scattered across informal Telegram channels. Building a reliable data warehouse enables market analysis, price monitoring, and supply chain insights.

Dataset

Raw data scraped from Ethiopian medical and pharmaceutical Telegram channels, including text messages, product images, and pricing information. Data spans multiple channels covering medical equipment, pharmaceuticals, and health services.
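
To make the shape of the scraped data concrete, here is a minimal sketch of one raw message record and a helper that pulls a price out of free text. The `RawMessage` fields, the `extract_price` name, and the price formats it recognizes ("1,200 birr", "ETB 350") are all assumptions for illustration; the project's actual schema and parsing rules may differ.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


# Hypothetical record shape: field names are assumptions, not the project's schema.
@dataclass
class RawMessage:
    channel: str
    message_id: int
    posted_at: datetime
    text: str
    image_path: Optional[str] = None  # local path of a downloaded image, if any


# Assumed price formats: "ETB 350", "1,200 birr", "85.50 Birr".
_PRICE_RE = re.compile(
    r"ETB\s*(?P<prefixed>\d[\d,]*(?:\.\d+)?)"
    r"|(?P<suffixed>\d[\d,]*(?:\.\d+)?)\s*(?:birr|ETB)",
    re.IGNORECASE,
)


def extract_price(text: str) -> Optional[float]:
    """Return the first price found in the text, or None if there is none."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    raw = match.group("prefixed") or match.group("suffixed")
    # Drop thousands separators before converting to a number.
    return float(raw.replace(",", ""))


msg = RawMessage(
    channel="example_pharma_channel",  # hypothetical channel name
    message_id=101,
    posted_at=datetime(2024, 1, 5, 10, 0),
    text="Paracetamol 500mg, price 1,200 birr",
)
price = extract_price(msg.text)
```

Messages without a recognizable price yield `None`, so they can still be kept in the raw layer and filtered later.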

Architecture

ELT pipeline: raw data is ingested into PostgreSQL staging tables, then transformed by dbt models into a star-schema data warehouse. A YOLOv5 object detection model is fine-tuned to identify and classify medical products in images. Apache Airflow orchestrates daily pipeline runs.
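
The staging-to-star-schema flow can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server, plain SQL in place of dbt models, and hypothetical table names (`stg_messages`, `dim_channel`, `fct_message`).

```python
import sqlite3


def build_warehouse(conn: sqlite3.Connection, raw_rows: list) -> None:
    """Load raw rows into staging, then derive one dimension and one fact table."""
    cur = conn.cursor()

    # Extract + Load: land the rows unchanged in a staging table.
    cur.execute(
        "CREATE TABLE stg_messages ("
        "channel TEXT, message_id INTEGER, posted_at TEXT, text TEXT)"
    )
    cur.executemany("INSERT INTO stg_messages VALUES (?, ?, ?, ?)", raw_rows)

    # Transform: one dimension row per distinct channel, with a surrogate key.
    cur.execute(
        "CREATE TABLE dim_channel ("
        "channel_key INTEGER PRIMARY KEY, channel_name TEXT UNIQUE)"
    )
    cur.execute(
        "INSERT INTO dim_channel (channel_name) "
        "SELECT DISTINCT channel FROM stg_messages ORDER BY channel"
    )

    # Transform: the fact table references the dimension by its surrogate key.
    cur.execute(
        "CREATE TABLE fct_message AS "
        "SELECT d.channel_key, s.message_id, "
        "       DATE(s.posted_at) AS date_key, "
        "       LENGTH(s.text) AS text_length "
        "FROM stg_messages AS s "
        "JOIN dim_channel AS d ON d.channel_name = s.channel"
    )
    conn.commit()


# Hypothetical sample rows: (channel, message_id, posted_at, text).
rows = [
    ("channel_a", 1, "2024-01-05T10:00:00", "Amoxicillin 250mg available"),
    ("channel_a", 2, "2024-01-05T11:30:00", "Gloves, ETB 350 per box"),
    ("channel_b", 1, "2024-01-06T09:15:00", "Blood pressure monitor in stock"),
]
conn = sqlite3.connect(":memory:")
build_warehouse(conn, rows)
n_channels = conn.execute("SELECT COUNT(*) FROM dim_channel").fetchone()[0]
n_facts = conn.execute("SELECT COUNT(*) FROM fct_message").fetchone()[0]
```

Keeping the staging table untouched and doing all reshaping in SQL afterwards is what distinguishes ELT from ETL: the transformations can be rerun or revised without scraping again.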

Training

YOLOv5 is fine-tuned on a manually labeled dataset of medical product images. The dbt transformations implement data quality checks, deduplication, and business logic. An incremental loading strategy handles daily data updates efficiently.
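
One way to combine deduplication with incremental loading is to key each message on its channel and message ID and skip rows already present. The sketch below assumes that key, uses `sqlite3` in place of PostgreSQL, and introduces the hypothetical helpers `init_target` and `incremental_load`; the project itself does this in dbt, where incremental models with a `unique_key` express a similar idea.

```python
import sqlite3


def init_target(conn: sqlite3.Connection) -> None:
    """Create the target table; the composite key is an assumed natural key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "channel TEXT NOT NULL, "
        "message_id INTEGER NOT NULL, "
        "text TEXT, "
        "PRIMARY KEY (channel, message_id))"
    )


def incremental_load(conn: sqlite3.Connection, batch: list) -> int:
    """Insert only rows whose key is not yet loaded; return the number added."""
    before = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    # INSERT OR IGNORE skips rows that would violate the primary key, which
    # removes duplicates both within the batch and against earlier loads.
    conn.executemany(
        "INSERT OR IGNORE INTO messages (channel, message_id, text) "
        "VALUES (?, ?, ?)",
        batch,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
init_target(conn)

# Day 1: the scraper returns one message twice (hypothetical sample data).
day_one = [
    ("channel_a", 1, "Amoxicillin 250mg available"),
    ("channel_a", 1, "Amoxicillin 250mg available"),
    ("channel_b", 7, "Gloves, ETB 350 per box"),
]
# Day 2: the scrape overlaps with day 1 and adds one new message.
day_two = day_one + [("channel_b", 8, "Face masks restocked")]

added_first = incremental_load(conn, day_one)
added_second = incremental_load(conn, day_two)
```

Because reloading the same batch adds nothing, a daily run can safely overlap with the previous day's scrape window or be retried after a failure.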

Results

Pipeline processes 500+ daily messages across 10+ channels. Object detection model achieves 78% mAP on medical product classification. Data warehouse enables price trend analysis and product availability tracking across the Ethiopian medical market.

GitHub Repository

View on GitHub