EthioMart Named Entity Recognition

A fine-tuned NER system for extracting product names, prices, and locations from Amharic e-commerce Telegram messages, enabling structured data extraction from unstructured Ethiopian marketplace content.

Python, Hugging Face Transformers, PyTorch, spaCy, Pandas, Jupyter

Problem

Ethiopian e-commerce operates heavily through Telegram channels in Amharic. Extracting structured product information (names, prices, locations) from these messages is challenging due to code-switching, informal language, and the lack of pre-trained NER models for Amharic.

Dataset

The dataset consists of Amharic Telegram messages scraped from EthioMart and similar Ethiopian e-commerce channels. Messages were manually annotated using the BIO tagging scheme with three entity types: product name (PRODUCT), price (PRICE), and location (LOC). The dataset includes code-switched Amharic-English text.
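To make the BIO scheme concrete, here is a minimal sketch of how token-level entity spans could be turned into BIO tags. The message, its tokenization, the span format, and the choice to include the currency word in the PRICE span are all illustrative assumptions, not taken from the project's annotation guidelines; only the three entity types come from the description above.

```python
from typing import List, Tuple

# Entity types named in the dataset description.
ENTITY_TYPES = ("PRODUCT", "PRICE", "LOC")


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, entity_type) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # continuation tokens
    return tags


# Hypothetical code-switched message: "iPhone 13, price 45000 birr, Bole".
tokens = ["iPhone", "13", "ዋጋ", "45000", "ብር", "ቦሌ"]
spans = [(0, 2, "PRODUCT"), (3, 5, "PRICE"), (5, 6, "LOC")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token ends up with exactly one tag, which is the per-token format a token classification model is trained on.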

Architecture

Multilingual BERT (mBERT) and XLM-RoBERTa were fine-tuned on the annotated dataset, each with a token classification head on top of the transformer encoder. Both were evaluated against a baseline CRF model trained on hand-crafted features.
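As a rough sketch of the setup described above, the label set can be derived from the three entity types and passed to the Hugging Face `AutoModelForTokenClassification` class, which places a linear classification layer over each token's encoder output. The checkpoint names `xlm-roberta-base` and `bert-base-multilingual-cased` are assumptions (the model sizes are not stated here), and `build_model` is a hypothetical helper name.

```python
# BIO label set derived from the three annotated entity types.
ENTITY_TYPES = ("PRODUCT", "PRICE", "LOC")
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def build_model(checkpoint: str = "xlm-roberta-base"):
    """Load a pretrained encoder with a freshly initialized token classification head.

    The default checkpoint is an assumption; "bert-base-multilingual-cased"
    would be the analogous choice for mBERT.
    """
    # Imported lazily so the label maps above can be used without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


print(LABELS)
```

Calling `build_model()` downloads the checkpoint on first use. Storing `id2label` and `label2id` in the model config means predictions can be decoded back to tag strings without a separate mapping file.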

Training

Models were fine-tuned for up to 10 epochs with learning rate warmup and weight decay, with early stopping based on validation F1. Back-translation was used as data augmentation to expand the limited annotated dataset.
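The two scheduling ideas mentioned above, learning rate warmup and early stopping on validation F1, can be sketched in plain Python. Everything numeric here is a placeholder: the peak learning rate, step counts, patience, and the validation F1 values are made up for illustration and are not the project's settings or results. The schedule shown is linear warmup followed by linear decay, one common choice; the actual schedule used is not specified.

```python
def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


class EarlyStopping:
    """Stop when validation F1 has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_f1 = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch: int, val_f1: float) -> bool:
        """Record this epoch's score and return True if training should stop."""
        if val_f1 > self.best_f1:
            self.best_f1, self.best_epoch, self.bad_epochs = val_f1, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Placeholder validation scores for a 10-epoch budget (NOT real results).
placeholder_val_f1 = [0.50, 0.60, 0.70, 0.69, 0.70, 0.68, 0.67, 0.66, 0.65, 0.64]

stopper = EarlyStopping(patience=2)  # patience is an assumed value
stopped_at = None
for epoch, f1 in enumerate(placeholder_val_f1, start=1):
    if stopper.update(epoch, f1):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

With early stopping in place, the 10-epoch setting acts as an upper bound, and the checkpoint from the best validation epoch is the one that would normally be kept.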

Results

XLM-RoBERTa achieved an F1 of 0.82 on the test set, outperforming mBERT (0.76) and the CRF baseline (0.61). Price extraction reached an F1 of 0.91, which is attributed to consistent formatting patterns in the data.
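For readers unfamiliar with how F1 is typically computed for NER, the sketch below shows entity-level, exact-match F1 over BIO sequences, where a predicted entity counts as correct only if its boundaries and type both match the gold annotation. Whether the reported scores are entity-level or token-level is not stated here, so this is an assumption, and the gold and predicted sequences are toy examples rather than project data.

```python
from typing import List, Sequence, Set, Tuple


def bio_to_spans(tags: Sequence[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, end_exclusive, type) entity spans from a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith(("B-", "I-")):
            start, etype = i, tag[2:]  # a stray I- tag is leniently treated as a start
        else:
            start, etype = None, None
    return spans


def entity_f1(gold_seqs: List[Sequence[str]], pred_seqs: List[Sequence[str]]) -> float:
    """Micro-averaged exact-match F1 over all entities in all sequences."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction truncates the PRODUCT span, so only 2 of 3 entities match.
gold = [["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "I-PRICE", "B-LOC"]]
pred = [["B-PRODUCT", "O", "O", "B-PRICE", "I-PRICE", "B-LOC"]]
score = entity_f1(gold, pred)
print(f"entity-level F1: {score:.3f}")
```

Under exact matching, a partially correct span earns no credit, which is one reason consistently formatted entities such as prices tend to score higher than free-form product names.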
