Usability
9.2/10. Clear schema, strong documentation depth, and previewable files make onboarding fast.
A large labeled sentiment dataset for Algerian Arabic (Darija) collected from social media. It contains 120K annotated comments labeled positive, negative, or neutral.
This corpus focuses on naturally occurring social media text from Algerian users. It captures the dialectal writing styles, code-switching with French, and informal orthography that production NLP systems commonly encounter.
The dataset contains raw comments, normalized text variants, label annotations, confidence levels from annotators, and optional platform/time metadata to support bias checks and temporal drift analysis.
Curated in collaboration with LMCS annotators and language volunteers across Algiers, Oran, and Constantine.
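The temporal drift analysis mentioned above can be approximated by tracking how the label mix changes from month to month. The following is a minimal sketch, not the dataset's own tooling. It assumes rows are dictionaries keyed by the column names `posted_at` and `sentiment_label`, that timestamps are ISO 8601 strings, and that a missing timestamp is an empty string. The function name `monthly_label_shares` is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def monthly_label_shares(rows):
    """Return {"YYYY-MM": {label: share}} for rows that carry a timestamp."""
    counts = defaultdict(Counter)
    for row in rows:
        # Assumption: a missing posted_at is stored as an empty string.
        ts = (row.get("posted_at") or "").strip()
        if not ts:
            continue  # posted_at is optional, so skip rows without it
        # Assumption: timestamps are ISO 8601, e.g. "2025-10-29T12:00:00".
        month = datetime.fromisoformat(ts).strftime("%Y-%m")
        counts[month][row["sentiment_label"]] += 1
    return {
        month: {label: n / sum(labels.values()) for label, n in labels.items()}
        for month, labels in sorted(counts.items())
    }
```

A large swing in these shares between adjacent months may point to drift, although it can also reflect sampling differences, so it is best read as a prompt for closer inspection.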
3 files
| File | Size | Description |
|---|---|---|
| main.csv | 210 MB | Primary data split for training and analysis. |
| part_02.csv | 21 MB | Supplementary partition 2 for validation and incremental updates. |
| part_03.csv | 18 MB | Supplementary partition 3 for validation and incremental updates. |
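Since the data is split across three files, it can be convenient to read them as one stream. The sketch below uses only the standard library and assumes that each file is UTF-8 CSV with a header row, that all three share the same columns, and that they sit in one local directory. The helper `iter_rows` and the added `source_file` key are illustrative, not part of the dataset.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator

# File names come from the table above; the directory layout is an assumption.
DATASET_FILES = ["main.csv", "part_02.csv", "part_03.csv"]


def iter_rows(data_dir: str, files: Iterable[str] = DATASET_FILES) -> Iterator[dict]:
    """Yield rows from each CSV in order, tagging each row with its source file."""
    for name in files:
        path = Path(data_dir) / name
        # newline="" lets the csv module handle line breaks inside quoted comments.
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = name
                yield row
```

Streaming rows this way avoids holding the 210 MB primary file in memory, and the `source_file` tag makes it easy to keep the supplementary partitions apart for validation.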
Browse files, inspect rows, and sort columns before downloading.
| Column | Type | Completeness | Description |
|---|---|---|---|
| comment_id | string | 100% | Stable unique identifier for each comment. |
| raw_text | text | 100% | Original user comment before cleaning or normalization. |
| normalized_text | text | 98.7% | Normalized variant with punctuation cleanup and spelling harmonization. |
| sentiment_label | categorical | 100% | Final consensus class: positive, negative, or neutral. |
| label_confidence | float | 97.2% | Agreement-derived confidence score in range [0, 1]. |
| posted_at | datetime | 94.9% | Timestamp of the original post when available. |
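Because three of the columns are less than 100% complete, a loader needs to tolerate missing values. The sketch below turns one CSV row into a typed record following the schema above. It assumes that missing values are stored as empty strings and that `posted_at` is an ISO 8601 string, neither of which the page confirms. The names `Comment`, `parse_row`, and `keep_confident`, and the 0.8 threshold, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

VALID_LABELS = {"positive", "negative", "neutral"}


@dataclass
class Comment:
    comment_id: str
    raw_text: str
    normalized_text: Optional[str]
    sentiment_label: str
    label_confidence: Optional[float]
    posted_at: Optional[datetime]

    @property
    def text(self) -> str:
        """Prefer the normalized variant; fall back to the raw comment."""
        return self.normalized_text or self.raw_text


def parse_row(row: dict) -> Comment:
    """Convert one CSV row (all strings) into a typed, validated Comment."""
    label = row["sentiment_label"].strip()
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected sentiment_label: {label!r}")

    # Assumption: missing values are encoded as empty strings.
    conf_raw = (row.get("label_confidence") or "").strip()
    confidence = float(conf_raw) if conf_raw else None
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        raise ValueError(f"label_confidence out of range: {confidence}")

    # Assumption: timestamps are ISO 8601 strings.
    ts_raw = (row.get("posted_at") or "").strip()
    posted_at = datetime.fromisoformat(ts_raw) if ts_raw else None

    return Comment(
        comment_id=row["comment_id"],
        raw_text=row["raw_text"],
        normalized_text=(row.get("normalized_text") or "").strip() or None,
        sentiment_label=label,
        label_confidence=confidence,
        posted_at=posted_at,
    )


def keep_confident(comments, min_confidence=0.8):
    """Keep comments whose confidence is known and at least min_confidence."""
    return [
        c for c in comments
        if c.label_confidence is not None and c.label_confidence >= min_confidence
    ]
```

Dropping rows with unknown confidence is one possible policy. Treating them as a separate review bucket would be equally reasonable, depending on how much data a project can afford to lose.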
Clear schema, strong documentation depth, and previewable files make onboarding fast.
Coverage is broad, but underrepresented dialect regions remain a known gap.
Expert-guided annotation and transparent confidence fields improve trust.
Monthly refresh cadence keeps labels relevant to evolving language trends.
Amina B.
NLP Engineer
12 Jan 2026
★★★★★
Excellent baseline dataset for Darija sentiment. The normalized_text field alone saved us weeks of cleanup.
Yacine R.
Research Assistant
3 Dec 2025
★★★★☆
Great annotation quality and clear task setup. Would love a dedicated sarcasm sub-label for edge cases.
Nadia K.
Data Scientist
29 Oct 2025
★★★★★
Very practical for production experimentation. Metadata and confidence signals are useful for QA.