Datawhale X Tianchi NLP - News Text Classification
TASK 1
Dataset
Train
Number of training samples: 200,000
Number of classes: 14
Test A
Number of test samples: 50,000
Task
News text classification
Metrics
F1 score: \(\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
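With 14 classes, the F1 formula above has to be applied per class and then averaged. The sketch below computes one-vs-rest F1 for each label and takes the unweighted mean (macro averaging). The averaging mode is an assumption, since the notes do not state which one the competition uses; the function names and toy labels are made up for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true label differs
            fn[t] += 1  # the true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging mode)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example with two classes (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging gives every class the same weight regardless of its size, so a model that ignores rare classes is penalized more than it would be under plain accuracy.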
Challenges
- Class imbalance
- The text is encoded (anonymized), so it is hard to interpret directly; good embeddings or pretraining therefore matter
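One common way to address class imbalance is to weight each class inversely to its frequency in the loss or in the classifier. The sketch below uses the heuristic n_samples / (n_classes * class_count); it is one possible remedy rather than the approach prescribed by these notes, and the label counts are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical label distribution: class 0 is common, class 2 is rare.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, and the weights summed over all samples still equal the number of samples, so the overall scale of the loss is unchanged.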
Proposed Solutions
- TF-IDF (LSI) + ML classifier
- fastText
- Word2Vec (Skip-gram) + DL (TextCNN, TextRNN, BiLSTM…)
- Pretraining, BERT
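As a rough illustration of the first option in the list above, the following is a minimal standard-library sketch of a TF-IDF baseline. It assumes each document is a string of space-separated token IDs, which is a guess at the encoded format. It also swaps in a nearest-centroid rule as a stand-in for the unspecified "ML classifier" and omits the LSI step. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in docs."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(doc, idf):
    """L2-normalized sparse TF-IDF vector; unseen tokens are dropped."""
    tf = Counter(tok for tok in doc.split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class into one centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for tok, v in tfidf(doc, idf).items():
            sums[label][tok] += v
    return {
        label: {tok: v / counts[label] for tok, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(doc, idf, centroids):
    """Pick the class whose centroid has the largest dot product with doc."""
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(tok, 0.0) for tok, v in vec.items())

    return max(sorted(centroids), key=score)


# Toy data: space-separated token IDs (assumed format, invented values).
train_docs = ["12 7 7 93", "7 12 55", "301 44 44 8", "44 301 9"]
train_labels = [0, 0, 1, 1]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("7 12 12", idf, centroids))
print(predict("44 8 301", idf, centroids))
```

In practice a library implementation of TF-IDF with a linear classifier would likely replace this; the sketch only shows that encoded tokens can be treated as ordinary vocabulary items without knowing what they mean.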