Datawhale X Tianchi NLP - News Text Classification

Jul 21, 2020 • Yi Zhang

TASK 1

Dataset

Train

Number of training samples: 200,000

Number of classes: 14

Test A

Number of test samples: 50,000

Task

News text classification

Metrics

F1 score \(\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
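To make the formula concrete, here is a minimal pure-Python sketch of per-class F1. The formula above is for a single class, so for the 14-class task this sketch assumes the per-class scores are averaged with equal weight (macro averaging); that averaging choice, the function names `f1_score` and `macro_f1`, and the toy labels are all assumptions for illustration, not something stated above.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in labels) / len(labels)


# Toy labels, made up for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

With equal-weight averaging, a rare class counts as much as a frequent one, which is why this kind of score is sensitive to class imbalance.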

Challenge

Class imbalance
The text is encoded and therefore hard to interpret, so a proper embedding or pretraining matters
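One common way to soften class imbalance is to weight each class inversely to its frequency when training a classifier. The sketch below is only an illustration of that idea: the helper name `balanced_class_weights` and the toy label list are hypothetical, and the weighting rule (total samples divided by the number of classes times the class count) is one heuristic among several.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy label distribution, made up for illustration: 6, 3, and 1 samples.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so their errors contribute more to a weighted loss.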

Proposed Solutions

TF-IDF (LSI) + ML classifier
fastText
Word2Vec (Skip-Gram) + DL (textCNN, textRNN, BiLSTM…)
Pretraining (BERT)
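As a rough illustration of the first option in the list, the sketch below builds TF-IDF vectors in plain Python and classifies with a nearest-centroid rule. Several things here are assumptions rather than the method used in this post: documents are assumed to be strings of space-separated token IDs, the LSI step is skipped, a nearest-centroid classifier stands in for the unspecified "ML classifier", and every function name and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for each token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(doc, idf):
    """L2-normalized sparse TF-IDF vector; unseen tokens are ignored."""
    tf = Counter(tok for tok in doc.split() if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def fit_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, y in zip(docs, labels):
        for tok, v in tfidf(doc, idf).items():
            sums[y][tok] += v
    return {y: {tok: v / counts[y] for tok, v in vec.items()}
            for y, vec in sums.items()}


def predict(doc, idf, centroids):
    """Return the class whose centroid has the largest dot product."""
    vec = tfidf(doc, idf)

    def score(centroid):
        return sum(v * centroid.get(tok, 0.0) for tok, v in vec.items())

    return max(centroids, key=lambda y: score(centroids[y]))


# Toy encoded documents, made up for illustration only.
train_docs = ["12 45 45 7", "12 45 9", "88 31 31 5", "88 31 7"]
train_labels = [0, 0, 1, 1]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
pred = predict("45 12 12", idf, centroids)
```

Because TF-IDF only needs token counts, it works on encoded text without knowing what the tokens mean, which makes it a reasonable baseline before moving to embeddings or pretraining.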
