TASK 1

Dataset

Train

Number of training samples: 200,000

Number of classes: 14

Test A

Number of test samples: 50,000

Task

News text classification

Metrics

F1 score \(\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
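The formula above is defined per class, so a 14-class task needs some way to combine the per-class scores. The sketch below computes per-class F1 and then takes the unweighted mean (macro averaging). Macro averaging is an assumption here, since the notes do not say which averaging is used, and the function names and toy labels are purely illustrative.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, using F1 = 2 * P * R / (P + R) for each class."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels, not taken from the dataset.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
scores = f1_per_class(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(scores, macro)
```

With macro averaging every class counts equally, so poor performance on a rare class lowers the score as much as poor performance on a frequent one.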

Challenge

  • Class imbalance
  • Text is encoded and therefore hard to interpret, so a proper embedding or pretraining matters
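One common way to counter class imbalance is to weight each class inversely to its frequency in the loss. The notes do not say which remedy is intended, so the sketch below is only one option; it uses the heuristic w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The function name and toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy label distribution, not taken from the dataset.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on rare classes contribute more to a weighted loss.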

Proposed Solutions

  1. TF-IDF (LSI) + ML classifier
  2. fastText
  3. Word2Vec (Skip-gram) + DL (textCNN, textRNN, BiLSTM…)
  4. Pretraining (BERT)
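As an illustration of the first option, the sketch below computes TF-IDF vectors in plain Python. It assumes each encoded document is a string of whitespace-separated token IDs, which the notes do not confirm, and it uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, as one common variant. The function name and sample documents are hypothetical; the resulting sparse vectors would then be fed to an ML classifier.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + n)) + 1.0 for tok, n in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {tok: (cnt / len(toks)) * idf[tok] for tok, cnt in tf.items()}
        )
    return vectors


# Made-up encoded documents: each token is an opaque ID.
docs = ["12 7 7 301", "12 45 45 45", "7 301 88"]
vectors = tfidf(docs)
print(vectors[0])
```

Because the weights depend only on token counts, this representation works even when the token IDs cannot be mapped back to readable words.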

Reference