In multi-class classification tasks, there are always two kinds of trade-offs. One is between precision and recall: when tuning a classifier, improving the precision score often lowers the recall score. The other is the trade-off between classes. To address this, there are three main variants of the F1 score as design options.

F1 Score on a Single Class

Let’s look at the basic F1-score first. It is the harmonic mean of precision ($P$) and recall ($R$): $$ \text{F1} = 2 \times \frac{P \times R}{P+R} $$ Like an arithmetic mean, it always lies somewhere between precision and recall, but it differs in that it implicitly gives a larger weight to the lower number. For example, when precision is 100% and recall is 0%, the F1-score is 0%, not 50%.
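As a minimal sketch of this formula in Python (the function name `f1_score_single` is just for illustration):

```python
# A minimal sketch of the single-class F1 formula above.
def f1_score_single(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score_single(1.0, 0.0))  # 0.0 -- the lower number dominates
print(f1_score_single(0.8, 0.6))  # ~0.686, below the arithmetic mean of 0.7
```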

F1 Score on Multiple Classes

Now, let’s say we have $N$ classes, each with its own score $F1_{i}$, $i = 1, \dots, N$. To get a single scalar that summarizes the model’s performance, we usually take a simple average of the per-class F1 scores. Macro-F1 is the most common variant: it gives equal weight to each class.

$$ \text{Macro-F1} = \frac{\sum_{i}^{N} F1_{i}}{N} $$
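As a sketch, the per-class and Macro-F1 scores can be computed with scikit-learn’s `f1_score` (assuming scikit-learn is available; the labels below are a toy example):

```python
# Toy example with three classes; scikit-learn is assumed to be installed.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

per_class = f1_score(y_true, y_pred, average=None)    # one F1 per class
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean

print(per_class)                   # [0.8, 0.5, 0.667] for this toy data
print(macro_f1, per_class.mean())  # identical by definition, ~0.656
```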

However, we probably don’t want to do that if the data is highly imbalanced. With Weighted-F1, we weight the F1-score of each class by the number of samples from that class $i$, denoted $W_{i}$:

$$ \text{Weighted-F1} = \frac{\sum_{i}^{N} W_{i} \times F1_{i}}{\sum_{i}^{N} W_{i}} $$

There is an alternative way of doing it, called Micro-F1. The idea is that, rather than computing an F1 score for each class, we ignore the classes: we pool the true positives, false positives, and false negatives over all classes to compute the overall precision and overall recall, also called the micro-precision ($P_{m}$) and micro-recall ($R_{m}$). Then we plug them into the F1 formula.
$$ \text{Micro F1} = 2 \times \frac{P_{m} \times R_{m}}{P_{m}+R_{m}} $$
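Continuing the same toy labels, both variants are exposed through the `average` argument of scikit-learn’s `f1_score` (again a sketch, not the only way to compute them):

```python
# Same toy labels as above; Weighted-F1 and Micro-F1 via the `average` argument.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
micro_f1 = f1_score(y_true, y_pred, average="micro")        # pooled P_m and R_m

print(weighted_f1)  # ~0.678
# For single-label multi-class problems, Micro-F1 coincides with accuracy.
print(micro_f1, accuracy_score(y_true, y_pred))  # both ~0.667
```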

Summary

As mentioned earlier, there are three main variants of the F1 score, each offering a different view of model performance. The key idea is that, just like designing a loss function, the relative importance assigned to precision and recall, or to the different classes, should be task-oriented. Sometimes we even customize the Weighted-F1 score with heuristics or priors, as sketched below.
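For example, a customized Weighted-F1 could combine per-class F1 scores with hand-chosen prior weights; the weights in the sketch below are purely illustrative:

```python
# Sketch of a customized Weighted-F1 using hypothetical prior weights.
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

per_class = f1_score(y_true, y_pred, average=None)
priors = np.array([0.2, 0.3, 0.5])  # illustrative task-specific importance, sums to 1

custom_weighted_f1 = float(np.dot(priors, per_class))
print(custom_weighted_f1)  # ~0.64 for this toy data
```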