In this case, the activation function does not depend on the scores of classes in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) will only depend on the loss given by its binary problem.
- Caffe: Sigmoid Cross-Entropy Loss Layer
- PyTorch: BCEWithLogitsLoss
- TensorFlow: sigmoid_cross_entropy
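The independence between classes described above can be checked with a small sketch. It uses plain Python rather than any of the listed framework layers, and the function names `bce_per_class` and `bce_gradient` are hypothetical. It assumes the standard result that the gradient of sigmoid cross-entropy with respect to a raw score is \(f(s_i) - t_i\).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_per_class(scores, targets):
    """Sum of C independent binary cross-entropy losses, one per class."""
    loss = 0.0
    for s, t in zip(scores, targets):
        p = sigmoid(s)
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss

def bce_gradient(scores, targets):
    """Gradient w.r.t. each raw score: f(s_i) - t_i, using only that class's score."""
    return [sigmoid(s) - t for s, t in zip(scores, targets)]

scores = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
grad = bce_gradient(scores, targets)

# Changing only the score of the last class leaves the other gradients untouched.
grad_shifted = bce_gradient([2.0, -1.0, 5.0], targets)
```

Here `grad[0]` and `grad[1]` are identical in both calls, which is the property that distinguishes this setup from a Softmax-based loss.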
Focal Loss was introduced by researchers from Facebook in this paper. They claim to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on the problematic classes. Moreover, they also weight the contribution of each class to the loss, as a more explicit form of class balancing. They use Sigmoid activations, so Focal Loss could also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:
Where \((1 - s_i)^\gamma\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor to reduce the influence of correctly classified samples in the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
Where we have separated the formulation for when the class \(C_i = C_1\) is positive or negative (and therefore, the class \(C_2\) is positive). As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).
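Written out with the symbols used here, the two formulations take the following form. This is a reconstruction from the definitions in this section (modulating factor \((1 - s_i)^\gamma\), \(s_2 = 1 - s_1\), \(t_2 = 1 - t_1\)), and it assumes \(s_i\) denotes the sigmoid-activated score:

\[
FL = -\sum_{i=1}^{C'=2} (1 - s_i)^{\gamma} \, t_i \log(s_i)
\]

\[
FL =
\begin{cases}
-(1 - s_1)^{\gamma} \log(s_1) & \text{if } t_1 = 1 \\
-s_1^{\gamma} \log(1 - s_1) & \text{if } t_1 = 0
\end{cases}
\]

The second case follows from the first by substituting \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\), so that \((1 - s_2)^\gamma = s_1^\gamma\).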
The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^\gamma\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression.
Where \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) (\(t_i = 0\)), we just need to replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression above and flip its sign, since \(1 - f(s_i)\) decreases as \(s_i\) grows.
Notice that, if the focusing parameter \(\gamma = 0\), the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
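For reference, applying the chain rule to the positive-class loss \(-(1 - f(s_i))^\gamma \log f(s_i)\) with \(f'(s_i) = f(s_i)(1 - f(s_i))\) gives the expressions below. This is a derivation sketch from the definitions in this section, taking \(s_i\) as the raw score and \(f(s_i)\) as its sigmoid:

\[
\frac{\partial FL}{\partial s_i} = (1 - f(s_i))^{\gamma} \left( \gamma \, f(s_i) \log f(s_i) + f(s_i) - 1 \right) \qquad (t_i = 1)
\]

\[
\frac{\partial FL}{\partial s_i} = -f(s_i)^{\gamma} \left( \gamma \, (1 - f(s_i)) \log(1 - f(s_i)) - f(s_i) \right) \qquad (t_i = 0)
\]

With \(\gamma = 0\) these reduce to \(f(s_i) - 1\) and \(f(s_i)\) respectively, that is \(f(s_i) - t_i\), which is the Binary Cross-Entropy gradient. The leading minus sign in the negative case comes from the derivative of \(1 - f(s_i)\).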
Forward pass: Loss computation
Where logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy over all classes. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
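A minimal sketch of such a forward pass is given below, in plain Python rather than as a Caffe layer. The function name `focal_loss_forward`, its signature, and the use of a plain list for `class_balances` are assumptions made for illustration. Only the names `logprobs`, `focusing_parameter` and `class_balances` come from the description above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_forward(scores, labels, class_balances, focusing_parameter=2.0):
    """Mean focal loss over a batch, plus the per-element losses.

    scores: rows of raw (pre-sigmoid) scores, one row per batch element
    labels: rows of 0/1 labels with the same shape as scores
    class_balances: one weight per class (hypothetical list layout)
    """
    logprobs = [0.0] * len(scores)
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for c, (s, t) in enumerate(zip(row_scores, row_labels)):
            p = sigmoid(s)
            if t == 0:  # negative class: modulating factor is p ** gamma
                term = -math.log(1.0 - p) * p ** focusing_parameter
            else:       # positive class: modulating factor is (1 - p) ** gamma
                term = -math.log(p) * (1.0 - p) ** focusing_parameter
            # Sum the weighted binary losses of every class for this element
            logprobs[r] += class_balances[c] * term
    return sum(logprobs) / len(scores), logprobs

scores = [[2.0, -1.0], [0.0, 3.0]]
labels = [[1, 0], [0, 1]]
loss, logprobs = focal_loss_forward(scores, labels, class_balances=[1.0, 1.0])
```

Calling the function with `focusing_parameter=0.0` and unit class balances gives the plain Binary Cross-Entropy value, which is a convenient sanity check.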
Backward pass: Gradients computation
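As a minimal sketch of this step (hypothetical function names, not the original layer code), the gradient for a single score can be written directly and compared against a finite-difference estimate. It assumes raw scores passed through a sigmoid, and it leaves out class balancing, which would only multiply each gradient by a constant.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(s, t, gamma):
    """Focal loss of one binary problem for raw score s and label t in {0, 1}."""
    p = sigmoid(s)
    if t == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)

def focal_loss_grad(s, t, gamma):
    """Analytic gradient of focal_loss with respect to the raw score s."""
    p = sigmoid(s)
    if t == 1:  # positive label
        return ((1.0 - p) ** gamma) * (gamma * p * math.log(p) + p - 1.0)
    # negative label: f(s) replaced by 1 - f(s), with the sign flipped
    return -(p ** gamma) * (gamma * (1.0 - p) * math.log(1.0 - p) - p)

def numeric_grad(s, t, gamma, eps=1e-6):
    """Central finite-difference estimate, used only to check the formula."""
    return (focal_loss(s + eps, t, gamma) - focal_loss(s - eps, t, gamma)) / (2 * eps)
```

With `gamma=0` the analytic gradient equals `sigmoid(s) - t`, the Binary Cross-Entropy gradient.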
In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the target vector \(t\) which is not zero, \(t_i = t_p\). So, discarding the elements of the summation which are zero due to the target labels, we can write:
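Assuming a Softmax activation over the scores (an assumption, since the activation is not restated in this paragraph), the simplified loss is \(CE = -\log\left(e^{s_p} / \sum_j^C e^{s_j}\right)\), where \(s_p\) is the score of the positive class. The sketch below, with hypothetical function names, checks that the full summation over one-hot labels gives the same value as this single-term form.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_full(scores, targets):
    """Full summation over all classes: -sum_i t_i * log(softmax(s)_i)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

def cross_entropy_one_hot(scores, positive_index):
    """Single-term form, valid when the target vector is one-hot."""
    return -math.log(softmax(scores)[positive_index])

scores = [1.0, 3.0, 0.2]
targets = [0, 1, 0]  # one-hot: the positive class is index 1
full = cross_entropy_full(scores, targets)
short = cross_entropy_one_hot(scores, 1)
```

Both `full` and `short` hold the same loss value, since every term with \(t_i = 0\) contributes nothing to the sum.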
This would be the pipeline for each one of the \(C\) classes. We set up \(C\) independent binary classification problems \((C' = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a "class" in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.