F1-score is one of the most important evaluation metrics in machine learning. It elegantly sums up the predictive performance of a model by combining two otherwise competing metrics: precision and recall. This post is written as an extension of my two previous posts on accuracy, precision and recall, and I encourage you to check them out!

In this post, I’ll cover all the essential things you need to know about F1-score. I’ll set the context by explaining when to use either precision or recall. Next, I’ll define F1-score and explain when it should be used. Lastly, I’ll compare the different types of averaging two numbers, including harmonic, geometric and arithmetic means, and discuss why F1-score is based on harmonic mean.

It is not possible to discuss F1-score without first setting the context with precision and recall. In gist, precision and recall are metrics that help us evaluate the predictive performance of a classification model on a particular class of interest, also known as the positive class.

Precision: Of all positive predictions, how many are really positive?
Recall: Of all real positive cases, how many are predicted positive?

Formally, precision and recall are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN are the numbers of True Positives, False Positives and False Negatives respectively.

So, when should you use precision, and when should you use recall? If you think about it, precision and recall both measure a model’s predictive performance, but in different ways. Precision measures the extent of error caused by False Positives (FPs) whereas recall measures the extent of error caused by False Negatives (FNs). Therefore, to decide which metric to use, we should assess the relative impact of these two types of errors on our use-case.
Thus, the key question we should be asking is:

“Which type of error, FPs or FNs, is more undesirable for our use-case?”

Let’s contextualise this by revisiting our cancer prediction example. Figure 1 shows the confusion matrix summarising the hypothetical prediction outcomes. Out of the four scenarios, Scenarios #2 and #3 are undesirable.

Figure 1: Confusion matrix for cancer prediction (Image by Author)

Scenario #2 represents FPs. Of 900 patients who really don’t have cancer, the model says 80 of them do. These 80 patients will probably undergo expensive and unnecessary treatments, at the expense of their well-being.

Scenario #3 represents FNs. Of 100 patients who really have cancer, the model says 20 of them don’t. These 20 patients would go undiagnosed and fail to receive proper treatment.

Between these two scenarios, which is more undesirable? We could argue that it’s Scenario #3. It’s probably worse to not receive any treatment, which could place one’s life in danger, than to receive unnecessary treatment. Since the impact of errors caused by FNs is assessed to be more significant, it makes sense to select a model that has as few FNs as possible. In other words, we should use recall instead of precision.

When should you use precision, then? Many real-world datasets are not labelled, i.e. we do not know which class each observation belongs to. Is an email a spam or a ham? Is an article fake news or not? Will the customer churn? One of the key benefits of using machine learning to classify in such use-cases is to reduce the amount of human effort required. Thus, of all the observations that the model predicts as positive, we would want as many as possible to be really positive. In other words, we want our model to be as precise as possible.
In such scenarios, precision should be used over recall.

It is also possible that for your use-case, you assess that the errors caused by FPs and FNs are (almost) equally undesirable. Hence, you may wish for a model to have as few FPs and FNs as possible. Put differently, you would want to maximise both precision and recall. In practice, it is usually not possible to maximise both at the same time because of the trade-off between precision and recall.

Increasing precision will typically decrease recall, and vice versa.

So, given pairs of precision and recall values for different models, how would you compare and decide which is the best? The answer is, you guessed it, F1-score.

By definition, F1-score is the harmonic mean of precision and recall. It combines precision and recall into a single number using the following formula:

$$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$

This formula can also be equivalently written as,

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Notice that F1-score takes both precision and recall into account, which also means it accounts for both FPs and FNs. The higher the precision and recall, the higher the F1-score. F1-score ranges between 0 and 1. The closer it is to 1, the better the model.

Now that we’ve covered the fundamentals, let’s walk through the thinking process of choosing between precision, recall and F1-score.
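Before walking through that choice, it may help to check the definitions numerically. The sketch below plugs the counts from Figure 1 into the standard formulas: 80 FPs, 20 FNs, and 100 − 20 = 80 TPs. The function name `precision_recall_f1` and the convention of returning 0 when a denominator is 0 are my own assumptions for illustration, not part of any library.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Assumption: a metric whose denominator is 0 is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read off Figure 1: TP = 80, FP = 80, FN = 20.
precision, recall, f1 = precision_recall_f1(tp=80, fp=80, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: precision=0.500 recall=0.800 F1=0.615
```

Notice that the F1-score of 0.615 sits below the simple average of 0.5 and 0.8, pulled towards the lower of the two values.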
Suppose we have trained three different models for cancer prediction, and each model has different precision and recall values.

Figure 2: Model selection using precision, recall or F1-score (Animated GIF by Author)

If we assess that errors caused by FPs (Scenario #2 in Figure 1) are more undesirable, then we will select a model based on precision and choose Model C.
If we assess that errors caused by FNs (Scenario #3 in Figure 1) are more undesirable, then we will select a model based on recall and choose Model B.
However, if we assess that both types of errors are undesirable, then we will select a model based on F1-score and choose Model A.

So, the takeaway here is that the model you select depends greatly on the evaluation metric you choose, which in turn depends on the relative impacts of errors of FPs and FNs in your use-case.

I mentioned briefly that F1-score is the “harmonic mean of precision and recall”. What is meant by harmonic mean? Indeed, there are other ways of combining two numbers into one, such as arithmetic mean or geometric mean (see Figure 3 for their mathematical formulae¹).

Figure 3: Harmonic, arithmetic and geometric means of two numbers (Image by Author)

For more detailed information about harmonic, arithmetic and geometric means, I recommend the following post by Daniel McNichol.

If you do a Google search on why F1-score uses harmonic mean, you will find answers like “harmonic mean penalises unequal values more” and “harmonic mean punishes extreme values”. I struggled to understand them at first. Since I like to simplify concepts, I created an interactive 3D scatter plot in Figure 4 to help me understand better.
Essentially, this 3D scatter plot compares how harmonic, arithmetic and geometric means vary with different sets of precision and recall values. Feel free to play around with it!

Figure 4: 3D interactive chart, illustrating the respective behaviours of harmonic, arithmetic and geometric means (Image by Author)

p.s. I won’t go through the code used to produce Figure 4, since that would be outside the scope of this post, but feel free to check it out at my GitHub repo.

We can make several interesting observations from Figure 4:

The three types of means are the same if and only if precision and recall are equal. Notice how the red scatter points intersect with the blue and green scatter points only along the diagonal of the precision-recall axis (i.e. when precision = recall).
Harmonic and geometric means move further away from arithmetic mean when precision and recall are not equal. While the red scatter points form a flat plane, the blue and green scatter points form curved planes.
The more unequal precision and recall values are, the lower the harmonic mean, more so than geometric and arithmetic means. The plane represented by the blue scatter points for harmonic mean is more “curved” than those for geometric and arithmetic means.

Image by Alessio Soggetti on Unsplash

How can we make sense of these observations in a way that’s intuitive? Let’s imagine ourselves standing at the highest point of Figure 4, where precision and recall are both equal to 1 and all three types of means are 1. Now, suppose we “walk down” the slopes of the planes along the precision axis, i.e. we keep recall fixed at 1 while reducing precision from 1, to 0.95, to 0.90, to 0.85 and so on until we reach 0.05.
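This walk can also be reproduced with a few lines of Python. The following is a minimal sketch and not the code behind Figure 4. The function names are my own, and treating the harmonic mean as 0 when both inputs are 0 is an assumption made only to avoid dividing by zero.

```python
import math


def arithmetic_mean(p, r):
    return (p + r) / 2


def geometric_mean(p, r):
    return math.sqrt(p * r)


def harmonic_mean(p, r):
    # Assumption: define the harmonic mean as 0 when both inputs are 0,
    # so the function never divides by zero.
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


recall = 1.0  # held fixed during the walk
rows = []
for step in range(20):
    precision = round(1.0 - 0.05 * step, 2)  # 1.00, 0.95, ..., 0.05
    rows.append((
        precision,
        arithmetic_mean(precision, recall),
        geometric_mean(precision, recall),
        harmonic_mean(precision, recall),
    ))

print("precision  arithmetic  geometric  harmonic")
for precision, am, gm, hm in rows:
    print(f"{precision:9.2f}  {am:10.3f}  {gm:9.3f}  {hm:8.3f}")
```

Each printed row is one step of the walk, so reading down the columns shows how quickly each mean falls as precision drops.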
As we walk, precision decreases, and so do the mean values across all three types of means.

However, the mean value decreases most sharply for harmonic mean since its plane is the most “curved”. At the point where precision is 0.05, we will find ourselves lower on the plane representing harmonic mean than on the other two. Here, arithmetic mean is 0.525 and geometric mean is 0.224, but harmonic mean is only 0.095! Now, it makes more sense to me (and hopefully to you too) what it really means when people say that harmonic mean “penalises unequal values more” or “punishes extreme values”.

So, why is F1-score based on harmonic mean? Well, it’s clear that harmonic mean discourages greatly unequal values and extremely low values. We would want F1-score to give a reasonably low score when either precision or recall is low, and of the three means, harmonic mean does this most strongly. For instance, an arithmetic mean of 0.525 or a geometric mean of 0.224 when recall is 1 and precision is 0.05 probably does not convey the fact that precision is very low as clearly as a harmonic mean of 0.095 does. Also, using harmonic mean means that F1-score will be 0 if either precision or recall is 0.

Congratulations! You’ve learned that the choice between precision, recall or F1-score to evaluate models depends on the relative impacts of FPs and FNs in your use-case. In particular, if both types of errors are undesirable, F1-score would be more suitable. In addition, you have gained a better intuition behind why F1-score is based on harmonic mean. Of course, there are other evaluation metrics in machine learning but I’ve deliberately kept this post focused mainly on F1-score. Alright then, stay tuned for my next posts!