Calculating F-Score, which is the "positive" class, the majority or minority class?

by David Parks   Last Updated May 15, 2018 20:19 PM

I'm calculating the F-score for a sandbox dataset: 100 medical patients, 20 of whom have cancer. Our classifier misclassifies 20 healthy patients as having cancer and 5 patients with cancer as healthy; the rest it gets right.

We compute true positives, true negatives, false positives, and false negatives.

We ran into a debate about which class counts as "positive": those that test positive for cancer, or the majority class, i.e. those that are healthy.
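
To make the two readings concrete, here is a rough sketch (Python is just my scratch language here; the variable names are made up for illustration) of the confusion-matrix counts under each convention:

    # Toy dataset from above: 100 patients, 20 with cancer, 80 healthy.
    # The classifier flags 20 healthy patients as cancer and misses 5 cancer patients.

    # Convention A: "cancer" is the positive class
    tp_cancer = 20 - 5   # cancer patients correctly flagged           -> 15
    fn_cancer = 5        # cancer patients called healthy              ->  5
    fp_cancer = 20       # healthy patients wrongly flagged as cancer  -> 20
    tn_cancer = 80 - 20  # healthy patients correctly called healthy   -> 60

    # Convention B: "healthy" is the positive class (same classifier, labels swapped)
    tp_healthy, fn_healthy = tn_cancer, fp_cancer  # 60, 20
    fp_healthy, tn_healthy = fn_cancer, tp_cancer  #  5, 15

    print(tp_cancer, fn_cancer, fp_cancer, tn_cancer)        # 15 5 20 60
    print(tp_healthy, fn_healthy, fp_healthy, tn_healthy)    # 60 20 5 15

The four counts are the same numbers either way; the debate is only about which of them get called "true positives".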

Explicit Question: What is the correct true-positive rate in this dataset? Is it:

  1. # of predicted healthy patients over # of actual healthy patients
  2. # of predicted cancer patients over # of actual cancer patients

Bonus points if you can reference some literature that supports one supposition or the other.

Note: I've skimmed through a few texts on F-scores but haven't seen an explicit discussion of this point:

https://en.wikipedia.org/wiki/F1_score
http://rali.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf

Wikipedia's text on precision and recall seems to suggest that "true positive" is defined by whatever "test" is being performed, and thus in this case by the minority class, because the "test" here is for cancer. However, I don't find the discussion rigorous enough to convince me. If I simply describe the test as testing for "healthy" patients I change the F-score, but that was only a semantic change. I would expect the F-score to have a mathematically rigorous definition.

https://en.wikipedia.org/wiki/Precision_and_recall



2 Answers


Precision is the fraction that actually has cancer out of the total number of cases your classifier predicts as positive:

precision = (number of true positives) / (number of positives predicted by your classifier)

Recall (or true positive rate) is the fraction of all actual positive cases that your classifier correctly identified:

true positive rate = (true positives) / (true positives + false negatives)

Coming to the F-score, it is a measure of the trade-off between precision and recall. Let's assume you set the threshold for predicting a positive very high, say predicting positive if h(x) >= 0.8 and negative if h(x) < 0.8; then you get high precision but low recall. With the counts in this dataset, the precision is 15 / (15 + 20) ≈ 42.9% (15 true positives, i.e. the 20 actual cancer patients minus the 5 wrongly predicted as healthy, over 15 true positives plus 20 false positives).

If you want high recall [or true positive rate], it means you want to avoid missing positive cases, so you predict a positive more easily, e.g. predict positive if h(x) >= 0.3, else predict negative. Basically, having high recall means you are avoiding a lot of false negatives. Here your true positive rate is 15 / (15 + 5) = 75%.
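
If you want to reproduce those two numbers, a quick sketch (not part of the original working, just the same counts plugged into the formulas above):

    tp, fp, fn = 15, 20, 5      # counts with "cancer" as the positive class

    precision = tp / (tp + fp)  # 15 / 35 ≈ 0.429
    recall    = tp / (tp + fn)  # 15 / 20 = 0.75

    print(f"precision = {precision:.3f}, recall = {recall:.3f}")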

Having high recall for a cancer classifier can be a good thing, since you really need to avoid false negatives here. But of course this comes at the cost of precision.

The F-score measures this trade-off between precise prediction and avoiding false negatives. There is more than one way you could combine the two numbers; let's first assume it is defined as the plain average of precision and recall [true positive rate].

That is not a very good measure, because you can have a huge recall and very low precision [e.g. by predicting all cases positive] and still end up with roughly the same score as when your precision and recall are well balanced.

Instead, define the F-score as:

              2 * (Precision * Recall) / (Precision + Recall) 

Why? Because if precision or recall (or both) is very low, the F-score falls, and you'll know that something is wrong.

I would advise you to calculate the F-score, precision, and recall for the case in which your classifier predicts all negatives, and then for the actual algorithm. If the dataset is skewed you might want more training data.
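
As a rough sketch of that comparison (the prf helper and the two degenerate classifiers below are purely illustrative), using the counts from this dataset:

    def prf(tp, fp, fn):
        """Precision, recall, their plain average, and the harmonic-mean F1."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        mean_f = (precision + recall) / 2
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, mean_f, f1

    # actual classifier:         P≈0.43, R=0.75, average≈0.59, F1≈0.55
    # predict everyone positive: P=0.20, R=1.00, average=0.60, F1≈0.33
    # predict everyone negative: P=0.00, R=0.00, average=0.00, F1=0.00
    for name, counts in [("actual", (15, 20, 5)),
                         ("all positive", (20, 80, 0)),
                         ("all negative", (0, 0, 20))]:
        print(name, prf(*counts))

Note how the plain average scores the "predict everyone positive" classifier (0.60) about as well as the real one (≈0.59), while the harmonic-mean F1 drops it to ≈0.33, which is exactly the point made above.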

Also note that it is a good idea to measure the F-score on the cross-validation set. The measure defined above is also known as the F1-score.

http://arxiv.org/ftp/arxiv/papers/1503/1503.06410.pdf

https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=a+probabilistic+theory+of+precision+recall+and+f+score

Ensuis sui Pulverem
January 24, 2016 05:59 AM

I think you've discovered that the F-score is not a very good way to evaluate a classification scheme. From the Wikipedia page you linked, there is a simplification of the formula for the F-score:

$$ F_1 = \frac{2 TP}{2 TP + FP + FN} $$

where $TP,FP,FN$ are numbers of true positives, false positives, and false negatives, respectively.

You will note that the number of true negative cases (equivalently, the total number of cases) is not considered at all in the formula. Thus you can have the same F-score whether you have a very high or a very low number of true negatives in your classification results. If you take your case 1, "# of predicted healthy patients over # of actual healthy patients", the "true negatives" are those who were correctly classified as having cancer, yet that success in identifying patients with cancer doesn't enter into the F-score. If you take case 2, "# of predicted cancer patients over # of actual cancer patients", then the number of patients correctly classified as not having cancer is ignored. Neither seems like a good choice in this situation.
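
To see this numerically for the dataset in the question, here is a rough sketch (the f1 helper is mine, just plugging the counts from each labeling into the same formula):

    def f1(tp, fp, fn):
        # the number of true negatives never appears
        return 2 * tp / (2 * tp + fp + fn)

    # "cancer" as the positive class:  TP=15, FP=20, FN=5  (TN=60, unused)
    print(f1(15, 20, 5))   # ≈ 0.545

    # "healthy" as the positive class: TP=60, FP=5, FN=20  (TN=15, unused)
    print(f1(60, 5, 20))   # ≈ 0.828

The same classifier on the same patients gets two different F-scores depending only on which class you call "positive".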

If you look at any of my favorite easily accessible references on classification and regression, An Introduction to Statistical Learning, Elements of Statistical Learning, or Frank Harrell's Regression Modeling Strategies and its associated course notes, you won't find much, if any, discussion of F-scores. What you will often find is a caution against evaluating classification procedures based simply on $TP, FP, FN,$ and $TN$ values. You are much better off focusing on an accurate assessment of likely disease status with an approach like logistic regression, which in this case would relate the probability of having cancer to the values of the predictors that you included in your classification scheme. Then, as Harrell says on page 258 of Regression Modeling Strategies, 2nd edition:

If you make a classification rule from a probability model, you are being presumptuous. Suppose that a model is developed to assist physicians in diagnosing a disease. Physicians sometimes profess to desiring a binary decision model, but if given a probability they will rightfully apply different thresholds for treating different patients or for ordering other diagnostic tests.

A good model of the probability of being a member of a class, in this case of having cancer, is thus much more useful than any particular classification scheme.

EdM
January 27, 2016 16:49 PM
