Please explain how NaNs are treated in pandas, because the following logic seems "broken" to me. I tried various ways (shown below) to drop the empty values.
My dataframe, which I load from a CSV file using read_csv, has a column comments, which is empty most of the time.
marked_results.comments looks like this. All the rest of the column is NaN, so pandas loads empty entries as NaN. So far so good:
    0      VP
    1      VP
    2      VP
    3    TEST
    4     NaN
    5     NaN
    ....
Now I try to drop those entries, only this works:
All these don't work:
marked_results.comments.dropna() only gives the same column. Nothing gets dropped, which is confusing.
marked_results.comments == NaN only gives a series of all Falses. Nothing was NaN... confusing.
marked_results.comments == nan
I also tried:
    comments_values = marked_results.comments.unique()
    array(['VP', 'TEST', nan], dtype=object)
    # Ah, gotcha! So now I've tried:
    marked_results.comments == comments_values
    # but still all the results are Falses!!!
You need to test for NaN with the math.isnan() function (or numpy.isnan). NaNs cannot be checked with the equality operator.
    >>> from math import isnan
    >>> a = float('NaN')
    >>> a
    nan
    >>> a == 'NaN'
    False
    >>> isnan(a)
    True
    >>> a == float('NaN')
    False
Help Function ->
    isnan(...)
        isnan(x) -> bool
        Check if float x is not a number (NaN).
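To make the point about equality concrete, here is a minimal sketch using only the standard library. The `is_nan` helper is a hypothetical name introduced for illustration; it is there because `math.isnan` raises `TypeError` on strings, which matters for an object column that mixes text and NaN.

```python
import math

nan = float("nan")

# NaN is never equal to anything, not even to itself.
eq_self = nan == nan           # False
ne_self = nan != nan           # True

# A dedicated predicate is the reliable check.
detected = math.isnan(nan)     # True


def is_nan(value):
    """Hypothetical helper: True only for a float NaN, False for any other object."""
    return isinstance(value, float) and math.isnan(value)


# math.isnan raises TypeError on strings, so guard mixed object data.
flags = [is_nan(v) for v in ["VP", "TEST", nan]]
print(eq_self, ne_self, detected, flags)
```

This also explains why comparing against the `nan` found in the `unique()` result cannot work: the comparison is still `==`.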
You should use isnull and notnull to test for NaN (these are more robust with pandas dtypes than the numpy functions); see "values considered missing" in the docs.
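As a rough illustration, the sketch below rebuilds a small stand-in for the column in the question (the values are assumed from the post, not taken from the real CSV) and contrasts the mask methods with the equality test that returned all False.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's column (values assumed from the post).
comments = pd.Series(["VP", "VP", "VP", "TEST", np.nan, np.nan], name="comments")

missing = comments.isnull()      # True where the value is NaN
present = comments.notnull()     # the inverse mask

# Equality against NaN is False everywhere, which is why == found nothing.
via_equality = comments == np.nan

kept = comments[present]         # boolean indexing keeps only the real values
print(missing.tolist())
print(via_equality.any())
print(kept.tolist())
```

Indexing with the `notnull` mask gives the same values as `dropna` on the Series, and the mask can also be reused to filter the whole dataframe.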
Using the Series method dropna on a column won't affect the original dataframe, but it does what you want:
    In : df
    Out:
      comments
    0       VP
    1       VP
    2       VP
    3     TEST
    4      NaN
    5      NaN

    In : df.comments.dropna()
    Out:
    0      VP
    1      VP
    2      VP
    3    TEST
    Name: comments, dtype: object
The DataFrame method dropna has a subset argument (to drop rows which have NaNs in specific columns):
    In : df.dropna(subset=['comments'])
    Out:
      comments
    0       VP
    1       VP
    2       VP
    3     TEST

    In : df = df.dropna(subset=['comments'])
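The difference between the two dropna calls is easier to see with a second column. The sketch below is only an illustration: the `score` column is hypothetical and was added so that `subset` has something to ignore.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "comments": ["VP", "VP", "VP", "TEST", np.nan, np.nan],
    "score": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],  # hypothetical second column
})

# Series.dropna returns a new Series; df itself is left unchanged.
only_comments = df.comments.dropna()

# subset limits the NaN check to the listed columns, so row 1 survives
# even though its score is missing.
cleaned = df.dropna(subset=["comments"])

# Without subset, a NaN in any column drops the row.
strict = df.dropna()

print(len(df), len(only_comments), len(cleaned), len(strict))
```

None of these calls modifies `df` in place, which is why the answer reassigns the result with `df = df.dropna(subset=['comments'])`.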