I'm working on a problem that involves a large number of NAs. How does VW handle these? Should I impute the NAs with column means or something similar before converting the data to VW format?
To elaborate on my answer:
Let's say the first lines of your data (header and first row) are:
y, v1, v2, v3
10, 4, NA, 3
The VW string encoding of that line is:
10 |v1:4 v2:NA v3:3
As you probably discovered, v2:NA doesn't work in VW, because the part after the colon needs to be a number.
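If you want to spot such tokens before handing a file to VW, a check along these lines may help. This is only a sketch: `bad_features` is a hypothetical helper (not part of VW), and it assumes a single unnamed namespace as in the example line. Note that Python's `float` also accepts strings such as `nan`, so those would not be flagged here.

```python
def bad_features(vw_line):
    """Return feature tokens whose value after the colon is not a number."""
    # Everything after the first '|' holds the features
    # (assumption: one unnamed namespace, as in the example).
    _, _, features = vw_line.partition("|")
    bad = []
    for token in features.split():
        name, sep, value = token.partition(":")
        if not sep:
            continue  # no explicit value given, nothing to check
        try:
            float(value)
        except ValueError:
            bad.append(token)
    return bad


print(bad_features("10 |v1:4 v2:NA v3:3"))
```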
An easy solution to this is to find :NA in your VW string and replace it with _NA:
10 |v1:4 v2_NA v3:3
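This will work fine in VW, as it will internally recode v2_NA as v2_NA:1 (a feature with no explicit value gets the value 1), so it acts as an indicator that is present only when v2 is missing.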
This will allow the model to learn what happens when v2 is NA, and how that differs from the case where it is known.
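The find-and-replace step can be sketched in Python as follows. The function names `fix_na` and `row_to_vw` are made up for illustration, and the code assumes missing values are spelled exactly `NA` and that all features live in one unnamed namespace.

```python
import re


def fix_na(vw_line):
    """Turn every 'name:NA' token into the indicator feature 'name_NA'."""
    # The lookahead keeps values that merely start with 'NA' untouched.
    return re.sub(r":NA(?=\s|$)", "_NA", vw_line)


def row_to_vw(label, features, na_token="NA"):
    """Build a VW line from a label and a dict of feature name -> value."""
    parts = []
    for name, value in features.items():
        if value == na_token:
            parts.append(f"{name}_NA")        # indicator, implicit value 1
        else:
            parts.append(f"{name}:{value}")   # ordinary numeric feature
    return f"{label} |" + " ".join(parts)


print(fix_na("10 |v1:4 v2:NA v3:3"))
print(row_to_vw("10", {"v1": "4", "v2": "NA", "v3": "3"}))
```

`fix_na` patches a string you have already built, while `row_to_vw` applies the same idea while generating the line from the raw row. Which one fits depends on where in your pipeline the NAs show up.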
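You could impute medians, but it's probably a better idea to flag each missing value with its own indicator feature (such as v2_NA) rather than filling in a number.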
This will let VW build a model that predicts one thing for an NA variable and another when it is present.