Thanks for sharing. This topic is really relevant because I see various types of independent variables mixed together all the time, e.g. numerical variables, dummy variables, categorical variables, etc.

I haven’t implemented the procedure yet, but I have a quick question after reading your post: why cast ‘category_name’ to the category type if it is eventually CountVectorized anyway?

Also, I see that you imposed a minimum document frequency of 10 on ‘name’ but not on ‘category_name’. Was that based on your experience or intuition (e.g. ‘name’ is more likely to contain typos than ‘category_name’), or on some data exploration not covered in the post?

And finally, do you worry when there are many categorical (or dummy) variables in a linear regression, or in other algorithms such as random forests? I once read the article below, and it made me think about how serious this issue is. I see dummy variables (a lot of them!) used together with numeric variables in regression analysis, and no one seems to care (at least in my field in academia). So I’d love to hear your thoughts on that.


