Can’t wait to read about it! I previously used LDA for topic modeling on several hundred SEC filings, and the results were not as impressive as those in tutorials whose sources (such as tweets) really do cover a wide variety of topics. With SEC filings of one particular form type (8-K), I saw considerable overlap among the topics. I wonder if that’s because most of my sample documents concentrate on financial themes. So I’d love to see how LDA, combined with document similarity, is applied to analyze hotel descriptions.
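To put a number on the overlap I mean, one rough option is the average pairwise Jaccard similarity between the topics’ top-word lists. This is only a minimal sketch: the function names and the example word lists are made up for illustration and are not my actual 8-K results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two word collections (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_topic_overlap(topics):
    """Average pairwise Jaccard similarity across the topics' top-word lists."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top words per topic, invented for illustration only.
example_topics = [
    ["company", "financial", "agreement", "stock"],
    ["company", "financial", "officer", "board"],
    ["agreement", "credit", "loan", "company"],
]

print(round(mean_topic_overlap(example_topics), 3))
```

A value near 0 would mean the topics share almost no top words, while a value near 1 would mean they are nearly interchangeable.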
A related question: without knowing how many “topics” the documents contain, how do you determine the appropriate number of topics up front? Is it a trial-and-error process? If so, what metrics do you use to evaluate whether you are on the right track? Thanks in advance!