I actually don't care about topics in common usage. "Topics" is the term that gensim uses to refer to each dimension of differentiation in the vector space of the abstraction which summarizes each individual document.
Essentially, I want to create a high dimensional vector which describes/summarizes/distills the documents into something that can be given a distance. Each eigenvector is really a descriptive point in high dimensional space where that document is located.
I'm trying to figure out, given a set of documents, handed a brand-new document that's never been seen before, how far is that document from the cloud of points which represent the documents I already have?
Ultimately, it's intended to be a lens through which you can look at posts to the steem blockchain and get an estimation of how likely any individual post is to be something that you would be interested in upvoting/reading.
Essentially I'm trying to run a classification system in reverse. That makes it an interesting problem.
OK, I see. That will be useful; there is too much noise to filter out otherwise.
Summarizing the documents is probably the difficult part: you need a representation that is good, small, and accurate enough to filter on.
One of the big problems with Steemit as I see it is the fact that trying to find content that you're interested in is like sipping from a fire hose. One directed straight into your face.
My current plan, at least in this sketch code, is to fetch the text of everything that you've uploaded over the last time period (currently 30 days), tokenize it, and create a set of eigenvectors which describe all of those documents. Magic goes here in determining an average eigenvector. Then I take any post to the blockchain, process it through the same vector space, and see how close it is to the things that you've liked. If it's over some arbitrary threshold, I let you know about it.
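To make the shape of that plan concrete, here is a minimal sketch under loud assumptions. It uses plain term-count vectors as a stand-in for the gensim projection (no LSI/eigenvector step is performed here), and every name in it (`vectorize`, `average_vector`, `is_interesting`, the `0.3` threshold) is hypothetical rather than taken from the actual sketch code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vectorize(text):
    """Sparse term-count vector; a stand-in for the real vector-space projection."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as term -> weight mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """The 'magic goes here' step: a plain component-wise mean of the vectors."""
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def is_interesting(post_text, profile, threshold=0.3):
    """Flag a post whose similarity to the profile meets the (arbitrary) threshold."""
    return cosine(vectorize(post_text), profile) >= threshold


# Hypothetical texts standing in for the last 30 days of liked documents.
liked = [
    "Bitcoin mining difficulty and hash rate",
    "Ethereum smart contracts and gas fees for mining",
]
profile = average_vector([vectorize(text) for text in liked])
print(is_interesting("New bitcoin mining hardware boosts hash rate", profile))
print(is_interesting("My favourite sourdough bread recipe", profile))
```

Swapping `vectorize` for a projection into a reduced space would leave the rest of the flow unchanged, which is the main reason for keeping the steps separate.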
I actually have my code set up to allow me to switch between n-gram extractions and word tokens on a whim, so at least I'll be able to test both of those to see if one consistently gives me better stuff than the other.
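One way such a switch could look, purely as an illustration: the sketch below assumes word-level n-grams (the thread does not say whether the n-grams are over words or characters), and the names `word_tokens`, `word_ngrams`, and `extract_features` are made up for the example.

```python
import re


def word_tokens(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(text, n=2):
    """Sliding window of n consecutive words, joined with spaces."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Registry of extraction strategies, so the rest of the pipeline never changes.
EXTRACTORS = {"words": word_tokens, "ngrams": word_ngrams}


def extract_features(text, mode="words"):
    """Run whichever extractor is selected by `mode`."""
    return EXTRACTORS[mode](text)


print(extract_features("Mad science, fewer explosions", mode="words"))
print(extract_features("Mad science, fewer explosions", mode="ngrams"))
```

Because both extractors return a flat list of strings, anything downstream that counts features can be compared across the two modes without modification.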
This is experimental programming. It's like mad science but with slightly fewer explosions.
Hmm, are you sure the average eigenvector thing would work?
Is your plan just to compute the LSA on the set of your own posts (in a 30 day period)? And then determine how close other posts are? I guess the dataset will most likely be too small. Instead of filtering noise, the LSA might even enhance it unless you are a posting machine (could this explain why you end up with quite common words in your topics?).
Or do you wish to compute the LSA on many posts (let's say all Steemit publications of last month) and try to infer an average representation of the subset of your own posts? Even then I don't know if this works. What would happen if half of your posts are about cryptocurrency and the other half about vaccines (:-0). Presumably, these would be projected into different parts of the LSA space. The average would be meaningless here (maybe something like prepper homeopathy?). Maybe it's better to compute the similarity to all of your recent posts individually at first and then take the average, or median, or some percentile to determine if it's worth reading and may cater to your interests.
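The concern about averaging can be shown with a toy example. This is only a sketch: the two-dimensional vectors are invented stand-ins for positions in the LSA space (one axis per interest), and the function names are hypothetical.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of the vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centroid_score(candidate, own_posts):
    """Similarity of the candidate to the single averaged interest vector."""
    return cosine(candidate, centroid(own_posts))


def per_post_score(candidate, own_posts, aggregate=max):
    """Similarity to every post individually, then reduced by `aggregate`."""
    return aggregate([cosine(candidate, post) for post in own_posts])


# Invented data: axis 0 = cryptocurrency, axis 1 = vaccines.
own_posts = [[1.0, 0.0], [0.0, 1.0]]
on_topic = [1.0, 0.0]    # a pure cryptocurrency post
in_between = [1.0, 1.0]  # a post sitting in the middle ground

for name, post in [("on_topic", on_topic), ("in_between", in_between)]:
    print(
        name,
        round(centroid_score(post, own_posts), 3),
        round(per_post_score(post, own_posts, max), 3),
        round(per_post_score(post, own_posts, statistics.median), 3),
    )
```

In this toy setup the centroid ranks the in-between post above the on-topic one, while the per-post maximum does the opposite. With interests split evenly between two topics the median also favours the in-between post, so a high percentile or the maximum may be the safer aggregate; real data could behave differently.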
If you are still in favor of averaging your posts and computing your interest vector directly, another approach could be to take a look at Doc2Vec. There, the average or sum of word and document vectors seems to work kinda well. Still, as before, you might end up somewhere in the Doc2Vec space that is just the empty middle ground between different posts of yours. Moreover, Doc2Vec is incredibly data hungry and requires tens of thousands or, better, hundreds of thousands of documents. Fortunately, as you said, Steemit is a fire hose, so that should be the least of your concerns.
I still haven't fully grasped what you are trying to do, so sorry if I misunderstood you. Anyway, I'm curious how your experiment progresses because I want to do something similar. I started with a bot that predicts payouts of posts and I am curious if the LSA part of my bot could potentially be used for content recommendations as well.