I've used N-grams before pretty successfully, and I'm really fond of the fact that they care nothing about source language or even source format if you filter the input properly. In this case, however, I'm just going with a bag-of-words solution, which is essentially a giant dictionary of word-frequency counts. That should work well enough for what I'm doing right now. I can always change the processing methodology to output a different vectorized descriptor if I really want to.
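For the curious, here's roughly what that looks like. This is a minimal sketch with a deliberately crude tokenizer; the real pipeline's tokenization rules are an assumption on my part:

```python
# A minimal sketch of the bag-of-words step: one post in, one
# {word: frequency} dictionary out. Tokenization here is crude on purpose.
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Map a post's text to a dictionary of word frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

print(bag_of_words("Votes are signals. Lots of signals."))
# Counter({'signals': 2, 'votes': 1, 'are': 1, 'lots': 1, 'of': 1})
```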
The actual dimensionality doesn't matter in this particular case because I have more than enough horsepower to throw at the problem and I'm working with sparse matrices anyway.
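To make the sparsity point concrete, here's one way that could look. scikit-learn's CountVectorizer is my stand-in here, an assumption rather than a commitment to any particular tooling:

```python
# Why dimensionality is cheap here: the vectorizer emits a scipy CSR
# matrix, which stores only the non-zero entries, so the vocabulary
# (and therefore the column count) can grow without much memory cost.
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "steem posts about steem",
    "a post about gardening",
]
matrix = CountVectorizer().fit_transform(posts)  # scipy.sparse CSR matrix
print(matrix.shape)  # (2, vocabulary size) -- columns can grow freely
print(matrix.nnz)    # only the handful of non-zero counts are stored
```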
I'm stripping out images, HTML in general, URLs, pretty much anything that's not text. Those features simply aren't interesting to me. The only features that matter about the person voting are the features of the things they're voting for.
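Something in the spirit of the following, though the exact patterns here are illustrative assumptions, not the filters I'm actually running:

```python
# Illustrative cleanup pass. Markdown images go first so their embedded
# URLs don't survive the image strip; then HTML tags, then bare URLs.
import re

def strip_non_text(raw: str) -> str:
    raw = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", raw)  # markdown images
    raw = re.sub(r"<[^>]+>", " ", raw)               # HTML tags
    raw = re.sub(r"https?://\S+", " ", raw)          # bare URLs
    return re.sub(r"\s+", " ", raw).strip()
```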
I want a filter bubble. That's the whole point. The vector space described by the upvotes is going to be bulbous enough that effectively any distance measure involving it is going to have some slop. That's more than enough to keep fuzzy-match diversity fairly high, unless someone is extremely specific about what they vote for, in which case who am I to tell them what they want?
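Concretely, the kind of fuzzy matching I mean looks something like this. The centroid-plus-cosine approach and the threshold value are assumptions for the sake of illustration, not the shipped math:

```python
# Sketch of the slop in practice: score a candidate post against the
# centroid of a voter's upvoted posts. A fatter, more "bulbous" vote
# history pulls the centroid toward the middle, so more candidates
# clear the bar; a very narrow history tightens the match.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def matches(upvoted: np.ndarray, candidate: np.ndarray,
            threshold: float = 0.3) -> bool:
    """upvoted is (n_votes, n_features); candidate is (n_features,)."""
    return cosine(upvoted.mean(axis=0), candidate) >= threshold
```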
Again, the point is to get away from the idea that anyone else has the right, ability, or insight to tell you what you like. You have emitted signals. Lots of them, actually, if you look at the features of the things you've voted up. A system shouldn't second-guess you and your preferences.
I'm a technophile, but I don't adhere to the cult of the machine. A discovery tool should do just that: discover. In this case specifically, it should be a tool for filtering the bloody firehose of posts created on a moment-to-moment basis by the Steem blockchain.
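Boiled down, the firehose filter is just a fold over the stream. In this sketch the `score` callable stands in for the strip/vectorize/compare steps sketched above, and the `"body"` field is an assumption about the post format coming off the chain:

```python
# The discovery tool in miniature: run every incoming post through
# clean -> vectorize -> score and keep only what clears the bar.
from typing import Callable, Iterable, Iterator

def discover(stream: Iterable[dict],
             score: Callable[[str], float],
             threshold: float = 0.3) -> Iterator[dict]:
    for post in stream:
        if score(post["body"]) >= threshold:
            yield post
```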
This is a long way from any sort of testing or feedback. Right now it's in the early-prototype, feeling-around-the-model, looking-for-sharp-edges sort of place.