See, here's the thing…
I don't care about other users.
That sounds pointed, but it is exactly true. If I wanted recommendations from other users, I would just accept up votes or some other kind of crowd indicator as my guide. Of course, we know by looking at Trending and Hot that it's a broken measure. It clearly doesn't work.
But more importantly, the opinion of other users has nothing to do with my opinion. The whole idea here is that individuation is far more effective at bringing me things that I will like than looking at what other people like and assuming that they are like me.
The Netflix recommendation system starts with a basic assumption that your registered preferences are representative of what you really want. Then they look at the registered preferences of other people and try to find the ones that are nearest to your expressed preferences in their fairly limited measure space. Then they look for things which other people near you in preference space have found desirable, check whether you've seen and rated them, and go from there.
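To illustrate the shape of that, here's a toy neighbor-based recommender. The ratings matrix is invented, and this is my sketch of the general technique, not anything like Netflix's actual implementation:

```python
import numpy as np

# Toy user-by-movie rating matrix; 0 means "hasn't rated it yet".
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

def nearest_neighbors(user, k=2):
    """Rank the other users by distance in the rating 'measure space'."""
    dists = np.linalg.norm(ratings - ratings[user], axis=1)
    dists[user] = np.inf            # don't match the user with themselves
    return np.argsort(dists)[:k]

def recommend(user):
    """Unseen items, ordered by how much the nearest neighbors liked them."""
    unseen = ratings[user] == 0
    neighbor_mean = ratings[nearest_neighbors(user)].mean(axis=0)
    return np.argsort(-(neighbor_mean * unseen))

print(recommend(0))                 # item indices, best guess first
```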
The reason they use the intermediary of other people's preferences is that they don't have the technology, processing power, or time it would take to directly find features in the movies you have rated favorably. Image processing, audio processing – those are hard problems. It's much easier and less data-intensive to look at the information which is easy for them to acquire: expressed ratings and preferences.
We don't actually have that problem. We could, if we wanted, create an eigenvector which describes the up vote tendencies, based on word frequency, of any individual's history. Then we could find the distance in our vector space of interesting words between those up vote tendencies.
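Concretely, and treating "eigenvector" loosely as an aggregate word-frequency vector per user, the comparison would look something like this (the sample texts are placeholders):

```python
from collections import Counter
import math

def preference_vector(upvoted_texts):
    """Sum word frequencies over everything a user has up voted."""
    vec = Counter()
    for text in upvoted_texts:
        vec.update(text.lower().split())
    return vec

def cosine_distance(a, b):
    """Distance between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    if not norm:
        return 1.0
    return 1.0 - dot / norm

alice = preference_vector(["deep sky astrophotography guide",
                           "apo refractor telescope review"])
bob = preference_vector(["telescope buying guide",
                         "astrophotography on a budget"])
print(cosine_distance(alice, bob))   # smaller means more similar tastes
```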
But that wouldn't really be useful for actual discovery. It would only be useful for the discovery of content which someone else has already found and up voted.
I want to go the opposite direction. I want to be able to take a new post which no one has actually seen before and compare it to the eigenvector which describes the things I have historically liked in order to determine if this new piece of content is something that I might like. No amount of information from other people will actually make that more possible. Quite the opposite. It would only pollute the signal of my preferences.
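The shape of that comparison is simple. Reusing preference_vector() and cosine_distance() from the sketch above, with a stand-in history list in place of the real database extract:

```python
# The history list is a stand-in for the real extract of my own up votes.
history = ["sparse matrix tricks for the impatient",
           "bayesian filtering in practice"]
my_vector = preference_vector(history)

def score(post_text):
    """How close a brand-new, never-voted-on post sits to my history.
    No other user's data is consulted at any point."""
    return 1.0 - cosine_distance(preference_vector([post_text]), my_vector)

print(score("a practical guide to sparse bayesian filters"))  # higher = more me
```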
Again, I am working actively to avoid an impositional process which originates with the system. I don't want the system to tell me what other people like; I want the system to be able to look at content and filter it for the things that I might like.
So far today I've managed to extract from the database my likes for an arbitrary period of time in the past, extract the text of those posts and comments, process them into word frequency lists, and I'm now preparing them to go into some kind of semantic analysis – but this is where things start getting hard.
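For a sense of the extraction step, it's roughly this shape. The database file, table names, account, and dates are all stand-ins, not my actual schema:

```python
import re
import sqlite3
from collections import Counter

# All names here are stand-ins: the real schema, account name, and date
# range come from my local mirror of the chain data.
db = sqlite3.connect("steem_mirror.db")
rows = db.execute(
    """SELECT p.body
         FROM votes v JOIN posts p ON p.id = v.post_id
        WHERE v.voter = ? AND v.ts BETWEEN ? AND ?""",
    ("my_account", "2017-01-01", "2017-06-30"),
).fetchall()

word_re = re.compile(r"[a-z']+")
frequencies = Counter()
for (body,) in rows:
    frequencies.update(word_re.findall(body.lower()))

print(frequencies.most_common(20))   # raw material for the preference vector
```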
What you're saying makes sense. I understand that you don't want to base the recommendations on other users.
I do think you need a specific way to characterize your feature vector, though. N-grams are one way, but they can blow up the dimensionality really quickly. You'll also have to think about how you want to handle images and other media. Do you also want to include some features representing the profile of the person voting?
I would also be careful of creating your own filter bubble. Maybe create two models: one for maximizing the expected value of a recommendation and another for maximizing the maximum value of a set of recommendations. You could then mix core recommendations with discovery.
I understand that these suggestions only add complexity to any system you build so I only offer them as potential improvements.
In any case, I'd be happy to help you test and give feedback for what you build.
I've used N-grams before pretty successfully, and I'm really fond of the fact that they care nothing about source language or even source format if you filter the input properly. In this case, however, I'm just going with a bag of words solution which is essentially a giant dictionary with word frequency tags. That should work well enough for what I'm doing right now. I can always change the processing methodology to output a different vectorized descriptor if I really want to.
The actual dimensionality doesn't matter in this particular case because I have more than enough horsepower to throw at the problem and I'm working with sparse matrices anyway.
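If dimensionality ever does become a concern, scikit-learn's DictVectorizer will turn those word-frequency dictionaries straight into a sparse matrix. A quick sketch:

```python
from sklearn.feature_extraction import DictVectorizer

# Each liked post is already a {word: count} dictionary from the earlier step.
docs = [
    {"sparse": 2, "matrix": 1, "tricks": 1},
    {"bayesian": 1, "filtering": 2, "sparse": 1},
]

vectorizer = DictVectorizer()        # sparse output is the default
X = vectorizer.fit_transform(docs)   # scipy.sparse matrix: posts x vocabulary
print(X.shape, X.nnz)                # big vocabulary, cheap when mostly zeros
```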
I'm stripping out images, HTML in general, URLs, pretty much anything that's not text. Those features are simply not interesting to me. The only features that are important about the person voting are the features from the things that they're voting for.
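The stripping pass is nothing fancy. A few regexes in this vein get most of the way there, though this is a naive sketch and real Markdown and HTML need more care:

```python
import re

IMG_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")   # Markdown image syntax
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")                # naive HTML tag stripper

def clean(body):
    """Reduce a post body to plain words; anything non-text gets dropped."""
    body = IMG_RE.sub(" ", body)               # images first, so their URLs go too
    body = URL_RE.sub(" ", body)
    body = TAG_RE.sub(" ", body)
    return re.sub(r"\s+", " ", body).strip()

print(clean("Look at <b>this</b> ![cat](https://img.example/cat.png)"))
# -> "Look at this"
```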
I want a filter bubble. That's the whole point. The vector space described by the up votes is going to be bulbous enough that effectively any sort of distance measure that involves it is going to involve some slop. That's more than enough to keep fuzzy match diversity fairly high – unless someone is extremely specific about what they vote for, in which case who am I to tell them what they want?
Again, the point is to get away from the idea that anyone else has the right, ability, or insight to tell you what you like. You have emitted signals. Lots of them, actually, if you look at the features of the things you have voted up. A system shouldn't second-guess you and your preferences.
I'm a technophile, but I don't adhere to the cult of the machine. A discovery tool should do just that, and in this case specifically it should be a discovery tool for filtering the bloody firehose of posts created on a moment-to-moment basis by the steem blockchain.
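The end state is something like this sketch, where stream_posts is a hypothetical stand-in for a real chain listener and the threshold is a placeholder. It reuses clean() and score() from the sketches above:

```python
THRESHOLD = 0.25   # placeholder cutoff, to be tuned against real history

def filter_firehose(stream_posts):
    """stream_posts is a hypothetical generator yielding post dicts as they
    appear on the chain; only posts near my preference vector get through."""
    for post in stream_posts():
        if score(clean(post["body"])) >= THRESHOLD:
            yield post   # surfaced for me, and only me
```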
This is a long way from any sort of testing or feedback. Right now it's in the early-prototype, feeling-around-the-model, looking-for-sharp-edges sort of place.