Thanks, though at this point I think that beyond a very basic handful of stop words, I'm simply going to let the corpus decide which words aren't significant enough indicators to be useful guides.
If the SVD can't figure out at least that much, it really has no hope of determining which words are useful as discriminators, after all.
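Roughly the sort of check I have in mind, as a minimal sketch, assuming a scikit-learn-style TF-IDF plus truncated-SVD pipeline (the toy corpus, component count, and term-weight heuristic below are all illustrative placeholders, not the actual setup):

```python
# Minimal sketch: let the corpus itself rank term significance via SVD.
# Assumes scikit-learn and numpy; the corpus and component count are
# hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "steem posts reward good content",
    "good content earns steem rewards",
]

vec = TfidfVectorizer()            # no stop-word prior: the corpus decides
X = vec.fit_transform(docs)        # documents x terms matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

# A term's total weight across the retained components is a rough proxy
# for how much discriminating signal it carries.
weights = np.abs(svd.components_).sum(axis=0)
terms = np.array(vec.get_feature_names_out())
order = np.argsort(weights)

print("least significant terms:", terms[order[:5]])
print("most significant terms: ", terms[order[-5:]])
```

If the junk words don't sink to the bottom of that ranking, that's the failure signal I'm talking about.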
I don't necessarily agree with that. My linear algebra isn't great, but in probabilistic terms, if you have a solid prior you should build it into your model so it has to do less work.
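Something like this is what I mean by applying the prior, as a hypothetical sketch (scikit-learn's built-in English stop-word list stands in for whatever basic handful you'd choose):

```python
# Sketch of the "solid prior" idea: hand the vectorizer a known stop-word
# list up front instead of making the SVD rediscover it. Hypothetical
# example; assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

no_prior = TfidfVectorizer()                        # corpus decides everything
with_prior = TfidfVectorizer(stop_words="english")  # prior applied up front

print(len(no_prior.fit(docs).vocabulary_))    # larger vocabulary
print(len(with_prior.fit(docs).vocabulary_))  # junk words already removed
```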
In this case I literally want it to do the work to prove that it can do the work.
At least on a first pass.
If it can't do that much, if it can't at least determine which words are the least useful, then we need another method. Maybe a smarter method, maybe a dumber one, but a method that can actually determine something that elemental.
Who knows? It may be that the frequency or offset of what we would normally think of as junk words is actually useful in making some sort of determination. Discovery at this level of complexity isn't a finished science quite yet.
Though I may take a break and just try to put together something based on some sort of limited spreading-activation pass over the friend network. I'm way better with graph theory than I am with this stuff, off the cuff.
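For a rough idea of that direction, here is a minimal sketch of spreading activation over a toy friend graph, assuming networkx; the graph, decay factor, and cutoff threshold are made-up placeholders rather than a finished design:

```python
# Minimal spreading-activation sketch over a friend graph.
# Assumes networkx; graph, decay, and threshold are hypothetical.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "dave"), ("alice", "erin")])

def spread(graph, seeds, decay=0.5, threshold=0.05):
    """Push activation energy out from seed nodes, decaying per hop."""
    energy = dict(seeds)     # node -> accumulated activation
    frontier = dict(seeds)   # nodes that still have energy to pass on
    while frontier:
        nxt = {}
        for node, e in frontier.items():
            passed = e * decay / max(graph.degree(node), 1)
            if passed < threshold:   # energy too small to keep spreading
                continue
            for nbr in graph.neighbors(node):
                nxt[nbr] = nxt.get(nbr, 0.0) + passed
                energy[nbr] = energy.get(nbr, 0.0) + passed
        frontier = nxt
    return energy

print(spread(G, {"alice": 1.0}))  # activation radiating out from one seed
```

The threshold is what keeps it "limited": activation just dies out past a few hops.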
Unfortunately, I don't think there are too many people working on this sort of thing in the steem space, so I guess everyone is stuck with me.