How about I write a script that goes through every article ever posted on the Steem blockchain and gives me the median length in words of the best-performing posts?
In this article, I will walk you through my thought process while I write a small script that will give us the median length in words of the top trending posts. Keep in mind that my goal is to start learning a bit more about this platform, so my process will be mostly oriented towards that.
If we split the task into smaller steps, we get:
- Go through every article
- Keep only the top 10 posts
- Count the number of words for each
- Calculate the median
Let's clarify every step by going through them one by one.
1. Go through every article
So far, I'm pretty sure that the articles are stored on the blockchain.
But I don't really know how they're stored. There are probably all kinds of things in there, like upvotes, comments, follows, maybe even files. At this point, I'm not even sure if "article" is the right term for what I am referring to in the context of Steem.
The most obvious place to start looking is probably the developer documentation. Reading through some of it makes me realize that I will need one of two things:
- Access to a web service that lets me run queries against the whole database. That service would need to have some notion of what an article is.
- A copy of the whole database so that I can run queries against it myself. I'm not yet sure how big the database is, and I'd be surprised if I could run that on my computer (a laptop that's low on disk space).
Seeing as there seem to be a few public API nodes available, I should be able to make the first option work. Remember to be respectful when making API calls to a public API, especially if it's run by a member of your community!
So let's grab `steem-js` from npm and try this out.
2. Keep only the top 10 posts
To keep this simple, I decided to use the posts featured on the '/trending' page as my top 10. Let me show you a little trick if you want to potentially avoid having to read an api doc.
Go to https://steemit.com/trending, open the devtools, select the network tab and reload the page. Filter only the XHR calls. At first I thought we'd be looking for a request with the GET method because we're asking for data. But then I remembered that this is a JSON-RPC API, so every request uses the POST method.
The second one I had in my list had the following request body:
    {
      "id": 1,
      "jsonrpc": "2.0",
      "method": "overseer.pageview",
      "params": {
        "page": "/trending/",
        "referer": "https://www.google.ca/"
      }
    }
And the third had this one:
    {
      "id": 1,
      "jsonrpc": "2.0",
      "method": "call",
      "params": [
        "database_api",
        "get_state",
        [
          "/trending/"
        ]
      ]
    }
So we're calling the `overseer.pageview` method with '/trending/' as the page parameter (along with a referer). This gives us an empty result list.
But we also call the `call` method with `get_state` and `['/trending/']` as arguments. This gives us an object with a bunch of interesting fields.
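To make the shape of that second request concrete, here is a minimal sketch of how a script could build the same body and POST it by hand. The helper names `buildCallBody` and `sendRpc` are made up for illustration, `nodeUrl` stands for whichever public API node you pick, and `sendRpc` assumes a runtime with a global `fetch` (Node 18+ or a browser). The steem-js library takes care of this for us, so treat it as an illustration only.

```javascript
// Build a JSON-RPC 2.0 body with the same shape as the `call` request
// seen in the network tab. (Hypothetical helper, not part of steem-js.)
function buildCallBody (api, method, args, id) {
  return {
    id: id === undefined ? 1 : id,
    jsonrpc: '2.0',
    method: 'call',
    params: [api, method, args]
  }
}

// POST a body to a node and return the `result` field of the response.
// `nodeUrl` is a placeholder supplied by the caller.
// Assumes a global fetch (Node 18+ or a browser).
async function sendRpc (nodeUrl, body) {
  const response = await fetch(nodeUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  })
  if (!response.ok) throw new Error('HTTP ' + response.status)
  const payload = await response.json()
  if (payload.error) throw new Error(payload.error.message)
  return payload.result
}

// Only build and print the body here; no network call is made.
const trendingBody = buildCallBody('database_api', 'get_state', ['/trending/'])
console.log(JSON.stringify(trendingBody, null, 2))

// Usage, with a node URL of your choosing:
// sendRpc(nodeUrl, trendingBody).then(state => console.log(Object.keys(state)))
```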
I assume the one we're after is `content`. It's a map whose keys are in the format `:user/:title`, which I assume are posts.
Each post has a field called `body`. It's a markdown string. Great!
But wait. That string seems to be the same length for every post. I'm assuming that's because this field is used to display a post's preview. So we'll have to make a separate query to get the full content of each post.
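As a rough illustration of that structure, the sketch below walks a `content` map and lists each entry along with the length of its `body`. The `sampleState` object is invented stand-in data, not a real API response, and `listPosts` is a hypothetical helper. Only the field names `content`, `author`, `permlink` and `body` are taken from the real response.

```javascript
// Invented stand-in for the object returned by get_state.
// The real response has many more fields per post.
const sampleState = {
  content: {
    'alice/my-first-post': {
      author: 'alice',
      permlink: 'my-first-post',
      body: 'First preview text...'
    },
    'bob/hello-steem': {
      author: 'bob',
      permlink: 'hello-steem',
      body: 'Other preview text...'
    }
  }
}

// Turn the content map into a flat list that is easier to inspect.
function listPosts (state) {
  return Object.keys(state.content).map(function (key) {
    const post = state.content[key]
    return {
      key: key,
      author: post.author,
      permlink: post.permlink,
      bodyLength: post.body.length
    }
  })
}

const listed = listPosts(sampleState)
console.log(listed)

// If every body has the same length, we are probably looking at previews.
const distinctLengths = new Set(listed.map(post => post.bodyLength))
console.log(distinctLengths.size === 1
  ? 'all bodies have the same length'
  : 'body lengths differ')
```

The `author` and `permlink` pair is what a follow-up query for the full post needs.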
3. Count the number of words for each
Let's write a little script that calls `get_state` with the `/trending/` route, queries `get_content` for each resulting post (mapping the content body) and reduces the content to a word count. We can then get some stats from those word counts, like lowest, highest and our median.
For this we'll need a few libraries; no need to reinvent the wheel here.
I'm gonna use `remove-markdown` to strip the markdown string of tags like links and images, and `mathjs` for its `median` function. I'm also going to use `run-parallel` to make concurrent calls and save some time.
I chose the `/trending` route to keep things simple for now. There's probably a better way to do this, but it's going to be good enough for our little exercise.
    var steem = require('steem')
    var removeMarkdown = require('remove-markdown')
    var math = require('mathjs')
    var parallel = require('run-parallel')
    module.exports = fetchMedian
    // Fetch the trending posts and pass word-count stats to callback(err, stats)
    function fetchMedian (callback) {
      steem.api.getState('/trending/', function (err, result) {
        if (err) return callback(err)
        // result.content maps ':user/:title' keys to post previews
        var posts = Object.keys(result.content).map(key => result.content[key])
        // One job per post, each fetching the full content
        var jobs = posts.map(fetchContentJob)
        parallel(jobs, function (err, contents) {
          if (err) return callback(err)
          var wordCounts = contents.map(content => countWords(content.body))
          callback(null, {
            lowest: wordCounts.reduce((a, b) => Math.min(a, b), Infinity),
            highest: wordCounts.reduce((a, b) => Math.max(a, b), -Infinity),
            median: math.median(wordCounts)
          })
        })
      })
    }
    // Wrap a getContent call in the callback-taking function run-parallel expects
    function fetchContentJob (post) {
      return done => steem.api.getContent(post.author, post.permlink, done)
    }
    // Strip the markdown, then count the non-empty chunks between single spaces
    // (primitive: newlines are not treated as separators)
    function countWords (text) {
      var plainText = removeMarkdown(text)
      return plainText.split(' ')
        .filter(word => word.trim().length > 0)
        .length
    }
4. Calculate the median
Let's see what this gives us!
We can add a few lines to our script so we can use it from the command line:
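    if (require.main === module) {
      // Run from the command line: print only the stats, not the null error
      fetchMedian(function (err, stats) {
        if (err) throw err
        console.log(stats)
      })
    } else {
      module.exports = fetchMedian
    }
    // ...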
And now we can simply call it like this: *drum rolls*
    $ node trending-count
And our result:
    {
      lowest: 12,
      highest: 2722,
      median: 337.5
    }
Mission accomplished!
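In case the fractional median looks odd: with an even number of values, the median is the mean of the two middle ones. Here's a minimal hand-rolled sketch of that calculation. The `median` helper below is purely illustrative and is not the mathjs implementation the script uses.

```javascript
// Median of a list of numbers: the middle value once sorted, or the
// mean of the two middle values when the list has an even length.
function median (values) {
  if (values.length === 0) {
    throw new Error('median of an empty list is undefined')
  }
  // Copy before sorting so the caller's array is left untouched
  const sorted = values.slice().sort(function (a, b) { return a - b })
  const middle = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1
    ? sorted[middle]
    : (sorted[middle - 1] + sorted[middle]) / 2
}

console.log(median([5, 1, 3]))    // 3
console.log(median([4, 1, 3, 2])) // 2.5
```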
Of course, this being an exercise, there is little scientific value in the data we just obtained. Here are some things one might want to look into to improve this:
- Define what we mean by a top post. Is it top grossing? Is it number of comments? There are many different metrics we could decide to look at.
- Get more data. We only looked at the posts on a single trending page, which is far too small a sample to draw conclusions from.
- Filter out noise. The post with the lowest word count had 12 words. I somehow doubt that this is meaningful data. My guess is it's probably a link or a piece of media of some sort.
- Track the data over time. The results could vary wildly from one day to the next. Having data points over multiple weeks would give us a better idea of what's really happening.
- Refine the word count function. The way we are counting words is very primitive. It might work if an article is written in a predictable way, but by quickly glancing at the contents I noticed it's not that simple. For example, this very post has a bunch of code snippets in it. What should we do with those? Do the words in the code count in our total number of words? Answering questions like this would help us refine our methodology and give us more meaningful results.
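As one possible starting point for that last item, here's a sketch of a word counter that skips code and splits on any whitespace instead of single spaces. It's deliberately dependency-free, so it doesn't strip other markdown the way `remove-markdown` does. Both the `countProseWords` name and the decision to ignore code entirely are assumptions made for the sake of the example.

```javascript
// Three backticks, built with repeat() so this example never contains
// a literal code fence of its own.
const FENCE = '`'.repeat(3)

// Matches a fenced code block, from an opening fence to the next fence.
const fencedBlock = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g')

// Matches inline code such as `foo()` on a single line.
const inlineCode = /`[^`\n]*`/g

// Count words in markdown text, ignoring fenced blocks and inline code.
// An unclosed fence is left as-is and its words are counted.
function countProseWords (markdown) {
  return markdown
    .replace(fencedBlock, ' ')
    .replace(inlineCode, ' ')
    .split(/\s+/)
    .filter(function (word) { return word.length > 0 })
    .length
}

const sample = [
  'Hello world, this is prose.',
  FENCE + 'js',
  'var x = 1',
  FENCE,
  'Call `foo()` twice.'
].join('\n')

console.log(countProseWords(sample))
```

Whether code should count toward the total is a methodology question, so a real version might take that choice as an option.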
This was a very interesting little experiment. I hope it gave you a better idea of how we can approach challenges like this!
Let me know what your thoughts are in the comments! See you next time.
Great work! I see you are getting up to speed with steem JS.
I believe that the getDiscussionsByTrending api method could bring up to 100 posts with their full bodies. That would save you a whole bunch of calls right now :)
https://github.com/steemit/steem-js/blob/master/doc/README.md#get-discussions-by-trending
Hey thanks! :)
I like this...I just lost you at checking inside the console XHR calls.
I got there but don't know where you got the rest of the info.
Maybe I could cover it in more details in a future post?
I find it really interesting to try to write down how I do things while doing them. I think that was the most challenging part.
maybe screenshots would help?
To my eyes this looks like a lot of work.
Well done
I would be more interested in the time of day the posts are made and the payout.
I love the part where the voice reads out javascript code 😂 😂 😂
Nice work on this bot!