DIY Steemit Statistics with Python: Part 8 - Daily Hit Parades

in #howto7 years ago

Today we will be reproducing @arcange's Daily Hit Parades. There won't be any visualization-related code, but we will learn about recursive common table expressions along the way.

Preparation

Before we start, we prepare the workspace as usual (see the previous posts in the series for additional context: 1, 2, 3, 4, 5, 6, 7):

%matplotlib inline
import sqlalchemy as sa, pandas as pd, seaborn as sns, matplotlib.pyplot as plt

sns.set_style()
e = sa.create_engine('mssql+pymssql://steemit:[email protected]/DBSteem')

def sql(query, index_col=None):
    return pd.read_sql(query, e, index_col=index_col)

Top Posts by Upvotes

As we should remeber from the previous episode, SteemSQL helpfully tracks the current status of all posts and comments in the Comments table. The number of upvotes is stored in the net_votes field. Obtaining the list of top posts by upvotes at the current moment is therefore pretty straightforward:

sql("""
select top 5 author, permlink, net_votes
from Comments
where 
    cast(created as date) = '2017-09-01' -- Limit to the chosen day
    and parent_author = ''               -- Limit to posts only
order by net_votes desc
""")
author permlink net_votes
0 kingscrown epic-litecoin-day-ath-and-predictions 838
1 roelandp steemfest-updates-new-names-announced-bitfilm-... 825
2 kingscrown btc-rallies-to-5000-usd-biggest-gainers-and-lo... 728
3 minnowbooster minnowbooster-tutorial-4-leasing-delegated-ste... 614
4 jerrybanfield 19-2-update-new-backup-server-steemj-progress-... 611

To get a "beginner's hit-parade" we could limit the rating to users with a reputation below 50. Recall from part 5 of our series that we can get this value from the Accounts table.

sql("""
select top 5 author, permlink, net_votes
from Comments c
join Accounts a on a.name = c.author
where 
    cast(c.created as date) = '2017-09-01' -- Limit to the chosen day
    and c.parent_author = ''               -- Limit to posts only
    and log10(a.reputation)*9 - 56 < 50    -- Limit by reputation    
order by net_votes desc
""")
author permlink net_votes
0 jessymg in-may-romantic-ronda 372
1 logan6230 oracion-del-viernes-01-de-septiembre-de-2017 367
2 holadeivi noche-de-fotografia-increible-retratos-a-una-c... 362
3 pedrolopez with-god-everything-is-possible-dios-tiene-un-... 360
4 daneftaly lakajat-tomar-llevar-recoger 358

Top Posts by Pending Payouts

The situation is the same with pending payouts. This time we will simply sum the pending_payout_value and total_pending_payout value in the query. To add a bit more to the query, let us also construct the correct link to the post and extract its title:

sql("""
select top 5
    concat('@', author) as Author,
    concat('http://steemit.com/', 
                category, '/@', author, 
                '/', permlink) as Link,
    title as Title,
    pending_payout_value + total_pending_payout_value as Payout
from Comments c
where 
    cast(c.created as date) = '2017-09-01'
    and parent_author = ''               -- Limit to posts only
order by Payout desc
""")
Author Link Title Payout
0 @roelandp http://steemit.com /steemfest/@roelandp /steemfe... SteemFest² Updates - New Names Announced - Bit... 707.72
1 @jeffberwick http://steemit.com /conspiracy/@jeffberwick /lig... Light on Conspiracies - Don't Fall For The Hoa... 471.22
2 @timsaid http://steemit.com /photography/@timsaid /timtra... TimTravels - Preveli and Kournas Lake - Day 3 387.10
3 @kingscrown http://steemit.com /litecoin/@kingscrown /epic-l... Epic Litecoin Day - ATH and Predictions 377.23
4 @gavvet http://steemit.com /war/@gavvet /from-the-stone-... From the stone age to modern warfare in a coup... 371.35

Getting the "beginner's" version is as simple as before, so this part is left as an exercise to the reader.

Top Posts by Comments

Counting total comments for the post is a bit more interesting. The Comments table (as well as TxComments) does not immediately tell us the total number of comments. Instead, it provides the fields parent_author and parent_permlink, which point to the parent of every comment.

In particular, as you must have noted above, we always use the condition parent_author = '' to limit queries to posts only - those are essentially "comments without parents".

Here is, for example, the record corresponding to this post, note the empty parent_author:

sql("""
select author, permlink, 
       parent_author, parent_permlink 
from Comments 
where id = 11243470
""")
author permlink parent_author parent_permlink
0 konstantint diy-steemit-statistics-with-python-part-7-payouts python

The first-level children of a post have their parent_author and parent_permlink set to the author and permlink of the corresponding record. For the example post above we could extract them as follows:

sql("""
select c1.author, c1.permlink, 
       c1.parent_author, c1.parent_permlink
from Comments c0
join Comments c1 on (c1.parent_author = c0.author and
                     c1.parent_permlink = c0.permlink)
where c0.id = 11243470
""")
author permlink parent_author parent_permlink
0 neowne re-konstantint-diy-steemit-statistics-with-pyt... konstantint diy-steemit-statistics-with-python-part-7-payouts
1 stickchumpion re-konstantint-diy-steemit-statistics-with-pyt... konstantint diy-steemit-statistics-with-python-part-7-payouts
2 paulag re-konstantint-diy-steemit-statistics-with-pyt... konstantint diy-steemit-statistics-with-python-part-7-payouts

We could proceed in the same way to obtain second-level comments for the post:

sql("""
select c2.author, c2.permlink, 
       c2.parent_author, c2.parent_permlink
from Comments c0
join Comments c1 on (c1.parent_author = c0.author and
                     c1.parent_permlink = c0.permlink)
join Comments c2 on (c2.parent_author = c1.author and
                     c2.parent_permlink = c1.permlink)                     
where c0.id = 11243470
""")
author permlink parent_author parent_permlink
0 konstantint re-stickchumpion-re-konstantint-diy-steemit-st... stickchumpion re-konstantint-diy-steemit-statistics-with-pyt...
1 konstantint re-paulag-re-konstantint-diy-steemit-statistic... paulag re-konstantint-diy-steemit-statistics-with-pyt...

Now, if we need to count all comments for the post, we must somehow traverse all the levels of the hierarchy, collecting them together.

Recursive Common Table Expressions to the rescue! Read the query below carefully:

sql("""
with AllComments as (

    -- Initialize "AllComments" with
    -- a single post we want to start with:
    
    select 0 as level, author, permlink
    from Comments
    where id = 11243470
    
    -- Append children of whatever 
    -- we collected so far into the
    -- "AllComments" table:

    union all
    
    select g.level+1 as level, c.author, c.permlink
    from AllComments g
    join Comments c on (g.author = c.parent_author 
                    and g.permlink = c.parent_permlink)
)
select * from AllComments
""")
level author permlink
0 0 konstantint diy-steemit-statistics-with-python-part-7-payouts
1 1 neowne re-konstantint-diy-steemit-statistics-with-pyt...
2 1 paulag re-konstantint-diy-steemit-statistics-with-pyt...
3 1 stickchumpion re-konstantint-diy-steemit-statistics-with-pyt...
4 2 konstantint re-stickchumpion-re-konstantint-diy-steemit-st...
5 2 konstantint re-paulag-re-konstantint-diy-steemit-statistic...
6 3 paulag re-konstantint-re-paulag-re-konstantint-diy-st...

With this trick in our hat, counting all comments for each post on a given day and selecting the top-N is a matter of some additional aggregation:

sql("""
with AllComments as (
    select id, author, permlink
    from Comments
    where 
        parent_author = ''
        and cast(created as date) = '2017-09-01'

    union all
    
    select g.id, c.author, c.permlink
    from AllComments g
    join Comments c on (g.author = c.parent_author 
                    and g.permlink = c.parent_permlink)
),

AllCommentCount as (
    select id, count(*)-1 as NumComments
    from AllComments
    group by id
)

select top 5 
    c.author, c.permlink, NumComments
from AllCommentCount cc
join Comments c on cc.id = c.id
order by NumComments desc
""")
author permlink NumComments
0 rivalhw 5jdadw-steemit 217
1 roelandp steemfest-updates-new-names-announced-bitfilm-... 183
2 kaylinart steemit-bought-me-a-car-holy-cow-thank-you-so-... 180
3 kus-knee the-old-dog-asks-have-you-been-ripped-off-yet-... 175
4 timsaid timtravels-preveli-and-kournas-lake-day-3 146

That's it for now with the hit-parades. Try practicing the queries to figure out what is the most commented or the most upvoted post of all time (we already know which is the post with the highest payout of all time from the previous episode of this series).

The source code of this post is also available as a Jupyter notebook.

Sort:  

@OriginalWorks Mention Bot activated by @paulag. The @OriginalWorks bot has determined this post by @konstantint to be original material and upvoted it!

OW2.gif

To call @OriginalWorks, simply reply to any post with @originalworks or !originalworks in your message!

For more information, Click Here!

on steemit to the novices help? such as me?

I did not understand the question, but feel free to ask for specific advice here, for example.

I can see now that SteemSQL is going to be incredibly useful - amazing post, thank you! :)