Introducing SteemData - A Database Layer for STEEM

furion (70) in #steemdata • 8 years ago

Why

The goal of the SteemData project is to make data from the STEEM blockchain more accessible to developers, researchers and 3rd party services.

Today, most apps use steemd as their source of data. In this context, steemd is used for fetching information about the blockchain itself, requesting blocks, and fetching recent content (e.g., new blog posts from a user, the homepage feed, etc.).

Unfortunately, steemd also comes with a few shortcomings.

Running steemd locally is very hard due to its growing RAM requirements (none of my computers are capable of running it). This means we have to rely on remote RPCs, which brings up another issue: time.

It takes a long time for a round-trip request to a remote RPC server (sometimes more than 1 second per request).

Because steemd was never intended for running queries, aggregates, map-reduce, or text search, it is not well equipped to deal with historic data. If we are interested in historic data, we have to fetch it block by block from the remote RPC, which takes a very long time.

For example, fetching the data required to create a monthly STEEM report now takes more than a week. This is simply not feasible.

Hello MongoDB

I have chosen MongoDB for this project for a couple of reasons:

Mongo is a document-based database, which is great for storing unstructured (schema-less) data.
Mongo has a powerful and expressive query language, and it can run aggregate queries and JavaScript functions directly in its shell (for example, the map-reduce pattern).
By utilizing Mongo's Oplog we can 'subscribe' to new data as well as database changes. This is useful for creating real-time applications.
Steemit Inc is already developing a MySQL-based solution, and a Microsoft SQL solution exists at http://steemsql.com/

Server

I have set up a preview version of the database as a service. You can access it at:

Host: mongo0.steemdata.com
Port: 27017

Database: Steem
Username: steemit
Password: steemit

The steemit user account is read-only.

I highly recommend RoboMongo as a GUI utility for experimenting with the database.

After you're connected, you can run queries against any collection. The examples in the Data Layout section below show what such queries look like.
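For scripted access instead of a GUI, the connection details above can be turned into a MongoDB connection URI. The following is only a minimal sketch: build_mongo_uri and get_database are hypothetical helper names, pymongo is a third-party driver that the post itself does not mention, and the sketch assumes the steemit user authenticates against the Steem database. The preview host may also no longer be reachable.

```python
from urllib.parse import quote_plus

# Connection details as published in the post (preview service).
HOST = "mongo0.steemdata.com"
PORT = 27017
DATABASE = "Steem"
USERNAME = "steemit"
PASSWORD = "steemit"


def build_mongo_uri(host, port, database, username, password):
    """Build a mongodb:// URI, percent-escaping the credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(username), quote_plus(password), host, port, database
    )


def get_database(uri, database):
    """Open the database with pymongo (imported lazily; pip install pymongo)."""
    from pymongo import MongoClient

    # Fail after 5 seconds instead of hanging if the server is unreachable.
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    return client[database]


URI = build_mongo_uri(HOST, PORT, DATABASE, USERNAME, PASSWORD)

# Usage (needs network access and pymongo):
#   db = get_database(URI, DATABASE)
#   print(db["Accounts"].find_one())
```

Keeping the URI construction separate from the connection makes it easy to swap in updated connection details later.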

Data Layout

Accounts

Accounts contains Steem Accounts and their:

account info / profile
balances
vesting routes
open conversion requests
voting history on posts
a list of followers and followings
witness votes
curation stats

Example
Find all Steemit users that have at least 500 followers, less than 50,000 SBD in cash, have set their profile picture, and follow me (@furion) on Steemit.

db.getCollection('Accounts').find({
    'followers_count': {'$gte': 500},
    'balances.SBD': {'$lt': 50000},
    'profile.profile_image': {'$exists': true},
    'following': {'$in': ['furion']},
})

Posts

Posts provide us with easy-to-query post objects, and include content, metadata, and a few added helpers. They also come with all the replies, which are also full Post objects.

A few extra niceties:

body field supports Full Text Search
timestamps are parsed as native ISO dates
amounts are parsed as Amount objects

Example
Find all Posts by @steemsports from October, which have raised at least $200.5 in post rewards and have more than 20 comments and mention @theprophet0 in the metadata.

db.getCollection('Posts').find({
    'author': 'steemsports',
    'created': {
        '$gte': ISODate('2016-10-01 00:00:00.000Z'),
        '$lt': ISODate('2016-11-01 00:00:00.000Z'),
    },
    'total_payout_reward.amount': {'$gte': 200.5},
    '$where': 'this.replies.length>20',
    'json_metadata.people': {'$in': ['theprophet0']},
})
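The $where clause in the query above runs JavaScript against every candidate document and cannot use an index. A common alternative, sketched here in Python as a filter dictionary of the kind pymongo accepts, is to test whether the array element at index 20 exists, which is true exactly when replies holds more than 20 items. The helper name longer_than is hypothetical, and the field names are taken from the example above rather than verified against the live schema.

```python
from datetime import datetime


def longer_than(field, n):
    """Match documents whose array `field` has more than n elements.

    An array has more than n elements exactly when the element at
    zero-based index n exists.
    """
    return {"{}.{}".format(field, n): {"$exists": True}}


# Same conditions as the shell query, minus the JavaScript $where clause.
posts_filter = {
    "author": "steemsports",
    "created": {"$gte": datetime(2016, 10, 1), "$lt": datetime(2016, 11, 1)},
    "total_payout_reward.amount": {"$gte": 200.5},
    "json_metadata.people": {"$in": ["theprophet0"]},
}
posts_filter.update(longer_than("replies", 20))

# Usage with a pymongo database handle `db` (not created here):
#   for post in db["Posts"].find(posts_filter):
#       print(post["created"])
```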

Example 2
Find all posts which mention meteor in their body:

db.getCollection('Posts').find({'$text': {'$search': 'meteor'}})

Operations

Operations represent the entire blockchain, as seen through a time series of individual actions, such as:

operation_types = [
    'vote', 'comment_options', 'delete_comment', 'account_create', 'account_update',
    'limit_order_create', 'limit_order_cancel',
    'transfer', 'transfer_to_vesting', 'withdraw_vesting', 'convert', 'set_withdraw_vesting_route',
    'pow', 'pow2', 'feed_publish', 'witness_update',
    'account_witness_vote', 'account_witness_proxy',
    'recover_account', 'request_account_recovery', 'change_recovery_account',
    'custom', 'custom_json'
]

Operations have the same structure as on the Blockchain, but come with a few extra fields, such as timestamp, type and block_num.

Example
Find all transfers in block 6717326.

db.getCollection('Operations').find({'type':'transfer', 'block_num': 6717326})

We get 1 result:

{
    "_id" : ObjectId("584eac2fd6194c5ab027f671"),
    "from" : "bittrex",
    "to" : "poloniex",
    "type" : "transfer",
    "timestamp" : "2016-11-14T13:21:30",
    "block_num" : 6717326,
    "amount" : "466.319 STEEM",
    "memo" : "83ad5b2c56448d45"
}

VirtualOperations

Virtual Operations represent all actions performed by individual accounts, such as:

    types = {
        'account_create',
        'account_update',
        'account_witness_vote',
        'comment',
        'delete_comment',
        'comment_reward',
        'author_reward',
        'convert',
        'curate_reward',
        'curation_reward',
        'fill_order',
        'fill_vesting_withdraw',
        'fill_convert_request',
        'set_withdraw_vesting_route',
        'interest',
        'limit_order_cancel',
        'limit_order_create',
        'transfer',
        'transfer_to_vesting',
        'vote',
        'witness_update',
        'account_witness_proxy',
        'feed_publish',
        'pow', 'pow2',
        'liquidity_reward',
        'withdraw_vesting',
        'transfer_to_savings',
        'transfer_from_savings',
        'cancel_transfer_from_savings',
        'custom',
    }

Virtual Operations have the same structure as in the steemd database, but come with a few extra fields, such as account, timestamp, type, index and trx_id.

Example:
Query all transfers from @steemsports to @furion in the past month.

db.getCollection('VirtualOperations').find({
    'account': 'steemsports',
    'type': 'transfer',
    'to': 'furion',
    'timestamp': {
        '$gte': ISODate('2016-10-01 00:00:00.000Z'),
        '$lt': ISODate('2016-11-01 00:00:00.000Z'),
    }})
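The query above hard-codes October 2016 as the date range. To make "the past month" follow the current date, the bounds can be computed before building the filter. This is a rough sketch that interprets "past month" as the previous calendar month; previous_month_range and transfers_filter are hypothetical helper names, and the field names simply mirror the shell query.

```python
from datetime import datetime


def previous_month_range(today):
    """Return (start, end) of the calendar month before `today`; end is exclusive."""
    end = datetime(today.year, today.month, 1)
    if end.month == 1:
        # January rolls back to December of the previous year.
        start = datetime(end.year - 1, 12, 1)
    else:
        start = datetime(end.year, end.month - 1, 1)
    return start, end


def transfers_filter(sender, recipient, today):
    """Build a VirtualOperations filter for transfers sent last month."""
    start, end = previous_month_range(today)
    return {
        "account": sender,
        "type": "transfer",
        "to": recipient,
        "timestamp": {"$gte": start, "$lt": end},
    }


# Reproduces the October 2016 range when evaluated in November 2016.
example_filter = transfers_filter("steemsports", "furion", datetime(2016, 11, 14))
```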

TODO

[] Historic 3rd party price feeds (partially done)
[] Add indexes based on usage patterns (partially done)
[] Parse more values into native data types
[] Create relationships using HRefs
[] Create open-source server (Python + Docker based)
[] Create open-source client libraries (Python, JS?)

Looking for feedback and testers

I would love to get community feedback on the database structure, as well as feature requests.

If you're a hacker and have a cool app idea, feel free to use the public MongoDB endpoint provided by steemdata.com.

Expansion Ideas

I would love to expand this service to PostgreSQL as well as build a https://steemdata.com portal with useful utilities, statistics and charts.

Sponsored by SteemSports

The 32 GB RAM, quad-core bare-metal server that powers SteemData has been kindly provided by SteemSports.

Don't miss out on the next post - follow me

#steem #steemd #steemit

$645.78

thebatchman (61) 8 years ago

Damn, this is some impressive work. Thanks for opening up a queryable archive for Steemit.

$1.49

7 votes

teamsteem (74) 8 years ago

This looks incredibly useful. This must have been a lot of work. Good job!

$1.28

3 votes

pnc (65) 8 years ago

Wow! This is great, @furion. With SteemData, we could query the Steem blockchain and build an accounting app for community members, or even build some tools for financial education, especially in the field of micro-financing to empower the unbanked. Congratulations.

$0.07

normalguy (51) 7 years ago

Just want to know: I was trying to access the Steemit database using RoboMongo, but it always failed to connect. Is there a new connection to the database?

$0.03

2 votes

thecryptodrive (70) 8 years ago

Well done my friend, I am proud of what you have accomplished here.

$0.02

1 vote

xeroc (70) 8 years ago

pretty impressive! good job!

$0.00

3 votes

furion (70) 8 years ago (edited)

A quick Python implementation can be seen in one of the use cases here

$0.00

4 votes

good-karma (77) 8 years ago

Great work and initiative, brother!

Is it being populated real-time?

$0.00

6 votes

andu (58) 8 years ago

I'm also interested to know how quickly the blockchain data gets added to the db, as I have some apps in the pipeline that need a refresh quicker than SteemDB's 10 minutes.

$0.00

3 votes

andu (58) 8 years ago

refreshing this in the browser: https://steemdata.com/stats seems to add up blocks every 5-10 seconds which is freaking awesome!

$0.00

1 vote

good-karma (77) 8 years ago

I think there is a queue for block addition; it looks to be 15 blocks behind or so. If @furion can clarify the exact numbers or the way it is being populated, it would be helpful.

$0.00

furion (70) 8 years ago

Operations and new Posts are near real time, new accounts will be too in the future. Everything else is delayed. Once work sharding is in place, it should be pretty fast.

$0.00

7 votes

andu (58) 8 years ago

Blocks don't always have data/transactions, so I think this is why it isn't a constant x blocks added every x seconds. But yeah, let's wait for @furion.

$0.00

eric-boucher (68) 8 years ago

Great accomplishment, thanks for sharing! All for one and one for all!!! Namaste :)

$0.00

1 vote

the-future (68) 8 years ago

I might not understand everything you are saying, but this is an impressive work @furion.

$0.00

1 vote

smysullivan (62) 8 years ago

Great work, thank you for your work on the project, I agree MongoDB should be extremely fast and able to handle the project with no issues.

$0.00

1 vote

barrydutton (74) 8 years ago (edited)

how do you know all this stuff lol, so much for my Steemit day off lol

$0.00

1 vote

smysullivan (62) 8 years ago

I work in customer service for a large bank, and they run MongoDB on the back end for accounts. There are so many cool things you can do with MongoDB.

Plus been trying to teach myself programming but not very good anymore have not been able to really work on it as of late.

$0.00

araki (63) 8 years ago

Thanks for the hard work, keep it up, Steemit on.
Upvoted, followed and resteemed. You made my day. When I see a dev tool for Steemit I feel great, @furion, even though I don't understand more than the ABCs of coding. The fact that the Steemit community is actively involved in development makes Steemit a truly decentralized blockchain.

$0.00

1 vote

avvah (50) 8 years ago

Does the preview version still work? I can't seem to be able to connect.

Host: mongo0.steemdata.com
Port: 27017

$0.00

1 vote

furion (70) 8 years ago

Please use the updated connect info from steemdata.com

$0.00

1 vote

avvah (50) 8 years ago

Ok... Thanks, I'll see if I can find that. :)

$0.00

1 vote

avvah (50) 8 years ago

Got it. Thanks so much.

$0.00

1 vote

twinner (68) 8 years ago

Well done, furion. Seems that SteemData will become my favourite SteemTool soon :-)

$0.00

furion (70) 8 years ago (edited)

I'm happy to hear :)

$0.00

dragosroua (75) 8 years ago

Impressive.

What would be the time constraints of porting this into a Firebase backend? Having a Firebase backend mirroring the Steem blockchain would allow for real-time apps without the hassle of RPC calls. Having a little bit of both worlds.

$0.00

furion (70) 8 years ago

I'm not familiar with Firebase, but I guess with a little bit of coding it's totally feasible.

I build real-time apps with Meteor, which uses MongoDB and its oplog; however, Meteor is declining in popularity these days.

$0.00

dragosroua (75) 8 years ago

I worked with Firebase for a few months (check out http://app.zentasktic.com). It's quite similar to Mongo but much faster. It's now integrated into the Google full stack of services (analytics, push notifications, admob, etc.).

$0.00

2 votes

furion (70) 8 years ago (edited)

I'm afraid it's out of my scope (I don't use any Google services, willingly at least). I would prefer to stick to open-source, self-hostable solutions for data storage.

Perhaps RethinkDB could be a candidate here?

$1.23

1 vote

dragosroua (75) 8 years ago

Got it :)

$0.00

kingscrown (81) 8 years ago

damn good job!

$0.00

barrydutton (74) 8 years ago

I have no idea what you guys are talking about lol, but I am happy you are happy (:

$0.00

gutzofter (59) 8 years ago

@furion I'm on board. Tonight

$0.00

barrydutton (74) 8 years ago

No idea about all this tech stuff but it sounds nice lol--- Good job.

$0.00

maerco (51) 8 years ago

Good job!
I will try to play with it tomorrow!

$0.00

luka.skubonja (64) 8 years ago

Good job man :)

$0.00

simonjay (69) 8 years ago

I see very interesting thanks upped

$0.00

kurtbeil (64) 8 years ago (edited)

Oh my God! This is exactly what I needed! Promises almost killed me man! ;-)

Thank you @furion!

$0.00

saramiller (73) 8 years ago

Way to go, @furion!

$0.00

abudar (66) 8 years ago

very interesting

$0.00

ekaputri (63) 8 years ago

Your posts are very useful, @furion.

$0.00

tfhg (48) 8 years ago

This is good. Don't understand fully but will reread, maybe couple times :)

$0.00

lemouth (74) 8 years ago

Impressive! :)

$0.00

steemalf (57) 8 years ago

Great job!

$0.00

choreboy (51) 8 years ago

Very cool! Think I need to give you a follow.

$0.00

andu (58) 8 years ago

How would one search for transactions above a certain amount, or transactions that were in SBD, etc.? I tried with NumberDecimal/Float and currency: "STEEM" but to no avail. @furion: what type is the amount field in the Operations / VirtualOperations table?

$0.00

furion (70) 8 years ago (edited)

The amount fields in Operations/VirtualOperations are still strings unfortunately (todo: have native types everywhere for v2).

So what you have to do is query all transactions for a time period, and then filter out the ones you need in your code.

Python Example:

from steem.amount import Amount

filter(lambda x: Amount(x['amount']).currency == 'STEEM' and Amount(x['amount']).amount > 100, db_results)
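
If the steem library is not at hand, the same client-side filtering can be sketched with the standard library alone. This assumes every amount string has the form shown in the post, a number, a space, then the asset symbol (for example "466.319 STEEM"). parse_amount and filter_by_amount are hypothetical helper names, and all sample rows except the first are made up for illustration.

```python
from decimal import Decimal


def parse_amount(text):
    """Split a string such as '466.319 STEEM' into (Decimal('466.319'), 'STEEM')."""
    value, symbol = text.split()
    return Decimal(value), symbol


def filter_by_amount(operations, symbol, minimum):
    """Keep operations in `symbol` whose amount is strictly greater than `minimum`."""
    kept = []
    for op in operations:
        value, op_symbol = parse_amount(op["amount"])
        if op_symbol == symbol and value > minimum:
            kept.append(op)
    return kept


# The first row mirrors the transfer shown in the post; the others are invented.
sample_results = [
    {"from": "bittrex", "to": "poloniex", "amount": "466.319 STEEM"},
    {"from": "alice", "to": "bob", "amount": "50.000 STEEM"},
    {"from": "alice", "to": "bob", "amount": "500.000 SBD"},
]

large_steem_transfers = filter_by_amount(sample_results, "STEEM", 100)
```

Decimal is used instead of float so that three-decimal amounts compare exactly.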

$0.00

2 votes

andu (58) 8 years ago

I see, ok, Thanks man!

$0.00

nogoud5 (25) 8 years ago

Great Work Dude! Keep up on this workflow!!!

$0.00

sailendram (25) 7 years ago

Thank you for sharing about steemdb. I am trying to understand steem operation and virtual operation. Is there any reference to these operations?
what is pow2?

$0.00

mikemeister (46) 6 years ago

Looks like this is dead?

$0.00