Properly parse a Steemit discussion body for MiXion

in #utopian-io, 7 years ago (edited)

Details

  • If you have tried MiXion, you will have noticed that some discussions display fine while others still contain raw links and Markdown fragments.

  • The problem is that not all posts are alike: some are pure HTML, some are pure Markdown, some are plain text, and some are a combination of HTML and Markdown.

I already have a way to parse each type of content: Bypass for Markdown and the stock Html.fromHtml in Android for HTML. What I need is a reliable way to determine what exactly is in each discussion, and a way to handle those that are a combination.

Task 1

  • Write a method or class in Kotlin or Java that determines what exactly is in the discussion and parses it accordingly.
    You can play with the API and see all the types of results that we get. I found that the format tag in the JSON metadata is unreliable.

For some discussions it is easy to determine the format based on where they come from. For example, all posts from dmania begin with <center>\n <a href=\"https://dmania.lol and are all in HTML, all posts from dtube begin with <center><a href='https://d.tube, and all posts from utopian.io are in Markdown.
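
As a starting point for Task 1, the detection step could be sketched roughly as below. This is only a heuristic sketch and not code from MiXion: the names BodyFormat, detectFormat, htmlTag and markdownPatterns are hypothetical, and the tag list and Markdown patterns are assumptions that would need tuning against real API results. The two source prefixes are the ones quoted above.

```kotlin
// Hypothetical classification of a discussion body (not part of MiXion).
enum class BodyFormat { HTML, MARKDOWN, MIXED, PLAIN }

// Assumption: a small whitelist of tags that commonly show up in posts.
private val htmlTag = Regex(
    """</?(?:a|b|i|p|br|hr|div|span|center|img|strong|em|h[1-6]|ul|ol|li|table|tr|td|th|blockquote|pre|code|sub|sup|iframe)\b[^>]*>""",
    RegexOption.IGNORE_CASE
)

// Assumption: a handful of patterns that are typical for Markdown.
private val markdownPatterns = listOf(
    Regex("""(?m)^#{1,6}\s+\S"""),               // ATX headings: "## Title"
    Regex("""!?\[[^\]\n]*\]\([^\)\n]+\)"""),     // links and images: [text](url)
    Regex("""\*\*[^*\n]+\*\*|__[^_\n]+__"""),    // bold: **text** or __text__
    Regex("""(?m)^\s*(?:[-*+]|\d+\.)\s+\S"""),   // bullet or numbered list items
    Regex("""(?m)^>\s+\S"""),                    // block quotes
    Regex("```|`[^`\n]+`")                       // fenced or inline code
)

fun detectFormat(body: String): BodyFormat {
    val trimmed = body.trimStart()

    // Source-based shortcuts taken from the post: dmania and dtube are HTML.
    if (trimmed.startsWith("<center>\n <a href=\"https://dmania.lol") ||
        trimmed.startsWith("<center><a href='https://d.tube")
    ) {
        return BodyFormat.HTML
    }

    val hasHtml = htmlTag.containsMatchIn(body)
    val hasMarkdown = markdownPatterns.any { it.containsMatchIn(body) }

    return when {
        hasHtml && hasMarkdown -> BodyFormat.MIXED
        hasHtml -> BodyFormat.HTML
        hasMarkdown -> BodyFormat.MARKDOWN
        else -> BodyFormat.PLAIN
    }
}
```

The result could then decide which existing parser gets the body (Bypass for MARKDOWN, Html.fromHtml for HTML). How MIXED is best handled is still the open question. Note that plain text that happens to start a line with "1." or "-" will be classified as Markdown by this sketch.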

Task 2

  • Write a method or class that extracts only the human-readable text. This will be used for the feed summary.

I've already written this function to strip out all HTML tags. We still need one to strip out Markdown and plain links. Maybe you could combine both.

fun stripHtmlTags(html: String): String {
    val text = StringBuilder()
    var insideTag = false

    for (ch in html) {
        when {
            // '<' opens a tag: stop copying characters
            ch == '<' -> insideTag = true
            // '>' closes the current tag: resume copying
            ch == '>' && insideTag -> insideTag = false
            // everything outside a tag is kept as text
            !insideTag -> text.append(ch)
        }
    }

    return text.toString()
}
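
For Task 2, one possible way to combine HTML, Markdown and plain-link stripping is a list of regex rules applied in order. This is a rough sketch under assumptions, not a finished solution: extractReadableText and stripRules are hypothetical names, the rules only cover common Markdown constructs, and HTML entities such as &amp;amp; are not decoded.

```kotlin
// Hypothetical rule list (not part of MiXion): each pattern is replaced
// by the paired string, in this order.
private val stripRules: List<Pair<Regex, String>> = listOf(
    Regex("```[\\s\\S]*?```") to " ",                           // fenced code blocks
    Regex("""!\[[^\]\n]*\]\([^\)\n]*\)""") to " ",              // images: drop entirely
    Regex("""\[([^\]\n]*)\]\([^\)\n]*\)""") to "$1",            // links: keep only the label
    Regex("<[^>]*>") to " ",                                    // HTML tags
    Regex("""https?://\S+""") to " ",                           // bare links
    Regex("""(?m)^\s{0,3}(?:#{1,6}|>|[-*+])\s+""") to "",       // heading, quote and bullet markers
    Regex("""\*+|~~|`+|\b_+|_+\b""") to "",                     // emphasis and inline-code markers
    Regex("""\s+""") to " "                                     // collapse whitespace
)

// Reduces a mixed HTML/Markdown body to plain text for a feed summary.
fun extractReadableText(body: String): String =
    stripRules.fold(body) { acc, (pattern, replacement) ->
        pattern.replace(acc, replacement)
    }.trim()
```

The order matters: Markdown images and links are handled before tags and bare links, so that a link label survives while its URL is dropped. Underscores inside words (for example snake_case) are left alone, because only underscores at a word boundary are treated as emphasis markers.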

Current work

You can find some of the work I've already done on this problem in StringUtils.java, StringExt.kt, and the entire steemitutils directory.

Communication

You can reach me on Discord at edTheGuy00.

All proceeds from this post will go to whoever completes a task, I guess.



Posted on Utopian.io - Rewarding Open Source Contributors


Thank you for the contribution. It has been approved.

You can contact us on Discord.
[utopian-moderator]

Hey @edgar-trem I am @utopian-io. I have just upvoted you!

Achievements

  • You have less than 500 followers. Just gave you a gift to help you succeed!
  • Seems like you contribute quite often. AMAZING!

Suggestions

  • Contribute more often to get higher and higher rewards. I wish to see you often!
  • Work on your followers to increase the votes/rewards. I follow what humans do and my vote is mainly based on that. Good luck!

Get Noticed!

  • Did you know project owners can manually vote with their own voting power or by voting power delegated to their projects? Ask the project owner to review your contributions!

Community-Driven Witness!

I am the first and only Steem Community-Driven Witness. Participate on Discord. Let's GROW TOGETHER!

Up-vote this comment to grow my power and help Open Source contributions like this one. Want to chat? Join me on Discord https://discord.gg/Pc8HG9x

Technically, Markdown is supposed to support embedded HTML.

So in theory, depending on your Markdown parser, you should be able to feed everything through the Markdown parser and get valid HTML out, even if the content itself contains HTML.