Hi Garet, I was looking for some NLP tools but I heard about Docker, thanks for the intro :)) Could you please recommend some beginner tools for NLP? I am trying to make sentences out of YouTube's automatic captions to build transcripts. GNU sed is just not enough anymore :D
NLTK
OK, time to finally learn Python then. I loved OOP in Objective-C!
Hmm question not answered:
Stackoverflow, how to add punctuation
That's a VERY complex problem, no easy answer - you could possibly pull it off by creating a ruleset yourself though.
Basically the rules for when to start a new sentence can be defined in terms of what comes before and after the full stop - so write that ruleset and iterate through the words.
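To make the ruleset idea a bit more concrete, here is a minimal sketch in Python. The two word lists are made up for illustration (they are assumptions, not a tested ruleset), and `add_full_stops` is a hypothetical helper name. It walks through the words and starts a new sentence when the next word looks like a sentence opener and the previous word is not one that rarely ends a sentence.

```python
# Hypothetical ruleset: tune these lists for the speaker you are transcribing.
SENTENCE_STARTERS = {"so", "okay", "anyway", "now"}       # words that often open a sentence
NEVER_END_ON = {"the", "a", "an", "and", "of", "to"}      # words that rarely close one


def format_sentence(words):
    """Join words, capitalise the first letter, and add a full stop."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


def add_full_stops(words):
    """Split a flat list of words into sentences using the ruleset above."""
    sentences = []
    current = []
    for word in words:
        previous = current[-1].lower() if current else None
        starts_sentence = word.lower() in SENTENCE_STARTERS
        # Rule: break before a starter word, unless the word before it
        # is one that should not end a sentence.
        if current and starts_sentence and previous not in NEVER_END_ON:
            sentences.append(current)
            current = []
        current.append(word)
    if current:
        sentences.append(current)
    return " ".join(format_sentence(s) for s in sentences)


if __name__ == "__main__":
    raw = "so today we talk about gridcoin so the wallet is updated and so on"
    print(add_full_stops(raw.split()))
```

A ruleset like this will still make mistakes, so the output needs a read-through, but it should cut down the amount of hand editing.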
Yeah I basically took the auto-captions of YouTube I had already cleaned up for difficult words like "grid coin" and BOINC, as a vtt subtitle file.
I got rid of the vtt timecodes with GNU tools like sed. I then loaded the vtt into TextEdit and used Cmd-F to highlight words. I noticed that @CM-Steem aka customminer uses stop words like "So" a lot, so I put periods before those.
https://steemit.com/gridcoin/@nutela/gridcoin-whaletank-rough-transcript-friday-8th-aug-2017
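For anyone who would rather do the timecode stripping in Python than in sed, here is a rough sketch. It assumes the file looks like a typical YouTube auto-caption vtt: a `WEBVTT` header, `Kind:` and `Language:` lines, timing lines containing `-->`, inline tags such as `<c>`, and the same caption line repeated in consecutive cues. The function name `vtt_to_text` is made up, and numeric cue identifiers (if the file has them) are not handled.

```python
import re

# A timing line looks like "00:00:01.000 --> 00:00:04.000 align:start";
# the hours part is optional in WebVTT.
TIMING_LINE = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->")
# Inline tags such as <c>, </c> or <00:00:01.500>.
INLINE_TAG = re.compile(r"<[^>]+>")
# Header lines to skip (assumed to match YouTube's output).
HEADER_PREFIXES = ("WEBVTT", "Kind:", "Language:", "NOTE")


def vtt_to_text(vtt):
    """Return the caption text of a vtt string as one long line of words."""
    kept = []
    for raw_line in vtt.splitlines():
        line = INLINE_TAG.sub("", raw_line).strip()
        if not line or line.startswith(HEADER_PREFIXES) or TIMING_LINE.match(line):
            continue
        # Auto-captions often repeat the previous line in the next cue.
        if kept and kept[-1] == line:
            continue
        kept.append(line)
    return " ".join(kept)


if __name__ == "__main__":
    sample = "\n".join([
        "WEBVTT",
        "Kind: captions",
        "Language: en",
        "",
        "00:00:00.000 --> 00:00:02.000 align:start position:0%",
        "so welcome to the<00:00:01.000><c> whale</c><00:00:01.500><c> tank</c>",
        "",
        "00:00:02.000 --> 00:00:04.000",
        "so welcome to the whale tank",
        "today we talk about grid coin",
    ])
    print(vtt_to_text(sample))
```

The result is a single unpunctuated string, which is the starting point for putting periods before words like "So".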
Here's the video:
I edited up to 15 mins or so.
You wouldn't believe how much text one can fill by simply talking for 15 minutes. Way too much work to do by hand.
You could try to make use of the natural pauses in speech to add the full stops as well.
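A rough sketch of the pause idea, assuming you keep the vtt timecodes instead of stripping them and have each cue as a `(start, end, text)` tuple in seconds. The 0.7 second threshold and the function names are guesses for illustration. Auto-generated captions often have cues that run back to back, so real gaps may be rarer than in this toy example.

```python
def parse_timestamp(timestamp):
    """Convert a vtt timestamp like '00:01:02.500' (or '01:02.500') to seconds."""
    parts = timestamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def punctuate_by_pauses(cues, min_pause=0.7):
    """Add a full stop after a cue when the silence before the next cue is long enough.

    cues: list of (start_seconds, end_seconds, text) tuples in playback order.
    min_pause: assumed threshold in seconds; tune it per speaker.
    """
    pieces = []
    for index, (start, end, text) in enumerate(cues):
        is_last = index == len(cues) - 1
        if is_last or cues[index + 1][0] - end >= min_pause:
            text += "."
        pieces.append(text)
    return " ".join(pieces)


if __name__ == "__main__":
    cues = [
        (0.0, 2.0, "welcome to the whale tank"),
        (2.1, 4.0, "today we talk about"),
        (4.0, 5.5, "the new wallet"),
        (6.5, 8.0, "so first the news"),
    ]
    print(punctuate_by_pauses(cues))
```

This could be combined with a word-based ruleset: a pause alone, or a starter word alone, is weaker evidence than both together.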
Hey, that's a great idea! I wonder how to get that information, though. I was hoping YouTube would offer some insight, but their tool is closed off. IBM Watson looks much cooler and even has a GitHub link, but I'm not so sure about the quality. It couldn't keep up when I tested it in real time (with Loopback), but then again real time is maybe too much to ask.