Data. Data. Data.
It is the new oil. There is a gold rush on, and data is at its core. The appetite for AI systems around the world will keep growing, and feeding that growth requires data.
Not only is more data required, we also have to question who has access to it. A race is on, and companies are looking to guard what they have.
It is simply the state we are in. Corporations operate in their own self-interest, and the stakes are high given the potential: AI is going to transform humanity. We are closing in on a realm where the key metrics are no longer measured in billions. Instead, trillions will be the norm.
How do we counteract this? Unsurprisingly, it comes back to data.
Image generated by Ideogram
Harvard and Google Opening Up 1 Million Books
Harvard is actually taking a step to aid in this effort. In conjunction with Google, which scanned the books, it will be releasing 1 million public-domain books.
Before getting to that, we have to do a bit of math. According to Venice.ai, a book contains roughly the following number of tokens:
A token is typically defined as a single unit of text, such as a word or a character. Let's consider a few examples:
- A typical novel has around 50,000 to 100,000 words, which translates to approximately 250,000 to 500,000 tokens (assuming an average of 5 tokens per word).
- A non-fiction book with a length of 200 pages, assuming 250 words per page, would have around 50,000 tokens.
- A children's book with 32 pages, assuming 100 words per page, would have around 3,200 tokens.
For the sake of discussion, we will use 100K as the number of tokens in a book.
The initiative between Google and Harvard is to release 1 million books from the likes of Dante and Shakespeare.
AI training data has a big price tag, one best-suited for deep-pocketed tech firms. This is why Harvard University plans to release a dataset that includes in the region of 1 million public-domain books, spanning genres, languages, and authors including Dickens, Dante, and Shakespeare, which are no longer copyright-protected due to their age.
The new dataset isn’t available yet, and it’s not clear when or how it will be released. However, it contains books derived from Google’s longstanding book-scanning project, Google Books, and thus Google will be involved in releasing “this treasure trove far and wide.”
At 100K tokens per book, this means roughly 100 billion tokens are about to be released to the public.
This is a terrific initiative, yet it underscores the need for public data.
To quantify this: Llama 3 was trained on roughly 16 trillion tokens. That is 160 times what Harvard is releasing, i.e. 160 times this entire catalog of non-copyright books from throughout human history.
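To make the arithmetic explicit, here is a minimal sketch of the numbers used above. The tokens-per-book figure and the Llama 3 training figure are the rough assumptions stated in this post, not measured values.

```python
# Back-of-the-envelope math behind the figures above.
TOKENS_PER_BOOK = 100_000                    # assumed average, per the discussion above
BOOKS_IN_RELEASE = 1_000_000                 # the Harvard/Google public-domain release
LLAMA3_TRAINING_TOKENS = 16_000_000_000_000  # ~16 trillion, the figure cited in this post

release_tokens = TOKENS_PER_BOOK * BOOKS_IN_RELEASE
print(f"Tokens in the book release: {release_tokens:,}")       # 100,000,000,000

ratio = LLAMA3_TRAINING_TOKENS / release_tokens
print(f"Llama 3 training data vs. the release: {ratio:.0f}x")  # 160x
```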
The Importance of the Internet and Web 3.0
Let us frame this another way.
According to Venice.ai, the average article comes in around 4,000 tokens. If we look at Medium, that site publishes around 50K articles each day.
Doing some simple math, Medium is generating roughly 200 million tokens per day. At that rate, it would take about 500 days, or roughly 16 months, to match the Harvard book release.
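The same kind of sketch covers the Medium comparison. The articles-per-day and tokens-per-article figures are the rough estimates quoted above.

```python
# How long Medium's daily output would take to match the 100B-token release.
TOKENS_PER_ARTICLE = 4_000        # rough per-article estimate quoted above
ARTICLES_PER_DAY = 50_000         # rough Medium volume quoted above
RELEASE_TOKENS = 100_000_000_000  # the Harvard/Google release, from the previous section

tokens_per_day = TOKENS_PER_ARTICLE * ARTICLES_PER_DAY   # 200 million
days_to_match = RELEASE_TOKENS / tokens_per_day           # 500 days
print(f"Medium output per day: {tokens_per_day:,} tokens")
print(f"Days to match the release: {days_to_match:.0f} "
      f"(~{days_to_match / 30.44:.0f} months)")            # ~16 months
```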
Of course, while Medium could be scraped, this is not exactly public data. We see many sites starting to lock things down, including the major social media platforms.
To me, here is where blockchain enters. The need to generate public data is crucial. The amount of data used to train Llama 3 will be dwarfed by what Llama 4 requires, which will top at least 50 trillion tokens.
These AI models are hungry.
The Harvard book release is a big deal. Yet, as the numbers show, it is a drop in the bucket. While getting 100 billion tokens out into the public is a major step, it is nothing compared to what is needed.
Social media and other Internet activities are generating far larger amounts. However, it is becoming clear there is nothing public about this data.
Blockchain answers this. By now, most regular readers are aware of how important a text database like Hive is. Each day, millions of tokens are added through on-chain activity. This pales in comparison to Medium but does exemplify the potential.
Feeding Everyone
Does this potentially feed Big Tech?
Of course. There will come a time when they are swallowing up whatever data is found on blockchains. However, this is equal-opportunity data. Anyone is free to query a public API and gather what is on-chain.
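As an illustration of that equal-opportunity point, here is a minimal sketch of pulling recent posts from a public Hive API node using the condenser_api.get_discussions_by_created call. The node URL and the empty tag are just example choices; any public endpoint exposing the condenser API would work the same way.

```python
import json
import urllib.request

# Query a public Hive API node for recently created posts.
NODE_URL = "https://api.hive.blog"  # one public endpoint; others exist

payload = {
    "jsonrpc": "2.0",
    "method": "condenser_api.get_discussions_by_created",
    "params": [{"tag": "", "limit": 5}],  # most recent posts, any tag
    "id": 1,
}

request = urllib.request.Request(
    NODE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    posts = json.loads(response.read())["result"]

# Every post body is public data that anyone can collect the same way.
for post in posts:
    print(post["author"], "|", post["title"], "|", len(post["body"].split()), "words")
```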
Going one step further, to truly develop an AI system, information on a scale most cannot imagine is required. Computers excel at recognizing trends. It is why data science became such an important field. Companies all over the world depend upon pattern recognition.
Ultimately, it is going to come down, in large part, to the one with the largest database.
Naturally, this favors Big Tech. That said, if humanity starts to realize what is happening, it can alter the path we are on. By focusing on data that is open, the number of tokens generated each day grows. This is compounded by the idea of building AI agents on blockchains, adding a force multiplier to the entire equation.
Data. Compute. Energy.
These are the commodities of the future. We are looking at an expanding digital world. Networks are the new real estate boom. Filling the databases that are tied to these open systems is crucial. Compute and energy are nothing without data.
Basically, we are heading towards a power struggle. The largest entities in the world want to control this technology. Sam Altman clearly stated this. Through this process, he also wants to capture much of the value generated (his goal is to make OpenAI a $100 trillion company).
That is a lot of value in the hands of very few people.
The way to combat this is clear. OpenAI is scouring the Internet for data. It is having to enter into agreements that cost tens of millions of dollars, something the start-up world cannot do.
Democratized data is the solution. When we break it down, it is nothing more than a numbers game. That should come as no surprise since that is what computers run on. Tokens are more than cryptocurrency. They are the basic metric for the future.
How many tokens are you generating each day, and who are you giving them to? Is it Elon Musk, Sam Altman, or Mark Zuckerberg? Or are you providing them to humanity?
This is a choice you make each time you hit the enter button when online.
Posted Using InLeo Alpha
I'd be interested in having access to all this data. In the meantime, I'll still be using the Gutenberg Project - which is, of course, Web 2.0 - to access public domain books.
Data is indeed very important!