Deepseek made headlines with its claims about its model and the cost to train it. We saw a massive shift along with volatility in the financial markets as many questioned the path Big Tech is taking.
Are they now obsolete? Not exactly.
That said, the Deepseek team did accomplish some genuine engineering feats. However, the full accuracy of the claims is being disputed.
One irony that came out of this was OpenAI pushing back, stating that Deepseek was the result of distilling ChatGPT. This is funny since OpenAI trained its early versions on nothing more than data scraped from different websites.
Talk about the pot calling the kettle black.
Today, the practice is diminishing a bit since many sites are locking things down via robots.txt, the file that tells crawlers what they may access.
As always, this centers, in large part, on the quest for data. It is an ongoing battle.
Hive Can Supplement The Distillation Process
Hive and other permissionless databases can help in this endeavor.
There are basically two choices: have data in the hands of Big Tech, i.e. those with the major platforms, or distribute the data onto networks that allow open access.
Hive is one such network. Anyone is free to set up an API and interact with the databases.
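To make the "anyone can set up an API call" point concrete, here is a minimal sketch of querying a public Hive node. It assumes the publicly documented api.hive.blog endpoint and the condenser_api.get_accounts JSON-RPC method; verify both against the current Hive developer documentation before relying on them.

```python
import json
import urllib.request

def build_hive_request(method, params):
    # Standard JSON-RPC 2.0 envelope used by Hive nodes.
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": 1,
    }

def fetch(payload, url="https://api.hive.blog"):
    # Network call kept separate so the payload can be inspected offline.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Build (but do not send) a request for the "hiveio" account's metadata.
payload = build_hive_request("condenser_api.get_accounts", [["hiveio"]])
print(json.dumps(payload))
```

No key or permission is required to read; that open access is the point being made about permissionless databases.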
Before getting into that, let us look at what Deepseek did with regard to distillation.
This came from VeniceAI:
Distillation in AI, often referred to as knowledge distillation, is a technique that transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). This process allows the smaller model to maintain similar performance while being easier to deploy and requiring less computational power.
Basically, ChatGPT was queried about a range of topics, providing output that was accumulated by the team. It was a process that was likely repeated millions of times, pulling in a great deal of data.
This helped to solve the problem of data. It was also probably structured in a reasoned form since the larger models all are shifting towards that.
The engineers behind Deepseek were able to create their own algorithms and weights to train the model on that data. Other sources could have been drawn upon and integrated. The result follows a reasoning format similar to the one OpenAI built, with the student mimicking the teacher.
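The teacher-student idea above can be sketched in a few lines. This is a toy illustration of the core of knowledge distillation, not Deepseek's actual training code: the student is nudged to match the teacher's softened output distribution, here reduced to computing the distillation loss for a single example.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # "dark knowledge" about which wrong answers are nearly right.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions;
    # real training would backpropagate through this over millions of prompts.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [3.0, 1.0, 0.2]   # large model's logits for one query
student = [2.5, 1.2, 0.4]   # smaller model's logits for the same query
print(round(distillation_loss(teacher, student), 4))
```

The loss is zero only when the student reproduces the teacher exactly, which is precisely the mimicry the article describes.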
Proprietary Data
We are still in a world of proprietary data. Even though Deepseek is open source, the data used is not being shared. This is the case for other open models such as Llama. The weights are there for all to see but the data is not moving from Meta servers.
The results of the prompts instigated by Deepseek are available to both that company and OpenAI. Nobody else has access.
Hence, one of the reasons to use distillation is to get data rapidly, even if synthetic. Reasoning models are showing that they can excel in particular areas on this type of data. It was something that was hotly debated until recently. The reasoning models are getting high marks using synthetic data in cases where there is a correct answer. This means topics such as math and physical sciences are seeing results.
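Why synthetic data works "where there is a correct answer" is easy to show: such data can be generated and verified mechanically. A minimal sketch, using trivially simple arithmetic as a stand-in for math and science problems:

```python
import random

def make_sample(rng):
    # Every synthetic sample carries its correct answer by construction.
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return {"prompt": f"What is {a} + {b}?", "answer": a + b}

def verify(sample, proposed_answer):
    # Unlike "artsy" tasks, correctness here is a simple equality check,
    # so model outputs can be graded automatically at scale.
    return proposed_answer == sample["answer"]

rng = random.Random(42)
dataset = [make_sample(rng) for _ in range(5)]
print(all(verify(s, s["answer"]) for s in dataset))
```

No human labeler is needed, which is exactly why reasoning models can train on this kind of data at scale while creative writing cannot be verified the same way.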
Where they can fall short is in the more "artsy" areas. Here is where the human touch is still required. Therefore, the need for more than just synthetic data is crucial.
For this reason, I think that social media platforms feeding into these models have an advantage. Under this scenario, we are seeing a combination of human and synthetic. In fact, the humans are often engaging with the data.
The VeniceAI Model
VeniceAI made news last week with the release of their token. This is the first token tied to a platform that is exclusively a chatbot. Perhaps the timing was not the best, given that the markets are being crushed by the tariffs implemented by the Trump administration.
That said, we are seeing the idea of AI and crypto starting to play out. VeniceAI uses its token for access. Holders can stake it, which equates to VCU (Venice Computer Units), allowing for interaction with the chatbot. It is similar to the resource credit system on Hive.
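The stake-for-access mechanic can be modeled in a few lines. The numbers below are purely illustrative assumptions, not VeniceAI's actual formula or Hive's resource credit math: staked tokens grant a budget of compute units that each prompt draws down.

```python
class StakedAccess:
    # Assumed conversion rate for illustration only.
    UNITS_PER_TOKEN_PER_DAY = 10

    def __init__(self, staked_tokens):
        self.staked = staked_tokens
        # Daily compute-unit budget scales with stake.
        self.units = staked_tokens * self.UNITS_PER_TOKEN_PER_DAY

    def spend(self, cost):
        # A prompt succeeds only if enough units remain in the budget.
        if cost > self.units:
            return False
        self.units -= cost
        return True

account = StakedAccess(staked_tokens=5)        # 50 units under the assumption
print(account.spend(30), account.spend(30))    # second prompt exceeds budget
```

The design choice in both systems is the same: access is rationed by stake rather than by per-query payment, so heavier users simply hold and stake more.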
Where VeniceAI takes a different approach is that it is privacy driven. According to the team, they do not store the data from prompts; the system is designed so that it is stored locally in the browser. We do have to bear in mind that we only have their word, since there is no proof this is the case.
Presuming they are being genuine, this adds a different spin. The privacy feature is valuable, but it also counters the idea of more data being generated. In other words, that data is lost. It is understandable that not all prompts should be public. However, if we think about what most people discuss, it is not exactly high-security secrets.
The spectrum of AI and data is huge. There is room for plenty of services to appear. While some might focus upon privacy, others can concentrate on public data. This is where permissionless, decentralized databases can assist.
The Democratization of Data
We are seeing a push by the AI participants to get "as many eggs in their baskets" as possible. It likely is a "mine" world.
To me, this simply favors the major players. We are not going to keep pace with Big Tech by feeding them more. Sure, OpenAI will complain when distillation is done using their model, but that changes nothing. Even Deepseek is still 7 or 8 months behind, according to many estimates.
The compute problem is tough to overcome. This is compounded by the data issue. Basically, the models improve as more compute (and data) is put forth. A 10x in compute is going to have a major impact, albeit not on a 1:1 ratio. Under this circumstance, perhaps a 3x in the model is realized.
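The "10x compute, maybe 3x model" observation implies a sublinear power law. A back-of-envelope check, assuming capability scales as compute raised to some exponent alpha:

```python
import math

# If capability ~ compute**alpha, then a 3x gain from 10x compute
# implies alpha = log(3)/log(10), roughly 0.48 -- strongly sublinear.
alpha = math.log(3) / math.log(10)

def capability_gain(compute_multiplier, alpha=alpha):
    return compute_multiplier ** alpha

print(round(alpha, 2))                 # ~0.48
print(round(capability_gain(10), 2))   # 3.0, by construction
print(round(capability_gain(100), 1))  # ~9x gain from 100x compute
```

The takeaway is the article's point in numbers: each further multiple of capability demands a disproportionately larger multiple of compute, which is exactly why the race favors those who can keep buying it.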
It is a back and forth race.
We have the battle for more of both compute and data. Then we have innovation, with each level of the stack drawing attention. Algorithms are improving and capabilities are expanding, making existing data, synthetic or otherwise, more valuable.
And then we have the expected move into embedded AI, where large volumes of data will be acquired through the sensors on cars, robots, and other devices that move through our environment.
Here again, we have something that will likely benefit Big Tech since it takes a lot to get into the robotics game.
The winner-take-most proposition could be spreading to the real world. That is something we must fight, in all ways possible.
Posted Using INLEO
A future with decentralized systems that prioritize permissionless access might be the key to democratizing data, enabling a more level playing field that encourages creativity and competition. It's a crucial conversation as we navigate these challenges together.
"Distillation" ahhh...so thats what i was doing whenever i asked a model how it would explain any of its function/processes to another model it was "mentoring" (i used that language just in case its programmed to not want to share certain things). Makes sense now and it wouldnt be difficult to do on a large scale. Could just have the inferior model interview the other
HIVE lacks enough authoritative, original content to be the main training dataset for an LLM. It would give absolutely wild takes, as much of the content posted has little to no basis in reality.
There would also be an over-abundance of information about HIVE itself, which isn't useful as a general model.
In all of my years on HIVE ( and formerly Steem), there isn't enough critical discourse here, as people see the numbers near their posts as promises they have to defend, and to be careful about stepping on the toes of others who could reduce that number to zero.
Excellent analysis, Task!!
I have a question regarding the "VeniceAI" privacy approach: do they only avoid storing the prompts on their servers (the precise part of the interaction with an AI model that can reveal the user's interests, spelling, and specific personal expressions, and link them to his identity if his IP address is findable), or do they discard even "VeniceAI"'s answers to those prompts?
The first option seems the most fertile one, as the models keep learning through iterations and interactions with the human beings (and bots) directing questions at them, while at the same time getting rid of most of the privacy-sensitive metadata.
Talk about the pot calling the kettle black. Hahaha, you caught my attention in the early paragraphs. OpenAI shouldn't be the ones laying such complaints, or maybe they are also looking for data compensation, as they've always been charged. Lol.
Documenting and analyzing data via blockchain enhances transparency and allows companies to improve their operations more accurately. In addition, there is the most important thing for everyone: reducing costs.
I know this sounds weird, but I'd look into what Hindermburg Melão Jr has to say about AI and LLM models... The dude is just a math genius, literally 230 IQ, one of the smartest people alive, and he made one of the best trading bots on EUR/USD, though there are too many articles on how to backtest that... Among his many topics on YouTube, like astronomy, math, science and so on... there is also AI...
Basically, this dude is able to correct articles by some Nobel prize winners, and he says LLMs ingest data downloaded from the internet, most of which contains errors, so that doesn't improve heuristics much, though it is still possible through algorithms... The idea is that someone with over 180 IQ (above ChatGPT) and technical knowledge in some specific area could still write a better article than AI, because this person would also be able to write it better than, what, almost 8 billion people? (For comparison.)
A lot of the talk centers on statistics, e.g. normal distributions and sigma standard deviations from the average. The guy is really a savant, with a world record in blindfold chess...
!BBH