Releasing the largest multilingual open pretraining dataset
This release marks the largest open and permissively licensed dataset for language model training. It has substantial multilingual representation.