Chinese Internet corpus resource platform open source data


1. Platform background

The platform was launched by the China Cyberspace Security Association and the National Internet Emergency Response Center. It aims to provide high-quality, reliable Chinese Internet corpus resources to support artificial intelligence model training, natural language processing research, and other applications.


2. Resource characteristics

The platform has released "Chinese Internet Basic Corpus 2.0", which covers 27 datasets with a total volume of about 2.7 TB. The basic corpus portion is about 120 GB and contains about 38 million records. All data has been source-verified, content-filtered, and deduplicated to ensure the accuracy and reliability of the content.
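
The platform's actual processing pipeline is not described here. As a rough illustration of what content filtering and deduplication can look like for a text corpus, the following is a minimal sketch that assumes records are plain strings, uses a hypothetical minimum-length rule as the filter, and removes only exact duplicates after normalization. The function names and the `min_chars` parameter are illustrative and not part of the platform.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def filter_and_dedupe(records, min_chars=10):
    """Yield normalized records that pass a length filter and have not been seen before."""
    seen = set()
    for text in records:
        cleaned = normalize(text)
        # Hypothetical content filter: drop very short fragments.
        if len(cleaned) < min_chars:
            continue
        # Exact deduplication: compare SHA-256 digests of the normalized text.
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield cleaned


sample = [
    "中文互联网基础语料2.0 已经发布。",
    "中文互联网基础语料2.0  已经发布。",  # same text with extra whitespace
    "短文本",  # too short, removed by the filter
]
result = list(filter_and_dedupe(sample, min_chars=5))
```

Hashing normalized text only catches exact copies; near-duplicate detection would need a different technique and is outside this sketch.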


3. Open source value

After registering and completing certification, users can download the data for scientific research, industry, and other uses. The release is intended to support the open source ecosystem and to promote innovation in, and application of, large models and natural language processing technology for Chinese.


For details, please refer to the official website:

https://corpus.cybersac.cn/?home#/index
