The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.

    Download

    The Pile is hosted by the Eye.

    The format of the Pile is jsonlines data compressed using zstandard.
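
    As a minimal sketch of reading this format, the snippet below streams records from a zstandard-compressed jsonlines file one line at a time. It assumes the third-party `zstandard` package is installed; the helper names, the file path in the usage note, and the `"text"` key in the sample data are illustrative assumptions, not something this page specifies.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_zst_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a zstandard-compressed jsonlines file."""
    # Third-party dependency (assumption): pip install zstandard
    import zstandard

    with open(path, "rb") as raw:
        # stream_reader decompresses incrementally, so the whole
        # file never has to fit in memory.
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


# Uncompressed sample data with a hypothetical "text" key.
sample = io.StringIO('{"text": "first document"}\n\n{"text": "second document"}\n')
records = list(iter_jsonl(sample))
```

    With a downloaded shard at a hypothetical path such as `shard.jsonl.zst`, `for record in iter_zst_jsonl("shard.jsonl.zst")` would iterate over the documents without decompressing the file to disk first.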

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks but also show significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
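
    For readers unfamiliar with the metric, the sketch below shows one common way to compute bits per byte: take a model's total negative log-likelihood over a text (in nats), convert it to bits, and divide by the number of UTF-8 bytes. The function name and the example numbers are illustrative assumptions; this is not the evaluation code behind the leaderboard.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # Dividing by ln(2) converts nats to bits; dividing by the byte
    # count makes the score independent of the model's tokenizer.
    return total_nll_nats / (math.log(2) * n_bytes)


# Hypothetical example: a model that spends exactly one bit
# (ln 2 nats) per byte on a 4-byte string.
example_bpb = bits_per_byte(4 * math.log(2), "abcd")
```

    Normalizing by bytes rather than tokens is what lets models with different vocabularies be compared on the same scale; lower is better.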

    Citing

    If you use the Pile or any of the components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

    Rank  Date         Model               Organization  Test BPB
    1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
    2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
