The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.

    Download

    The Pile is hosted by the Eye.

    The format of the Pile is jsonlines data compressed using zstandard.
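
    As a minimal sketch of reading this format, the snippet below parses jsonlines records and, separately, opens a zstandard-compressed file as a text stream. It assumes the third-party `zstandard` package is installed; the helper names `iter_documents` and `open_zst_text` are hypothetical and not part of any official Pile tooling.

```python
import io
import json
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (jsonlines)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


def open_zst_text(path: str) -> io.TextIOWrapper:
    """Open a zstandard-compressed file as a streaming UTF-8 text file.

    Assumption: the third-party `zstandard` package is available
    (pip install zstandard). It is imported lazily so that
    iter_documents can be used without it.
    """
    import zstandard

    raw = open(path, "rb")
    # stream_reader decompresses incrementally, so the whole file
    # never has to fit in memory.
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")
```

    To use it, pass the stream returned by `open_zst_text("shard.jsonl.zst")` to `iter_documents`, where the file name is a placeholder for whichever shard was downloaded. Streaming line by line keeps memory use flat even for very large shards.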

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile show not only moderate improvements on traditional language modeling benchmarks but also significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
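
    For readers unfamiliar with the metric, here is a minimal sketch of how bits per byte is commonly computed: the total negative log-likelihood of a text, converted from nats to bits and divided by the text's length in UTF-8 bytes. The function names are hypothetical, and this is an illustration of the usual definition, not the leaderboard's exact evaluation code.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Normalizing by UTF-8 bytes rather than tokens makes scores
    comparable across models with different tokenizers.
    """
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text must contain at least one byte")
    return total_nll_nats / (num_bytes * math.log(2))


def bits_per_byte_from_token_loss(
    mean_token_loss_nats: float, num_tokens: int, num_bytes: int
) -> float:
    """Same quantity, starting from an average per-token loss in nats."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return (num_tokens / num_bytes) * mean_token_loss_nats / math.log(2)
```

    Lower is better: a model that assigns higher probability to the test text needs fewer bits to encode each byte of it.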

    Citing

    If you use the Pile or any of its components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

    Rank  Date         Model               Organization  Test BPB
    1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
    2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
