Reddit Self-reported Depression Diagnosis (RSDD) dataset

The RSDD (Reddit Self-reported Depression Diagnosis) dataset consists of Reddit posts from approximately 9,000 users who have claimed to have been diagnosed with depression ("diagnosed users") and approximately 107,000 matched control users. All posts made to mental health-related subreddits or containing depression-related keywords were removed from the diagnosed users' data; the control users' data contain no such posts because of the selection process.
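The post-removal step described above can be pictured with a minimal sketch. This is an illustration only, not the procedure used to build RSDD: the subreddit names, the keyword list, the post fields (`subreddit`, `text`), and the helper names `keep_post` and `filter_posts` are all assumptions made for the example.

```python
import re

# Hypothetical lists for illustration only; the actual subreddits and
# keywords used to construct RSDD are not specified on this page.
MENTAL_HEALTH_SUBREDDITS = {"depression", "mentalhealth"}
DEPRESSION_KEYWORDS = ["depression", "depressed", "antidepressant"]

# One case-insensitive pattern that matches any keyword as a whole word.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in DEPRESSION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def keep_post(post):
    """Return True if the post is outside the listed subreddits
    and its text contains none of the listed keywords."""
    if post["subreddit"].lower() in MENTAL_HEALTH_SUBREDDITS:
        return False
    return _KEYWORD_RE.search(post["text"]) is None


def filter_posts(posts):
    """Keep only the posts that pass both checks."""
    return [p for p in posts if keep_post(p)]


# Assumed post structure: a dict with "subreddit" and "text" fields.
posts = [
    {"subreddit": "depression", "text": "Rough week."},
    {"subreddit": "cooking", "text": "I was depressed after burning dinner."},
    {"subreddit": "cooking", "text": "Best lasagna recipe?"},
]
kept = filter_posts(posts)
```

In this toy input, the first post is dropped because of its subreddit and the second because of a keyword match, so only the third remains in `kept`.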

Further details on dataset construction are available below, in Section 3.1 of the EMNLP 2017 paper "Depression and Self-Harm Risk Assessment in Online Forums", or on the data website.

Obtaining the data

The RSDD dataset contains only publicly available Reddit posts. Posts may contain information related to users' health, however, and are thus sensitive. To protect users' privacy, researchers who wish to obtain the dataset must sign a data usage agreement.

Succinctly, the agreement requires that researchers

Researchers interested in obtaining the RSDD dataset may submit a data request form to be provided with the data usage agreement and further information on obtaining the data.


  author    = {Yates, Andrew  and  Cohan, Arman  and  Goharian, Nazli},
  title     = {Depression and Self-Harm Risk Assessment in Online Forums},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2017},
  publisher = {Association for Computational Linguistics},
  pages     = {2958--2968},
  url       = {}

Contact Information

For any comments or questions, please email Andrew or Arman.