Reddit Self-reported Depression Diagnosis (RSDD) dataset

The RSDD (Reddit Self-reported Depression Diagnosis) dataset consists of Reddit posts from approximately 9,000 users who reported having been diagnosed with depression ("diagnosed users") and approximately 107,000 matched control users. All posts made to mental health-related subreddits or containing depression-related keywords were removed from the diagnosed users' data; the control users' data contain no such posts because of the way control users were selected.
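
The post filtering described above can be illustrated with a minimal sketch. It is not the authors' code: the subreddit names, the keyword stems, and the `subreddit`/`text` field names are hypothetical placeholders, and the actual lists and matching rules are those given in the paper.

```python
import re

# Hypothetical placeholders, NOT the lists used to build RSDD.
MENTAL_HEALTH_SUBREDDITS = {"depression", "mentalhealth"}
DEPRESSION_KEYWORDS = ("depress", "diagnos")

# One case-insensitive pattern that matches any of the keyword stems.
_KEYWORD_RE = re.compile(
    "|".join(re.escape(k) for k in DEPRESSION_KEYWORDS), re.IGNORECASE
)


def keep_post(subreddit: str, text: str) -> bool:
    """Return True if a post survives both filters."""
    if subreddit.lower() in MENTAL_HEALTH_SUBREDDITS:
        return False  # posted to a mental health-related subreddit
    return _KEYWORD_RE.search(text) is None  # no depression-related keyword


def filter_posts(posts):
    """Keep only posts (dicts with 'subreddit' and 'text') that pass keep_post."""
    return [p for p in posts if keep_post(p["subreddit"], p["text"])]
```

Filtering on both the subreddit and the post text means a post is dropped if either signal is present, so a classifier trained on the remaining posts cannot rely on explicit mentions of the condition.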

Further details on dataset construction are available in Section 3.1 of the EMNLP 2017 paper "Depression and Self-Harm Risk Assessment in Online Forums" or on the data website.

Information on obtaining this dataset can be found here.

Code for reproducing the baseline methods can be found here.


@inproceedings{yates-etal-2017-depression,
  author    = {Yates, Andrew  and  Cohan, Arman  and  Goharian, Nazli},
  title     = {Depression and Self-Harm Risk Assessment in Online Forums},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2017},
  publisher = {Association for Computational Linguistics},
  pages     = {2958--2968},
  url       = {}
}

Contact Information

For any comments or questions, please email Andrew or Arman.