Self-Reported Mental Health Diagnoses (SMHD) dataset

The SMHD (Self-Reported Mental Health Diagnoses) dataset consists of Reddit posts by users who claim to have been diagnosed with one or more of nine mental health conditions ("diagnosed users"), together with posts by matched control users. All posts made to mental health-related subreddits or containing keywords related to a mental health condition were removed from the diagnosed users' data; the control users' data contain no such posts as a result of the selection process.
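The post-removal step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Post` type, the subreddit set, the keyword set, and the plain substring matching are all hypothetical placeholders chosen for illustration; the actual lists and matching rules are described in the paper.

```python
from dataclasses import dataclass

# Hypothetical placeholder lists; the real subreddit and keyword
# lists used to build SMHD are NOT reproduced here.
MENTAL_HEALTH_SUBREDDITS = {"depression", "adhd"}
MENTAL_HEALTH_KEYWORDS = {"depression", "adhd", "diagnosed"}


@dataclass
class Post:
    """Hypothetical minimal representation of a Reddit post."""
    subreddit: str
    text: str


def is_mental_health_post(post: Post) -> bool:
    """Return True if the post is in a listed subreddit or mentions a keyword."""
    if post.subreddit.lower() in MENTAL_HEALTH_SUBREDDITS:
        return True
    text = post.text.lower()
    # Simplification: plain substring matching, not the paper's patterns.
    return any(keyword in text for keyword in MENTAL_HEALTH_KEYWORDS)


def filter_posts(posts: list[Post]) -> list[Post]:
    """Keep only posts that are not mental health-related."""
    return [post for post in posts if not is_mental_health_post(post)]
```

For example, given one post from a listed subreddit, one post mentioning a keyword, and one unrelated post, `filter_posts` would keep only the unrelated post.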

SMHD covers nine mental health conditions. The number of diagnosed users and posts for each condition, and for the control group, is:

Condition        Users      Posts
ADHD             10,098     872K
Anxiety          8,783      795K
Autism           2,911      248K
Bipolar          6,434      575K
Depression       14,139     1,272K
Eating           598        53K
OCD              2,336      203K
PTSD             2,894      258K
Schizophrenia    1,331      123K
Control          335,952    116M

Further dataset construction details are available in Section 3 of the COLING 2018 paper "SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions".

Information on obtaining this dataset can be found here.


Citation


  @InProceedings{cohan2018smhd,
  author    = {Cohan, Arman and Desmet, Bart and Yates, Andrew and Soldaini, Luca and MacAvaney, Sean and Goharian, Nazli},
  title     = {SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions},
  booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
  year      = {2018},
  publisher = {Association for Computational Linguistics},
  pages     = {1485--1497},
  url       = {https://www.aclweb.org/anthology/C18-1126}
  }

Contact Information

For any comments or questions, please email Arman, Bart or Andrew.