We collected experimentally validated human serine ADP-ribosylation (S-ADPr) sites from two previously published studies [1, 2]. To enable the model to learn the characteristics of S-ADPr sites, we built a balanced data set as follows. (1) The 2,158 proteins containing 6,573 S-ADPr sites were clustered using CD-HIT with a sequence identity threshold of 40% to remove similar protein sequences. (2) From each cluster, the protein containing the largest number of S-ADPr sites was retained as the cluster representative, giving 1,764 proteins. The 5,464 S-ADPr sites in these representatives were taken as positive sites, and the remaining serine sites were taken as negative sites. To balance the positive and negative samples, 5,464 negative sites were randomly chosen as the final negative samples. (3) The positive and negative samples were each divided equally into five groups. Four groups served as the cross-validation data set (i.e., 4,371 positive and 4,371 negative sites), and the remaining group was used as the independent test set (i.e., 1,093 positive and 1,093 negative sites).
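
Steps (2) and (3) can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names `pick_representatives` and `split_balanced`, the input layout (a mapping from cluster ID to member proteins, and a mapping from protein to its number of S-ADPr sites), the random seed, and the choice of which of the five groups is held out are all assumptions made for illustration. Parsing the CD-HIT cluster output is not shown.

```python
import random


def pick_representatives(clusters, site_counts):
    """For each cluster, keep the member with the most S-ADPr sites.

    clusters: dict mapping a cluster ID to a list of protein IDs (assumed layout).
    site_counts: dict mapping a protein ID to its number of S-ADPr sites.
    Ties are resolved by keeping the first member listed.
    """
    return [
        max(members, key=lambda protein: site_counts.get(protein, 0))
        for members in clusters.values()
    ]


def partition(items, n_groups):
    """Split items into n_groups contiguous groups of near-equal size."""
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more
        groups.append(items[start:start + size])
        start += size
    return groups


def split_balanced(positives, negatives, n_groups=5, seed=0):
    """Undersample negatives to match positives, then hold out one group.

    Returns (cv, test), each a dict with "pos" and "neg" lists.
    Holding out the first group is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    # Draw as many negatives as there are positives (without replacement).
    sampled_neg = rng.sample(negatives, len(positives))
    shuffled_pos = list(positives)
    rng.shuffle(shuffled_pos)

    pos_groups = partition(shuffled_pos, n_groups)
    neg_groups = partition(sampled_neg, n_groups)

    test = {"pos": pos_groups[0], "neg": neg_groups[0]}
    cv = {
        "pos": [site for group in pos_groups[1:] for site in group],
        "neg": [site for group in neg_groups[1:] for site in group],
    }
    return cv, test
```

With 5,464 positive sites and a larger pool of candidate negative sites, this split yields 4,371 + 4,371 sites for cross-validation and 1,093 + 1,093 sites for the independent test, matching the counts reported above.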

[1] Hendriks I, Larsen S, Nielsen M. An Advanced Strategy for Comprehensive Profiling of ADP-ribosylation Sites Using Mass Spectrometry-based Proteomics. Molecular & Cellular Proteomics. 2019; 18: 1010-26.
[2] Larsen S, Hendriks I, Lyon D, Jensen L, Nielsen M. Systems-wide Analysis of Serine ADP-Ribosylation Reveals Widespread Occurrence and Site-Specific Overlap with Phosphorylation. Cell Reports. 2018; 24: 2493-505.e4.
