Introduction


The model is constructed in four steps:

1. Data processing.
  A. Generation of sequence data. The original data contain only genomic coordinates, so the corresponding sequences are extracted from those coordinates. Both the sequence data and the genomic data are then used to recognize methylation sites.
  B. Setting of sample weights. The positive samples of each training set come from 5 single-base-resolution sources, and different sources can assign different labels to the same site. Each sample is therefore weighted according to the results of the different sources at that site: positive samples receive a weight of 2, 3, 4, or 5, and negative samples receive a weight of 1.
  C. Generation of negative samples. Non-methylated sites far outnumber methylated sites on the chromosomes, so negative samples are selected at a positive:negative ratio of 1:10. To improve prediction performance and generalization ability, all negative samples are clustered into 5 classes with a Gaussian mixture model (GMM), and negative samples equal in number to the positive samples are then drawn evenly from the 5 classes.

2. Feature extraction. Three kinds of features are extracted: sequence-based features, physicochemical properties, and gene-derived features. The methods are NCP, CKSNAP, DNC, Mismatch, PC-PseDNC, PC-PseTNC, SC-PseDNC, SC-PseTNC, and 14 gene feature extraction methods.

3. Feature selection and feature splicing. All features except NCP are selected with MRMD, and the features are then spliced together to produce the final feature vector. Compared with training on a single feature, feature selection and splicing can, in principle, significantly improve the performance of the model.

4. Model training with XGBoost. Because it can take sample weights into account and has strong classification ability, XGBoost is considered a suitable classification algorithm. 5-fold cross-validation is used for model training and construction. Independent datasets 1 and 2 are used to further demonstrate the classification capacity and generalization ability of the model. XGBoost involves tuning multiple parameters; see Table S3 for the specific parameter settings.
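As an illustration of the feature extraction step, two of the sequence-based encodings named above (NCP and DNC) can be sketched in a few lines. This is a minimal sketch and not the HSM6AP implementation: the NCP table below uses the commonly cited (ring structure, functional group, hydrogen bonding) assignment, and the function names, the T-to-U mapping, and the handling of unknown bases are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Assumed NCP table: (ring structure, functional group, hydrogen bonding).
NCP_TABLE = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}

# The 16 dinucleotides in a fixed order: AA, AC, AG, AU, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGU", repeat=2)]


def normalize(seq):
    """Upper-case the sequence and map DNA-style T to RNA-style U."""
    return seq.upper().replace("T", "U")


def ncp_encode(seq):
    """Return a flat list with 3 values per base; unknown bases become zeros."""
    features = []
    for base in normalize(seq):
        features.extend(NCP_TABLE.get(base, (0, 0, 0)))
    return features


def dnc_encode(seq):
    """Return 16 dinucleotide frequencies in the order of DINUCLEOTIDES."""
    seq = normalize(seq)
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)  # avoid division by zero for short input
    return [counts[d] / total for d in DINUCLEOTIDES]


if __name__ == "__main__":
    example = "GGACU"  # hypothetical short sequence, for illustration only
    print(ncp_encode(example))
    print(dnc_encode(example))
```

NCP yields a position-dependent vector whose length grows with the sequence, while DNC always yields 16 values, so the two can be concatenated directly when features are spliced.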

HSM6AP can accurately identify m6A sites. The supported file formats are:
  • Fasta
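
For readers preparing input, a FASTA file is a series of records, each with a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such records with the Python standard library. It is a general-purpose illustration, not the parser used by HSM6AP; the function name `read_fasta` and the example record names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    text = [">site_1", "GGACU", "AGGAC", ">site_2", "UUGAC"]
    for name, seq in read_fasta(text):
        print(name, seq)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, for example `read_fasta(open(path))` with `path` pointing to a FASTA file.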