Imputation of ungenotyped individuals based on genotyped relatives using Machine Learning Methodology

Document Type: Original Article


1 Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran

2 Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran

3 Department of Animal Science, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran


Machine learning methods have been used in genetic studies to build models that predict missing genotypes in both human and animal populations. Genotype imputation, the prediction of unknown genotypes, is an important step in such studies. The objective of this study was to evaluate machine learning as a family-based imputation approach and to identify ways of improving imputation performance under different scenarios. The accuracies of two methods, support vector machine (SVM) and random forest (RF), were also compared. The population was simulated with different family structures: 100 families were simulated, each consisting of one sire with a different number of genotyped progeny (2, 3, 4, 5 or 7). The number of markers was set to 5000 across the whole genome. In addition to the sire-with-progeny families, further scenarios were defined to test the ability of the machine learning algorithms to impute genotypes: both parents genotyped, sire or dam and one progeny genotyped, and sire and maternal grandsire genotyped. Imputation accuracy ranged from 0.78 to 0.99 across scenarios. The lowest imputation accuracy was obtained in the sire and maternal grandsire scenario with both methods. Increasing the number of progeny from 2 to 3 considerably increased imputation accuracy for both SVM and RF. Imputation of non-genotyped individuals from parent-offspring trios and from pairs of close relatives is therefore possible. However, when only one parent and one progeny, both parents, or the sire and maternal grandsire were genotyped, average imputation accuracy did not exceed 85%. Genotyped progeny were the best source of information for predicting the genotypes of ungenotyped individuals: when the number of progeny exceeded 4, imputation accuracy rose above 95%. These results confirm that machine learning methods applied to family trios offer good accuracy and computational speed, and that the imputed genotypes can be used in the estimation of breeding values.
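The family design and the accuracy measure described in the abstract can be illustrated with a small simulation. The sketch below is not the SVM or RF pipeline used in the study: it swaps in a simple maximum-likelihood rule as a stand-in classifier, and it assumes unlinked biallelic markers, an allele frequency of 0.5, ungenotyped dams and accuracy defined as the proportion of correctly imputed sire genotypes. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_family(n_progeny, n_markers, rng, freq=0.5):
    """Simulate one sire and his progeny at unlinked biallelic markers.

    Genotypes are coded as counts of the reference allele (0, 1 or 2).
    Dams are unobserved; their alleles are drawn with frequency `freq`
    (an assumption of this sketch, not a detail taken from the study).
    """
    sire = [(rng.random() < freq) + (rng.random() < freq)
            for _ in range(n_markers)]
    progeny = []
    for _ in range(n_progeny):
        child = []
        for g in sire:
            paternal = rng.random() < g / 2  # allele transmitted by the sire
            maternal = rng.random() < freq   # allele from an unobserved dam
            child.append(paternal + maternal)
        progeny.append(child)
    return sire, progeny


def impute_sire(progeny_genotypes, freq=0.5):
    """Return the most probable sire genotype at a single marker.

    Stand-in for the SVM/RF classifiers: a maximum a posteriori rule
    using Hardy-Weinberg priors and Mendelian transmission.
    """
    prior = {0: (1 - freq) ** 2, 1: 2 * freq * (1 - freq), 2: freq ** 2}
    best, best_score = 0, -1.0
    for g in (0, 1, 2):
        t = g / 2  # probability that the sire transmits the reference allele
        score = prior[g]
        for c in progeny_genotypes:
            p_child = {
                0: (1 - t) * (1 - freq),
                1: t * (1 - freq) + (1 - t) * freq,
                2: t * freq,
            }[c]
            score *= p_child
        if score > best_score:
            best, best_score = g, score
    return best


def imputation_accuracy(n_progeny, n_families=100, n_markers=200, seed=1):
    """Proportion of sire genotypes imputed correctly across all families."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_families):
        sire, progeny = simulate_family(n_progeny, n_markers, rng)
        for m, true_genotype in enumerate(sire):
            guess = impute_sire([child[m] for child in progeny])
            correct += guess == true_genotype
            total += 1
    return correct / total


if __name__ == "__main__":
    # Progeny counts follow the abstract; the marker count is reduced for speed.
    for n in (2, 3, 4, 5, 7):
        print(n, "progeny:", round(imputation_accuracy(n), 3))
```

Because the markers here are unlinked and the classifier is deliberately simple, the absolute accuracies from this sketch are not comparable to the values reported above. It is only meant to show how accuracy can be measured per scenario and why additional genotyped progeny carry more information about an ungenotyped sire.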

