Abstract: Objective: To systematically estimate polymorphism in the CagA protein and analyze its characteristics by data mining with the SAS 9.0 software package. Methods: CagA protein sequences were retrieved from the NCBI, Swiss-Prot/TrEMBL and DDBJ protein databases. Data cleaning and programming were performed with tools provided by the SAS software package, and a data warehouse of CagA protein sequences was built. The sequences were then analyzed statistically with BioEdit 7.0, SAS 9.0 and Student's t-test. Results: Redundant sequences were identified and deleted. Complete CagA protein sequences from 97 strains and 3’ variable-region sequences from 268 Helicobacter pylori strains were obtained. Repeats in the variable region were arranged in order in SAS datasheets, where they could be observed clearly. The exact position of the variable region within the CagA sequence was determined. The mean number of EPIYA motif repeats, their polymorphisms and their probability distribution were obtained, as were the types, characteristics and frequencies of the interval sequences between two EPIYA motif repeats. The variable region contained 115.76 ± 27.38 amino acids on average across the 365 strains, and this variation was the main cause of the length diversity of CagA proteins. EPIYA motifs were repeated 3.28 ± 0.72 times on average in the variable region, with a maximum of seven and a minimum of one. Nine kinds of mutant EPIYA motif were found, accounting for 7.18% of all motifs. Seven kinds of interval sequence were found in the variable region. “FPLKRHDKVDELIKVG” and “TIDDLGGP” in the R3C and R4C motifs were characteristic sequences of Western-type strains, whereas “FPLRRSAAVNDLSKVG” and “TIDFDEAN” in the R4D motif were characteristic sequences of East Asian-type strains. Because the order and number of EPIYA-A, -B, -C and -D sites vary, there are 17 different ABC-types or ABD-types of variable region in CagA proteins.
East Asian-type strains had significantly fewer EPIYA motif repeats and EPIYA-D sites, but significantly more EPIYA-A and EPIYA-B sites, than Western-type strains. Conclusion: The SAS software package can be applied effectively to the analysis of polymorphisms in the variable region of CagA protein sequences. The characteristics of the repeat sequences in the variable region of CagA are elucidated as a whole, and their annotation becomes more reasonable, more systematic and more specific than before. Given the polymorphic characteristics of CagA protein sequences and their relationship with cytotoxicity, further research building on this study is needed to uncover the underlying molecular biological mechanisms.
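The motif counting and strain typing summarized above can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not the authors' program. It assumes that only the canonical EPIYA motif is counted (the nine mutant forms are not listed in the abstract, so they are left out) and that a strain is typed solely by the four characteristic interval sequences quoted above. The function names and the two toy sequences are hypothetical and are not real CagA data.

```python
import re
from statistics import mean, stdev

# Canonical motif only; the mutant EPIYA forms are not listed in the
# abstract, so this sketch does not attempt to match them (assumption).
EPIYA = re.compile(r"EPIYA")

# Characteristic interval sequences quoted in the abstract.
WESTERN_MARKERS = ("FPLKRHDKVDELIKVG", "TIDDLGGP")
EAST_ASIAN_MARKERS = ("FPLRRSAAVNDLSKVG", "TIDFDEAN")


def count_epiya(seq):
    """Count non-overlapping canonical EPIYA motifs in a protein sequence."""
    return len(EPIYA.findall(seq.upper()))


def classify_strain(seq):
    """Type a sequence by which set of marker sequences it contains."""
    seq = seq.upper()
    western = any(marker in seq for marker in WESTERN_MARKERS)
    east_asian = any(marker in seq for marker in EAST_ASIAN_MARKERS)
    if western and not east_asian:
        return "Western"
    if east_asian and not western:
        return "East Asian"
    return "unclassified"


def summarize_repeats(seqs):
    """Return (mean, sample standard deviation) of EPIYA counts."""
    counts = [count_epiya(s) for s in seqs]
    sd = stdev(counts) if len(counts) > 1 else 0.0
    return mean(counts), sd


# Hypothetical toy sequences for illustration only (not real CagA data).
toy_western = "MTNEPIYAKVNKEPIYAQVAKEPIYATIDDLGGPFPLKRHDKVDELIKVG"
toy_east_asian = "MTNEPIYAKVNKEPIYATIDFDEANFPLRRSAAVNDLSKVG"

for name, seq in [("toy_western", toy_western), ("toy_east_asian", toy_east_asian)]:
    print(name, count_epiya(seq), classify_strain(seq))
```

Applied to a full set of variable-region sequences, `summarize_repeats` would give a mean ± SD of the kind reported above for EPIYA repeats, although a faithful reproduction would also need the mutant motifs and the study's own sequence data.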