Polymorphism analysis of the variable region in CagA protein with the SAS software package

Funding: Jiangsu Province "135" Key Discipline Project (2001-31)

    Abstract:

    Objective: To systematically characterize the polymorphism of Helicobacter pylori (H. pylori) CagA protein sequences and its features by data mining with the SAS 9.0 software package.

    Methods: CagA protein sequences were retrieved from the NCBI, Swiss-Prot/TrEMBL and DDBJ protein databases. SAS data-processing tools were combined with sequence alignment in bioinformatics software to clean the records and build a data warehouse of CagA protein sequences. The sequences were then analyzed statistically with BioEdit 7.0, SAS 9.0 and Student's t-test.

    Results: Redundant sequences were identified and deleted, leaving complete CagA protein sequences from 97 strains and 3′ variable-region (partial) sequences from 268 H. pylori strains. Arranging the repeats of the variable region in order in SAS data sets made them easy to inspect and established the position and length of the variable region; the repeat number, polymorphism and probability distribution of the EPIYA motif; and the types, features and frequencies of the interval sequences between EPIYA motifs. Differences in the length of CagA proteins were mainly caused by variation in the variable region, whose mean length in the 365 strains was 115.76 ± 27.38 aa. Within the variable region the EPIYA motif was repeated 3.28 ± 0.72 times on average (minimum 1, maximum 7). Nine mutant forms of EPIYA were found, accounting for 7.18% of all motifs. There were seven main types of interval sequence between two EPIYA motifs. "FPLKRHDKVDELIKVG" and "TIDDLGGP" in R3C and R4C were characteristic of Western-type strains, whereas "KIASAGKGVGGFSGAG" in R3D and "FPLRRSAAVNDLSKVG" and "TIDFDEAN" in R4D were characteristic of East Asian-type strains. Because the order and number of EPIYA-A, -B, -C and -D sites and their interval sequences differ, the CagA variable region fell into 17 different types (ABC-types or ABD-types). East Asian-type strains had significantly fewer EPIYA motif repeats and EPIYA-D sites than Western-type strains, but more EPIYA-A and EPIYA-B sites.

    Conclusion: The SAS software package can be applied effectively to polymorphism analysis of the variable region of CagA protein sequences. It captures the sequence features of the variable region as a whole and yields a description that is more detailed, systematic and reasonable than earlier ones. Given the polymorphism of the CagA variable region and its relationship with cytotoxicity, further research building on this study may reveal more molecular biological mechanisms.
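    The motif counting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SAS program, which the abstract does not reproduce. The function names and the two toy sequences are hypothetical and invented for illustration; only the marker strings are quoted from the abstract. The sketch counts exact EPIYA matches only, so the nine mutant forms mentioned above would need an extra rule that the abstract does not specify.

```python
import re
from statistics import mean, stdev

# Characteristic interval-sequence motifs quoted in the abstract.
WESTERN_MARKERS = ("FPLKRHDKVDELIKVG", "TIDDLGGP")
EAST_ASIAN_MARKERS = ("KIASAGKGVGGFSGAG", "FPLRRSAAVNDLSKVG", "TIDFDEAN")


def epiya_positions(seq):
    """Return the 0-based start of every exact EPIYA motif in seq."""
    return [m.start() for m in re.finditer("EPIYA", seq)]


def spacers(seq):
    """Return the interval sequences between consecutive EPIYA motifs."""
    pos = epiya_positions(seq)
    # Each spacer starts right after one motif (5 residues) and ends at the next.
    return [seq[a + 5:b] for a, b in zip(pos, pos[1:])]


def strain_type(seq):
    """Label a sequence by which set of marker motifs it contains (assumed rule)."""
    west = any(m in seq for m in WESTERN_MARKERS)
    east = any(m in seq for m in EAST_ASIAN_MARKERS)
    if west and not east:
        return "Western"
    if east and not west:
        return "East Asian"
    return "undetermined"


def summarize(values):
    """Format the mean and sample standard deviation as 'mean ± SD'."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


# Toy sequences invented for illustration; they are NOT real CagA data.
toy_west = "AAEPIYAKKKEPIYAFPLKRHDKVDELIKVGEPIYATIDDLGGP"
toy_east = "EPIYAKIASAGKGVGGFSGAGEPIYATIDFDEAN"

if __name__ == "__main__":
    for name, seq in (("toy_west", toy_west), ("toy_east", toy_east)):
        print(name, len(epiya_positions(seq)), spacers(seq), strain_type(seq))
    counts = [len(epiya_positions(s)) for s in (toy_west, toy_east)]
    print("EPIYA repeats:", summarize(counts))
```

    Applied to a real set of variable-region sequences, the per-strain counts returned by `epiya_positions` could be summarized with `summarize` in the same "mean ± SD" form the abstract reports.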

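Cite this article

Xu Shunfu, Zhang Guoxin, Shi Ruihua, Hao Bo, Miao Yi. Polymorphism analysis of the variable region in CagA protein with the SAS software package [J]. Journal of Nanjing Medical University (Natural Science Edition) [南京医科大学学报(自然科学版)], 2007, (11): 1221-1227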

History
  • Received: 2007-04-29