Objective: To explore the feasibility of using a deep learning (DL) based ResNet+VST model for intelligent detection of key frames in echocardiography.
Methods: A total of 663 dynamic images covering the apical two-chamber (A2C), apical three-chamber (A3C), and apical four-chamber (A4C) views, which are commonly used in clinical examinations, were collected from the Department of Ultrasound Medicine at Drum Tower Hospital, Nanjing University Medical School. Additionally, 280 dynamic A4C images were selected from the EchoNet-Dynamic public dataset. Two datasets were established: the Nanjing Drum Tower Hospital dataset and the EchoNet-Dynamic-Tiny dataset. The images in each category were divided into training and testing sets in a 4:1 ratio. The ResNet+VST model was trained, and its performance was compared with that of other key frame detection models to verify its superiority.
Results: The ResNet+VST model detected the end-diastolic (ED) and end-systolic (ES) frames of the heart more accurately than the compared models. On the Nanjing Drum Tower Hospital dataset, the model achieved ED frame prediction differences of 1.52±1.09, 1.62±1.43, and 1.27±1.17 for the A2C, A3C, and A4C views, respectively, and ES frame prediction differences of 1.56±1.16, 1.62±1.43, and 1.45±1.38, respectively. On the EchoNet-Dynamic-Tiny dataset, the model achieved an ED frame prediction difference of 1.62±1.26 and an ES frame prediction difference of 1.71±1.18, outperforming existing related studies. Furthermore, the ResNet+VST model exhibited good real-time performance, with average inference times of 21 ms and 10 ms for 16-frame ultrasound sequences on the Nanjing Drum Tower Hospital dataset and the EchoNet-Dynamic-Tiny dataset, respectively, using a GTX 3090Ti GPU. This performance was superior to that of related studies that used long short-term memory (LSTM) for temporal modeling, and it met the requirements for clinical real-time processing.
Conclusion: The proposed ResNet+VST model demonstrates superior accuracy and real-time performance in detecting key frames in echocardiography compared with existing methods. In principle, the model can be applied to any ultrasound view and has the potential to help ultrasound physicians improve diagnostic efficiency.
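The per-category 4:1 split and the "frame prediction difference" figures reported above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the difference is taken to be the absolute gap, in frames, between the predicted and the annotated key frame index, summarized as mean ± population standard deviation, and the split is a simple seeded shuffle within each view. The function names `split_per_view` and `frame_difference_stats` and the sample data are hypothetical and are not taken from the study.

```python
import random
import statistics


def split_per_view(clips_by_view, train_fraction=0.8, seed=0):
    """Split each view's clips into training and testing lists.

    train_fraction=0.8 corresponds to a 4:1 ratio. The seeded shuffle
    is an assumption; the abstract does not say how clips were assigned.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for view, clips in clips_by_view.items():
        shuffled = list(clips)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train[view] = shuffled[:cut]
        test[view] = shuffled[cut:]
    return train, test


def frame_difference_stats(predicted, annotated):
    """Return (mean, std) of |predicted - annotated| over all clips.

    Assumes the metric is an absolute frame-index difference and that
    the spread is a population standard deviation.
    """
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [abs(p - a) for p, a in zip(predicted, annotated)]
    return statistics.fmean(diffs), statistics.pstdev(diffs)


if __name__ == "__main__":
    # Hypothetical clip identifiers, for illustration only.
    clips = {
        "A2C": [f"a2c_{i:03d}" for i in range(10)],
        "A4C": [f"a4c_{i:03d}" for i in range(5)],
    }
    train, test = split_per_view(clips)
    for view in clips:
        print(view, len(train[view]), "train /", len(test[view]), "test")

    # Hypothetical ED frame indices for four test clips.
    mean, std = frame_difference_stats([10, 25, 40, 52], [12, 25, 39, 50])
    print(f"ED frame difference: {mean:.2f} ± {std:.2f} frames")
```

Splitting within each view, rather than over the pooled clips, keeps the 4:1 ratio intact for A2C, A3C, and A4C individually, which matches the statement that the images in each category were divided separately.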