Abstract: Objective: To explore the feasibility of using a deep-learning-based ResNet+VST model for intelligent detection of keyframes in echocardiography. Methods: A total of 663 dynamic images collected by the Department of Ultrasound Medicine, Affiliated Hospital of Medical School, Nanjing University (covering three views commonly used in clinical examination: apical two-chamber, apical three-chamber and apical four-chamber) and 280 apical four-chamber echocardiographic dynamic images from the public EchoNet-Dynamic dataset were selected to establish the Nanjing Drum Tower Hospital dataset and the EchoNet-Dynamic-Tiny dataset, respectively. The images of each view were divided into a training set and a test set at a ratio of 4:1. The ResNet+VST model was then trained and compared with other keyframe detection models to verify its advantages. Results: The ResNet+VST model detected end-diastolic and end-systolic frames more accurately than the compared models. On the Nanjing Drum Tower Hospital dataset, the end-diastolic prediction frame differences of the apical two-chamber, apical three-chamber and apical four-chamber models were 1.52±1.09, 1.62±1.43 and 1.27±1.17, respectively, and the end-systolic prediction frame differences were 1.56±1.16, 1.62±1.43 and 1.45±1.38, respectively. On the EchoNet-Dynamic-Tiny dataset, the end-diastolic prediction frame difference of the apical four-chamber model was 1.62±1.26 and the end-systolic prediction frame difference was 1.71±1.18, which is better than existing related studies. The ResNet+VST model also showed good real-time performance: on the Nanjing Drum Tower Hospital and EchoNet-Dynamic-Tiny datasets, the average inference time for a 16-frame ultrasound sequence clip on an RTX 3090 Ti GPU was 21 ms and 10 ms, respectively, which is faster than related approaches that model time series with long short-term memory (LSTM) units and basically meets the needs of clinical real-time processing.
Conclusion: Compared with existing research, the ResNet+VST model proposed in this study achieves better accuracy and real-time performance in keyframe detection. In principle, the model can be extended to any ultrasound view, and it has the potential to help ultrasound physicians improve diagnostic efficiency.
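The frame differences in the Results are reported as mean ± standard deviation. The abstract does not define the metric, so the following is only a minimal sketch. It assumes the frame difference is the absolute gap between the predicted and the annotated keyframe index, and that the spread is a population standard deviation. The function name `frame_difference_stats` and the sample indices are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def frame_difference_stats(predicted, annotated):
    """Return (mean, std) of |predicted - annotated| over paired frame indices.

    Assumption: "frame difference" means the absolute index gap, and the
    spread is the population standard deviation.
    """
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not predicted:
        raise ValueError("at least one pair of frame indices is required")
    diffs = [abs(p - a) for p, a in zip(predicted, annotated)]
    return mean(diffs), pstdev(diffs)


# Hypothetical end-diastolic frame indices, for illustration only
# (not data from the study).
predicted_ed = [12, 40, 71, 99]
annotated_ed = [11, 42, 71, 100]

avg, std = frame_difference_stats(predicted_ed, annotated_ed)
print(f"ED frame difference: {avg:.2f} ± {std:.2f}")
```

The same function would apply unchanged to end-systolic indices. If the study used a sample standard deviation instead, `pstdev` would be replaced by `statistics.stdev`.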