Extracting kymograms from high-speed laryngeal videos
Kymography is a method for visualizing the motion of the vocal folds that can be used for medical diagnosis. The reliability of the diagnosis depends on the accuracy of the extracted kymogram, which deteriorates when noise and camera motion interfere with the motion of the vocal folds. These problems are hard to avoid in practice: captured videos contain color artifacts, saturated highlights, and noise, and the camera is held by a human operator. Frames therefore need to be enhanced and registered before the kymogram is extracted. With the help of the Isfahan Rehabilitation Center, we have developed fast software that extracts the kymogram of vocal fold motion from laryngoscopy videos with higher precision than some of the software previously used in rehabilitation centers. Although the software runs at an acceptable speed on a CPU, its algorithms are designed to be well suited to parallelization on a GPU. A major advantage of the software is that it works on low-resolution videos, which reduces the cost of the camera required. It also provides additional information that can be used both for research and for better diagnosis.
Purpose of this software
“The various causes of dysphonia cannot be properly differentiated without knowledge of the anatomy and physiology of phonation. Furthermore, visualization of the larynx and vocal folds vibration is essential to the diagnosis. Alterations in vocal fold vibration can contribute to the development of laryngeal pathologies or can be the result of such pathologies, having, in either case, a direct impact on the acoustic quality of the voice. Vocal folds vibrate at a frequency of 80 to 1,000 cycles per second. As the human eye is capable of perceiving no more than five images per second, it is impossible to evaluate the vibration of the vocal folds during phonation. According to Talbot’s law, an image projected on the retina will persist for 0.2 seconds. When successive images are presented at intervals of less than 0.2 seconds, they merge and our retina sees the movement as stationary.”
“The opening and closing of the vocal folds at high frequencies is a major source of sound in human speech. Videokymography is a technique for visualizing the motion of the vocal folds for medical diagnosis: The vibrating folds are filmed with an endoscopic camera pointed into the larynx. The camera records at a very high framerate to capture vocal fold vibration. Alternatively, a low framerate and stroboscopic lighting at a frequency synchronized with the vibratory frequency of the vocal folds is used to obtain a temporal subsampling of the motion (see Figure 1 for example frames). The kymogram used for medical diagnosis is essentially a time-slice image, i.e. an X-t-cut through the X-Y-t image cube of the endoscopic video (Figure 2). The quality and diagnostic interpretability of a kymogram deteriorates significantly if the camera moves relative to the scene as this motion interferes with the vibratory motion of the vocal fold in the kymogram. Scene-to-camera motion caused by the patient or the operator of the endoscope is hard to avoid in medical practice. In this paper, we propose an approach to stabilizing the motion of endoscopic video for kymography.”
Figure 1: Sample frames
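The time-slice idea described above can be illustrated with a minimal sketch. It assumes the video is already available as a list of grayscale frames (nested Python lists) and that the scan line is horizontal. The function name `extract_kymogram` and the parameter `row` are hypothetical and are not taken from the software described here.

```python
def extract_kymogram(frames, row):
    """Return the X-t slice of a video at a fixed image row.

    frames: list of T frames, each a list of H rows, each row a list of
            W grayscale pixel values (the X-Y-t image cube).
    row:    y coordinate of the scan line (hypothetical parameter name).
    Result: list of T lines of W pixels; line t is a copy of frames[t][row].
    """
    if not frames:
        return []
    height = len(frames[0])
    if not 0 <= row < height:
        raise ValueError(f"row {row} is outside the frame height {height}")
    # Copy each scan line so that edits to the kymogram do not alter the video.
    return [list(frame[row]) for frame in frames]


# Tiny synthetic video: 3 frames of 4 x 5 pixels, pixel value = 100*t + 10*y + x.
video = [[[100 * t + 10 * y + x for x in range(5)] for y in range(4)]
         for t in range(3)]
kymogram = extract_kymogram(video, row=2)
```

Each line of the result is the same image row taken from one frame, so time runs down the kymogram and position along the scan line runs across it. With an array library the same cut would typically be a single slicing operation over the frame stack.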
“This motion compensation problem is challenging and different from motion compensation of handheld video in several respects: Firstly, the camera motion to be eliminated may be significantly larger than a typical camera shake due to the short distance between camera and scene. Secondly, not only the camera and the vocal folds move but the entire scene may be highly non rigid, for example when the ariepiglottic fold and the cuneiform cartilage move when the patient takes breath. Therefore, a 3D camera estimation approach is not possible throughout the entire endoscopic sequence. Finally, the image quality of the input material can be challenging. Depending on the endoscopic system, the algorithm has to cope with high noise levels, large areas of saturated highlights, interlacing artifacts, depth of field blur, false colors, etc.” (see Figure 3).
Because of this high level of noise, and because the scene changes with both the motion of the vocal folds and the motion of the operator's hand, standard feature-based algorithms such as SIFT and SURF cannot always achieve the desired precision. At the same time, the similarity of consecutive frames allows us to use faster and more accurate registration. Another important challenge in designing the software was speed: registration and kymogram extraction for high-speed laryngeal videos must be fast enough that the user obtains the kymogram in a reasonable time. Further speed-ups are possible because the algorithms are suitable for parallelization on a GPU. The user can also mark several lines on the image of the larynx to observe and compare the opening amplitude of the glottis at different positions, and parallelization makes it possible to extract several kymograms in the same amount of time. The software can also compare these kymograms automatically by computing the ratio of their opening amplitudes and the ratio of the times during which the glottis is open. The open time may differ between positions in some people, for example in patients with a cyst on one part of the larynx, and the opening amplitude may differ between the right and left sides of the glottis.
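To illustrate why the similarity of consecutive frames helps registration, the sketch below estimates a small integer translation between two frames by exhaustive search over a narrow window. It is not the registration algorithm used in the software, which is not specified here. The block-matching criterion (mean absolute difference), the function names, and the `max_shift` parameter are all assumptions chosen for illustration.

```python
def sample_with_offset(img, dx, dy, fill=0):
    """Return out with out[y][x] = img[y + dy][x + dx], or fill outside img."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def mean_abs_diff(ref, img, dx, dy):
    """Mean |ref[y][x] - img[y + dy][x + dx]| over the overlapping area."""
    h, w = len(ref), len(ref[0])
    total, count = 0, 0
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            total += abs(ref[y][x] - img[y + dy][x + dx])
            count += 1
    return total / count if count else float("inf")


def estimate_shift(ref, img, max_shift=2):
    """Find (dx, dy) with |dx|, |dy| <= max_shift that best maps ref onto img.

    Consecutive frames differ by a small motion, so a narrow search window
    is enough (an assumption of this sketch).
    """
    best, best_cost = (0, 0), mean_abs_diff(ref, img, 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = mean_abs_diff(ref, img, dx, dy)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best


# Synthetic 8 x 8 reference frame with a small bright blob.
ref = [[0] * 8 for _ in range(8)]
ref[3][3], ref[3][4], ref[4][3] = 255, 200, 150
# Simulate camera motion: content moves 1 pixel right and 2 pixels up.
moved = sample_with_offset(ref, -1, 2)
shift = estimate_shift(ref, moved)
aligned = sample_with_offset(moved, *shift)
```

The cost is normalized by the size of the overlap so that larger shifts, which compare fewer pixels, are not favored. Every candidate shift is evaluated independently of the others, which is the kind of structure that maps naturally onto parallel hardware.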
Figure 2: Sample kymogram
Figure 3: Sample noise in frames
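The automatic comparison of kymograms taken at different scan lines can be sketched as follows. The sketch assumes that the glottal opening appears as dark pixels and can be separated with a fixed intensity threshold, which is a simplification for illustration only. The names `opening_stats` and `compare_kymograms` and the `threshold` parameter are hypothetical.

```python
def opening_stats(kymogram, threshold=50):
    """Return (opening amplitude, open fraction) for one kymogram.

    kymogram: list of T lines of W grayscale pixels.
    Pixels darker than threshold are counted as glottal opening
    (an assumption of this sketch).
    """
    widths = [sum(1 for p in line if p < threshold) for line in kymogram]
    if not widths:
        return 0, 0.0
    amplitude = max(widths)                                   # widest opening
    open_fraction = sum(1 for w in widths if w > 0) / len(widths)
    return amplitude, open_fraction


def compare_kymograms(kymo_a, kymo_b, threshold=50):
    """Return (amplitude ratio, open-time ratio) of kymo_a relative to kymo_b.

    A ratio is None when the glottis never opens in kymo_b.
    """
    amp_a, open_a = opening_stats(kymo_a, threshold)
    amp_b, open_b = opening_stats(kymo_b, threshold)
    amp_ratio = amp_a / amp_b if amp_b else None
    open_ratio = open_a / open_b if open_b else None
    return amp_ratio, open_ratio


def synthetic_kymogram(widths, line_length=10):
    """Build a test kymogram: bright tissue with a centered dark gap per line."""
    lines = []
    for w in widths:
        start = (line_length - w) // 2
        lines.append([0 if start <= x < start + w else 200
                      for x in range(line_length)])
    return lines


# Two scan lines over two vibration cycles: one opens wider and for longer.
kymo_a = synthetic_kymogram([0, 2, 4, 2, 0, 2, 4, 2])
kymo_b = synthetic_kymogram([0, 0, 2, 0, 0, 0, 2, 0])
amp_ratio, open_ratio = compare_kymograms(kymo_a, kymo_b)
```

In this synthetic case the first scan line opens twice as wide and is open three times as long as the second. Real recordings would need a more robust segmentation of the glottis than a fixed threshold.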
Team members: Ali Ebrahimpour Boroojeny
Supervisor: Dr. Hossein Rabbani
 D. H. Tsuji, A. Hachiya, M. E. Dajer, C. C. Ishikawa, M. T. Takahashi, and A. N. Montagnoli, “Improvement of vocal pathologies diagnosis using high-speed videolaryngoscopy,” Int. Arch. Otorhinolaryngol., vol. 18, no. 3, pp. 294–302, 2014.
 D. Schneider, “Warp-based motion compensation for endoscopic kymography,” in Proc. Eurographics, 2011.