Int. J. Innovative Computing and Applications, Vol. 7, No. 1, 2016

Video summarisation using optimum global threshold technique based on genetic algorithm

Amit Phadikar*
Department of Information Technology, MCKV Institute of Engineering, Liluah, Howrah-711204, West Bengal, India
Email: [email protected]
*Corresponding author

Nishant Kumar
Department of Computer Science, Techxell School of Management and Technology, Jaina More, Bokaro-829301, Jharkhand, India
Email: [email protected]

Baisakhi Sur Phadikar
Cognizant Technology Solutions, Salt Lake, Kolkata, India
Email: [email protected]

Goutam Kumar Maity
Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Liluah, Howrah, West Bengal, India
Email: [email protected]

Abstract: Most methods for video summarisation rely on complicated clustering algorithms that make them too computationally complex for real-time applications. This paper presents an efficient approach for video summary generation that does not rely on complex clustering algorithms and does not require the summary length as a parameter. The present scheme combines colour histogram and edge histogram features with optimum global thresholding to detect key frames. The optimum threshold is selected based on a genetic algorithm (GA) to increase the performance of the proposed system. For each shot, key frames are extracted and similar key frames are eliminated. Experimental results duly support these claims.

Keywords: YCbCr colour space; colour histogram; edge histogram; genetic algorithm.

Reference to this paper should be made as follows: Phadikar, A., Kumar, N., Phadikar, B.S. and Maity, G.K. (2016) 'Video summarisation using optimum global threshold technique based on genetic algorithm', Int. J. Innovative Computing and Applications, Vol. 7, No. 1, pp.1-12.

Biographical notes: Amit Phadikar received his PhD (Engineering) from Bengal Engineering and Science University, Shibpur, India. Currently, he is working as an Associate Professor in the Department of Information Technology, MCKV Institute of Engineering, Liluah, Howrah, West Bengal, India. He has contributed about 50 research papers and two books.

Nishant Kumar is an Assistant Professor in the Department of Computer Science at Techxell School of Management and Technology, Bokaro, Jharkhand, India. His research interests are image and video processing.

Baisakhi Sur Phadikar received her BTech in Computer Science and Engineering in 2005 from West Bengal University of Technology, Kolkata, West Bengal, India. Currently, she is working as a Sr. Associate at Cognizant Technology Solutions, Salt Lake, Kolkata, India.

Goutam Kumar Maity received his PhD (Engineering) from Bengal Engineering and Science University, Shibpur, Howrah, West Bengal. Currently, he is working as an Associate Professor in the ECE Department at MCKV Institute of Engineering, Liluah, Howrah, West Bengal, India. His research interests include embedded systems, optical information processing, optical cryptography and data hiding.

Copyright © 2016 Inderscience Enterprises Ltd.

1 Introduction

The enormous popularity of internet video repository sites like YouTube or Yahoo Video has caused an increasing amount of video content to become available over the internet. In such a scenario, it is necessary to have efficient tools that allow fast video browsing. Such applications require a compact representation of video data, one that provides the user with information about the content of the particular video sequence being examined. In other words, those tools should provide a concise representation of the video content as a sequence of still or moving pictures, i.e., a video summary. A video summary is the abstract of an entire video: the essence of the entire video provided in a shorter period of time. The main purpose of a video summary is to cope with viewing time constraints. Video summarisation plays a major role where resources like storage, communication bandwidth and power are limited. It has several applications in security, military, data hiding (Iliyasu et al., 2013) and even in entertainment domains (Asadi and Charkari, 2012).

Video summarisation can be carried out by different methods. Each method is suitable in its own domain and can thus give variable results based on a number of parameters. There are two main categories of video summarisation (Truong and Venkatesh, 2007), i.e., static video summary and dynamic video skimming. In static video summary methods, a number of representative frames, often called key frames, are selected from the source video sequence and are presented to the viewer. Dynamic video summary methods generate a new, much shorter video sequence from the source video. Since static video summaries are the most common technique used in practical video browsing applications, we focused our research on static video summarisation.
Most of the existing works on static video summarisation are performed by clustering similar frames and selecting representatives per cluster (Mundur et al., 2006; Hadi et al., 2006; De Avila et al., 2011; Furini et al., 2007; Furini et al., 2010). A variety of clustering algorithms are applied, such as Delaunay triangulation (Mundur et al., 2006), K-medoids (Hadi et al., 2006), K-means (De Avila et al., 2011), furthest point first (Furini et al., 2007) and STIMO (Furini et al., 2010), etc. Although they produce acceptable visual quality, most of these methods rely on complicated clustering algorithms applied directly to features extracted from sampled frames. This makes them too computationally complex for real-time applications. Another restriction of these approaches is that they require the number of clusters, i.e., representative frames, to be set beforehand.

A simple approach to extract key frames from a shot is based on frame content change computed from features such as colour histogram (Zhang et al., 1997) or motion activity (Wolf, 1996), etc. Zhu et al. (2004) proposed a hierarchical video summarisation strategy that explores video content structure to provide the users with a scalable, multilevel video summary. In the above methods, a predefined threshold is required to control the number of key frames. Zeinalpour et al. (2009) proposed a novel method for video summarisation using genetic algorithm (GA) (Sarkar and Chakraborty, 2011) and information theory. The scheme relies on mutual information, which provides better results because it extracts the inter-frame information. Doulamis et al. (2000) proposed a video summarisation scheme based on fuzzy representation of visual content. In particular, a multidimensional fuzzy histogram is constructed for each video frame based on a collection of appropriate features, extracted using video sequence analysis. Then key frames are selected optimally by minimising a cross correlation criterion. Chiu et al. (2000) described a genetic segmentation algorithm for video summarisation. For evaluating segmentations, the authors define a similarity adjacency function, which is extremely expensive to optimise by traditional methods. Moreover, the evolutionary nature of GA offers a further advantage by enabling incremental segmentation. Yang and Wei (2011) proposed a GA-based video summarisation scheme for soccer video. The scheme first introduces audio features to improve the fitness function, which is used to calculate the relative differences among all the selected frames. Then the scheme employs crossover and mutation operators to find a meaningful summary in a video search space. Yang and Wei (2012) proposed a video summarisation method based on GA employing crossover and mutation operators to search for a meaningful summary in a video search space. The authors investigate both binary and decimal GA; it is seen that the binary GA finds the optimal result more quickly and easily than the decimal GA. Sony et al. (2011) use Euclidean distance after clustering to obtain summarised frames. This method is based on the removal of redundant frames from a video and returns a user-defined number of unique frames. Visually similar looking frames are clustered into one group using Euclidean distance. After the clusters are formed, the frames that have the largest distance metric are retrieved from each group to form a sequence.

Guang-sheng (2008) proposed a novel algorithm for shot boundary detection and key frame extraction based on image segmentation and an attention model. The matching difference between two consecutive frames is computed with different weights, and shot boundaries are detected based on an automatic thresholding technique. Then a key frame is extracted using a reference frame-based approach. Thakar et al. (2012) proposed a scheme based on the χ2 histogram which is effective in detecting abrupt and gradual transitions. Moreover, a new pixel intensity-based method is also proposed for detecting fade transitions. Angadi and Naik (2012) proposed a shot boundary detection technique based on local colour moments in YCbCr colour space. By the use of YCbCr colour space, the influence of illumination changes and shadows is reduced. Dhagdi and Deshmukh (2012) proposed a new approach for key frame extraction based on block histogram difference and edge matching rate. The scheme provides global information about the video content. Moreover, the scheme is faster without any performance degradation. Eom and Choe (2005) proposed a fast edge histogram generation technique in the discrete cosine transform (DCT) domain based on the properties of AC coefficients. First, they verified the meaning of two AC coefficients. Second, they measured the edge orientation using the ratio of the two AC coefficients. Finally, the edge histogram is generated similar to the edge histogram descriptor (EHD) defined by the MPEG-7 standard. The method is compared with the EHD in complexity and is found to be slightly superior.

The objective of this paper is to design a fast and effective approach for video summary generation that does not rely on complicated clustering algorithms and also does not require the summary length (number of summary frames) as a parameter. The proposed model is based upon colour and edge histograms in YCbCr colour space with optimum global thresholding to detect key frames. The optimum threshold is selected based on GA (Poli, 1996; Piszcz and Soule, 2007) to increase the performance of the system. Experimental results show that there is an improvement of 2.88% in precision rate and 12.39% in recall rate due to the use of GA in the proposed scheme.

The rest of the paper is outlined as follows: in Section 2, colour space and edge histogram are discussed. Section 3 discusses the proposed work. Performance evaluation is discussed in Section 4. Finally, Section 5 presents the conclusion and the scope of future work.

2 Colour space and edge histogram

Frame feature extraction is a crucial part of any key frame extraction algorithm and directly affects the performance of the algorithm. A visual feature that is optimal for video processing applications should satisfy several main requirements: robustness, discriminability, compactness, and low complexity (Cvetkovic et al., 2013). The present scheme uses colour histogram and edge histogram descriptors, as those features are characterised by low computational complexity, very compact representation and invariance to resolution changes.

2.1 YCbCr colour space

Colour histograms have been commonly used for key frame extraction in frame difference-based techniques and are also used for image retrieval (Liu et al., 2013). This is because colour is one of the most important visual features to describe an image (Liu et al., 2013). Colour histograms are easy to compute and are robust against small camera motions (Rajendra and Keshaveni, 2014). It has been observed in the literature that YCbCr colour space always yields better results than other colour spaces for key frame detection (Mishra and Subban, 2014). That is why the present scheme uses YCbCr colour space. Moreover, by the use of YCbCr colour space, the influence of illumination changes and shadows is also reduced (Angadi and Naik, 2012). The difference between YCbCr and RGB is that YCbCr represents colour as brightness and two colour difference signals, while RGB represents colour as red, green and blue. In YCbCr, Y is the brightness (luma), Cb is blue minus luma (B − Y) and Cr is red minus luma (R − Y). This colour space exploits the properties of the human eye: the eye is more sensitive to light intensity changes and less sensitive to hue changes. When the amount of information is to be minimised, the intensity component can be stored with higher accuracy than the Cb and Cr components. The joint photographic experts group (JPEG) file format makes use of this colour space to throw away unimportant information (Kekre et al., 2012). In this paper, the Y component is used for the edge histogram feature and the colour components, i.e., Cb and Cr, are used for the colour histogram feature. RGB images can be converted to YCbCr colour space using equation (1):

  [ Y  ]   [  0.2989   0.5866   0.1145 ]   [ R ]
  [ Cb ] = [ -0.1688  -0.3312   0.5000 ] × [ G ]   (1)
  [ Cr ]   [  0.5000  -0.4184  -0.0816 ]   [ B ]
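To make the conversion concrete, a minimal NumPy sketch of equation (1) is given below. The function name is ours, and it assumes RGB values scaled to [0, 1]; note that equation (1) produces Cb and Cr as signed values centred on zero, whereas many 8-bit implementations add an offset of 128.

```python
import numpy as np

# Rows of the matrix in equation (1): they produce Y, Cb and Cr, respectively.
RGB2YCBCR = np.array([
    [ 0.2989,  0.5866,  0.1145],   # Y  (luma)
    [-0.1688, -0.3312,  0.5000],   # Cb (blue-difference chroma)
    [ 0.5000, -0.4184, -0.0816],   # Cr (red-difference chroma)
])

def rgb_to_ycbcr(frame_rgb):
    """Convert an (H, W, 3) RGB frame with values in [0, 1] to YCbCr."""
    flat = frame_rgb.reshape(-1, 3)        # (H*W, 3) list of pixels
    ycbcr = flat @ RGB2YCBCR.T             # apply equation (1) per pixel
    return ycbcr.reshape(frame_rgb.shape)
```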
2.2 Edge histogram

Edge detection is one of the most commonly used operations in image analysis. Edges define the boundaries between regions in an image, which helps in segmentation (Seixas et al., 2009) and object recognition. The edge histogram is used to match the edges of adjacent frames in order to eliminate redundant frames (Rajendra and Keshaveni, 2014). Commonly used edge detection operators are the Roberts operator, Canny operator, Sobel operator, Prewitt operator and the Laplace operator (Dhagdi and Deshmukh, 2012), etc. To find the edge histogram, the image (f) is first divided into (4 × 4) subimages. The present scheme uses the Canny edge detector, which finds edges based on the local maxima of the gradient of the image f(x, y). The gradient is calculated using the derivative of a Gaussian filter, and the image is smoothed with a Gaussian filter of specified standard deviation to reduce noise. To generate the histogram, edges in the subimages are categorised into five types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal and non-directional edges. Since there are 16 subimages, a total of (16 × 5) = 80 histogram bins are required (Chang et al., 2001; Manjunath et al., 2002).

Figure 1 Flowchart of the proposed scheme

Algorithm 1 Colour histogram

Input: Extracted frames of input video.
Output: Euclidean distances (EDC_b and EDC_r) between two consecutive frames for components Cb and Cr.
Begin
1  For 1st frame do
2    Convert the frame from RGB to YCbCr colour space using equation (1).
3    Calculate the normalised histogram of each component, i.e., Cb and Cr, using equation (2):

       P_i = imhist(f, b) / numel(f)   (2)

     where the symbol 'i' is the frame number, 'f' is the colour difference component, i.e., 'Cb' or 'Cr', 'b' is the number of bins used in forming the histogram (b = 256 for an 8-bit grey scale image), and numel(f) is the number of elements in array 'f' (i.e., the number of pixels in the frame). The function imhist() calculates the histogram of frame 'f' with 'b' bins.
4  End for
5  For second frame onwards do
6    Steps 2 and 3.
7    Calculate the Euclidean distances (EDC_b and EDC_r) between the two normalised histograms using equation (3) for Cb and equation (4) for Cr, respectively, for two consecutive frames:

       EDC_b = ( Σ (P_oldframe_Cb − P_newframe_Cb)^2 )^(1/2)   (3)

       EDC_r = ( Σ (P_oldframe_Cr − P_newframe_Cr)^2 )^(1/2)   (4)

     The symbol EDC_b stores the Euclidean distance between two consecutive frames for the 'Cb' feature, and EDC_r stores the Euclidean distance between two consecutive frames for the 'Cr' feature. The symbol P_oldframe represents the first frame and P_newframe represents the consecutive second frame.
8    Assign the value of P_newframe to P_oldframe, i.e., P_oldframe = P_newframe.
9    Store the values of EDC_b and EDC_r.
10 End for
End
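A compact Python/NumPy sketch of Algorithm 1 follows. The function names are ours, and the code assumes each chroma plane has been shifted into [0, 1] (e.g., Cb + 0.5 after the conversion sketch above); the default of 256 bins mirrors equation (2).

```python
import numpy as np

def normalised_histogram(channel, bins=256):
    """Equation (2): bin counts divided by the pixel count (cf. imhist/numel)."""
    hist, _ = np.histogram(channel, bins=bins, range=(0.0, 1.0))
    return hist / channel.size

def colour_histogram_distances(frames_ycbcr, bins=256):
    """Equations (3) and (4): Euclidean distances EDC_b and EDC_r between the
    normalised Cb and Cr histograms of consecutive frames."""
    edc_b, edc_r = [], []
    prev_cb = normalised_histogram(frames_ycbcr[0][..., 1], bins)
    prev_cr = normalised_histogram(frames_ycbcr[0][..., 2], bins)
    for frame in frames_ycbcr[1:]:
        cb = normalised_histogram(frame[..., 1], bins)
        cr = normalised_histogram(frame[..., 2], bins)
        edc_b.append(np.sqrt(np.sum((prev_cb - cb) ** 2)))
        edc_r.append(np.sqrt(np.sum((prev_cr - cr) ** 2)))
        prev_cb, prev_cr = cb, cr          # P_oldframe = P_newframe (step 8)
    return np.array(edc_b), np.array(edc_r)
```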
3 Overview of the proposed method

We propose an approach which is based on several efficient video processing procedures. At first, video frames are sampled in order to reduce the subsequent computational burden. Then, colour and edge histogram features are extracted from the pre-sampled video frames, and the Euclidean distance measure is used to evaluate the similarity between the frames. Those features are deployed for key frame detection using a thresholding approach. In the present scheme, the optimum threshold is selected based on GA, and a key frame is said to be detected at places where the frame difference is maximal and larger than the global threshold. Then, representative key frames are extracted and similar key frames are eliminated. Figure 1 depicts the flow chart of the proposed scheme. In the rest of this section, a detailed description of the proposed method is presented.

Step 1 Frame extraction: depending on the properties of a video, it may have 24-30 frames per second. First, frames are extracted from the video. The extracted frames generally contain redundant frames (information).

Step 2 YCbCr colour space: the frames extracted in Step 1 are converted to YCbCr colour space using equation (1).

Step 3 Frame feature extraction: frame feature extraction is a crucial part of any key frame extraction algorithm and directly affects the performance of the proposed scheme. In this paper, we have used two features, i.e., colour and edge. Several methods for retrieving images on the basis of the colour feature have been described in the literature, since the colour feature is easy and simple to compute. The colour histogram is one of the most commonly used features for video summarisation as it is invariant to scaling and rotation. Colour histograms of the frames in the Cb (chrominance of blue) and Cr (chrominance of red) components are calculated. The colour histogram is very effective for the classification of frames based on colour. Algorithm 1 shows the steps to find the colour histogram and Algorithm 2 shows the steps to find the edge histogram.

Algorithm 2 Edge histogram

Input: Extracted frames of input video.
Output: Euclidean distance (EDY) between two consecutive frames for component Y.
Begin
1  For 1st frame do
2    Convert the frame from RGB to YCbCr using equation (1).
3    Store the component (Y), i.e., luminance information.
4    Split the image into (4 × 4) non-overlapping rectangular regions.
5    In each region, a (1 × 5) edge histogram is computed (horizontal, vertical, 2 diagonal and 1 non-directional). Say, variable E1 contains the 80 histogram bins.
6  End for
7  For second frame onwards do
8    Repeat Steps 2, 3 and 4.
9    In each region, a (1 × 5) edge histogram is computed (horizontal, vertical, 2 diagonal and 1 non-directional). Say, variable E2 contains the 80 histogram bins.
10   Calculate the Euclidean distance (EDY) between the two edge histograms using equation (5):

       EDY = ( Σ (E1 − E2)^2 )^(1/2)   (5)

     The symbol EDY represents the Euclidean distance of the Y component between two consecutive frames.
11   Assign the value of E2 to E1, i.e., E1 = E2.
12   Store the value of EDY.
13 End for
End
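The sketch below mirrors the layout of Algorithm 2: a (4 × 4) grid of subimages, five bins per region, and the Euclidean distance of equation (5). The paper uses the Canny detector; binning raw gradient orientations, counting weak-gradient pixels as "non-directional" and the threshold value are simplifications of ours.

```python
import numpy as np

def edge_histogram(y, grid=4, bins=5, mag_thresh=0.1):
    """80-bin edge histogram of a luma plane y (values in [0, 1]): each cell of
    a 4x4 grid contributes bins for four orientation ranges (~0, 45, 90 and
    135 degrees) plus one 'non-directional' bin (here: weak-gradient pixels)."""
    gy, gx = np.gradient(y.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0   # orientation in [0, 180)
    h, w = y.shape
    hist = np.zeros(grid * grid * bins)
    for i in range(grid):
        for j in range(grid):
            cell = (slice(i * h // grid, (i + 1) * h // grid),
                    slice(j * w // grid, (j + 1) * w // grid))
            m, a = mag[cell], ang[cell]
            strong = m > mag_thresh                # pixels with a clear edge
            base = (i * grid + j) * bins
            hist[base + 0] = np.sum(strong & ((a < 22.5) | (a >= 157.5)))
            hist[base + 1] = np.sum(strong & (a >= 22.5) & (a < 67.5))
            hist[base + 2] = np.sum(strong & (a >= 67.5) & (a < 112.5))
            hist[base + 3] = np.sum(strong & (a >= 112.5) & (a < 157.5))
            hist[base + 4] = np.sum(~strong)       # crude non-directional bin
    return hist / hist.sum()                       # normalise for comparability

def edge_distance(e1, e2):
    """Equation (5): Euclidean distance EDY between two 80-bin histograms."""
    return np.sqrt(np.sum((e1 - e2) ** 2))
```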
Step 4 Optimised threshold selection using GA: key frames are identified based on visual content change. Therefore, the most critical activity in the key frame detection process is the selection of the thresholds. In this paper, global thresholding with GA is used for detecting key frames. We have used precision (P) and recall (R) to develop the objective function, as both range from 0 (zero) to 100 and one does not bias the other. The objective is to maximise the fitness function. The normalised fitness function f_n is defined as:

  f_n = (P + R) / 2   (6)

The precision (P) is defined as the ratio of correctly detected key frames to the sum of correctly and falsely detected key frames of a video. The recall (R) is defined as the ratio of detected key frames to the sum of detected and undetected key frames:

  P = Number of relevant frames retrieved / Total number of frames retrieved   (7)

  R = Number of relevant frames retrieved / Total number of relevant frames   (8)

Algorithm 3 shows the steps for threshold selection using GA.

Algorithm 3 Genetic algorithm

Input: Euclidean distances (EDC_b, EDC_r and EDY) between two consecutive frames for components Cb, Cr and Y from Algorithm 1 and Algorithm 2, respectively.
Output: Global thresholds (TY, TC_b and TC_r) for the components Y, Cb and Cr.
Begin
1  Formulate the initial population randomly. The initial population contains the initial values of TY, TC_b and TC_r, which are randomly selected within the range [0, 1].
2  Repeat
3    Evaluate the objective function (f_n) as defined in equation (6).
4    Apply genetic operators:
5      Selection
6      Crossover
7      Mutation
8  Until the stopping criterion is met (i.e., the optimum solution is found).
9  The optimum solution set, i.e., TY, TC_b and TC_r, is obtained, where TY is the threshold for the Y component, TC_b is the threshold for the Cb component, and TC_r is the threshold for the Cr component.
End

Step 5 Detection of key frames: the proposed model is based on the colour histogram and edge histogram features. Given a video containing many frames, the colour histogram and edge histogram of each frame are computed, and the Euclidean distance measure is used to measure the dissimilarities between the frames, based on the threshold that is selected by GA. A key frame is said to be detected if the dissimilarity between the frames is higher than the threshold value. Algorithm 4 shows the steps for key frame detection.

Algorithm 4 Key frame detection

Input: Euclidean distances (EDY, EDC_b and EDC_r) between two consecutive frames for components Y, Cb and Cr from Algorithm 2 and Algorithm 1, respectively, and global thresholds (TY, TC_b and TC_r) of components Y, Cb and Cr from Algorithm 3.
Output: Key frames.
Begin
1  If EDC_b > TC_b and EDC_r > TC_r and EDY > TY then
2    Select the frame as a key frame.
3  End if
End
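A minimal real-coded GA sketch of Algorithms 3 and 4 is given below. The paper does not spell out its selection, crossover or mutation operators, so truncation selection, uniform crossover and Gaussian mutation are our assumptions, as is the representation of the ground truth as a set of 0-based frame indices; P and R are computed here on a 0-1 scale rather than 0-100, which does not change the optimiser's ranking.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_keyframes(edc_b, edc_r, edy, t):
    """Algorithm 4: frame i+1 is a key frame when all three inter-frame
    distances exceed the thresholds t = (TY, TCb, TCr)."""
    ty, tcb, tcr = t
    return set(np.where((edy > ty) & (edc_b > tcb) & (edc_r > tcr))[0] + 1)

def fitness(t, dists, truth):
    """Equation (6): f_n = (P + R) / 2 against the ground-truth key frames."""
    detected = detect_keyframes(*dists, t)
    if not detected:
        return 0.0
    hits = len(detected & truth)
    return (hits / len(detected) + hits / len(truth)) / 2.0

def ga_thresholds(dists, truth, pop_size=30, gens=50, pm=0.2):
    """Algorithm 3: evolve the threshold triple (TY, TCb, TCr) in [0, 1]."""
    pop = rng.random((pop_size, 3))                      # step 1: random init
    for _ in range(gens):
        scores = np.array([fitness(t, dists, truth) for t in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(3) < 0.5, a, b)  # uniform crossover
            if rng.random() < pm:                        # Gaussian mutation
                child = np.clip(child + rng.normal(0.0, 0.05, 3), 0.0, 1.0)
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(t, dists, truth) for t in pop])
    return pop[np.argmax(scores)]                        # best (TY, TCb, TCr)
```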
4 Performance evaluation

This section presents the results of the experiments conducted to confirm the success of the proposed model. The experimentation is conducted on a set of YouTube videos and videos from the open video project. Detailed descriptions of these videos are provided in Table 1; they vary in duration from 901 frames to 2,938 frames. In this experiment, all the test videos are abrupt transition videos, and Figure 2 shows the frames of one of the test videos. The experiments are conducted on a Pentium IV, 2.80 GHz processor with 512 MB of RAM using MATLAB 7.

Table 1 Description of videos used for evaluation

Video name | No. of key frames | Key frame numbers | Total no. of frames | Indoor/outdoor | Camera motion | Persp. changes | Bright. changes
Video 1 | 8 | 1, 126, 214, 354, 519, 539, 625, 738 | 901 | Outdoor | Yes | Yes | Yes
Video 2 | 20 | 1, 132, 189, 254, 276, 299, 456, 491, 534, 778, 989, 1,243, 1,439, 1,587, 1,745, 1,952, 2,136, 2,467, 2,654, 2,879 | 2,908 | Outdoor | Yes | Yes | Yes
Video 3 | 18 | 1, 89, 187, 376, 573, 623, 811, 934, 1,089, 1,365, 1,524, 1,679, 1,821, 1,956, 2,298, 2,445, 2,786, 2,911 | 2,938 | Outdoor | Yes | Yes | Yes

4.1 The gold standard

The results of the proposed method are compared with the ground truth agreed by multiple human judges. The goals of creating the ground truth are to:
1  create a reference database of video
2  identify a foundation by which automated algorithms can be compared.

To establish the ground truth, human judges were asked to independently surf the videos and provide the key frames. The key frames estimated by the judges were reviewed in a group meeting with a final judge to derive the final key frames for each video. In a practical scenario where the ground truth is not available, the number of key frames will be defined by users.

Figure 2 Sampled input video frames (Video 1) (see online version for colours)

Figure 3 represents the component-wise (components of YCbCr colour space, i.e., Y, Cb and Cr) Euclidean distance between successive frames. The horizontal axis represents the frame number, whereas the vertical axis represents the normalised frame distance. Frames at a high distance are tagged as key frames. Figure 4(a) shows the best fitness at each generation. Figure 4(b) shows the average distance between individuals at each generation, which is a good measure of the diversity of a population in GA-based threshold detection.

Figure 5 shows the key frames extracted by our proposed algorithm. Figures 6(a) and 6(b) show the summarised video for Video 2 and Video 3, respectively. From Figures 5 and 6 it is seen that the scheme can effectively select the key frames, which ultimately returns good video summarisation results. The results indicate that our approach is among the best with respect to the ground truth included in the video summary. Table 2 shows the performance of the proposed scheme in terms of precision and recall. It is seen that the precision and recall values are quite large for the proposed scheme. This is due to the use of GA to find the optimum threshold. Table 3 shows the performance if only the colour feature and an un-optimised global threshold are used for video summarisation. It is seen that there is an improvement of 2.88% in precision rate and 12.39% in recall rate due to the combined use of colour and edge histograms and GA in the proposed scheme.

In Table 4, we have compared the proposed work with existing works found in the literature. Table 4 shows that the scheme offers better results than the others. This is due to the use of GA to find the optimum value of the threshold for key frame selection. The schemes described in Thakar et al. (2012), Angadi and Naik (2012) and Dhagdi and Deshmukh (2012) use only one type of frame feature, i.e., χ2 histogram, colour moments and edge matching information, respectively, for key frame detection. Those single features alone are not sufficient to satisfy all the requirements described earlier, i.e., robustness, discriminability, compactness, and low complexity. So those schemes return poor performance in terms of R and P. Moreover, none of those schemes use GA to select the optimum threshold for identifying key frames. This is another reason for their poorer values of P and R compared with the proposed scheme.

Figure 3 Normalised frame distances between successive frames (Video 1), panels (a)-(c) (see online version for colours)

Figure 4 Best fitness and distance plot (Video 1), panels (a) and (b)

Figure 5 Summarised video frames (Video 1) (see online version for colours)
Note: The algorithm shows eight frames, i.e., frames 1, 126, 214, 354, 519, 539, 625, 738.
Table 2 Performance of the proposed scheme with colour histogram, edge histogram and GA

Video | Size | No. of frames tested | Precision (%) | Recall (%)
Video 1 | 25 MB | 901 | 100 | 100
Video 2 | 4.83 MB | 2,908 | 90 | 95
Video 3 | 14.26 MB | 2,938 | 95 | 96

Table 3 Performance measure with only colour histogram and without GA

Video | Size | No. of frames tested | Precision (%) | Recall (%)
Video 1 | 25 MB | 901 | 95.72 | 80.00
Video 2 | 4.83 MB | 2,908 | 89.90 | 87.10
Video 3 | 14.26 MB | 2,938 | 91.40 | 91.80

Table 4 Comparison of proposed and existing work

         | A  | B  | C  | D  | E
R (in %) | 97 | 90 | 88 | 95 | 93
P (in %) | 95 | 92 | 75 | 91 | 92

Source: A: Proposed scheme; B: Guang-sheng (2008); C: Thakar et al. (2012); D: Angadi and Naik (2012); E: Dhagdi and Deshmukh (2012)

5 Conclusions

In this paper, we have proposed an efficient method for video summary generation based on GA. Experimental results on standard YouTube videos and on open video project videos show that the proposed scheme offers satisfactory performance for key frame detection in terms of recall (R) and precision (P). However, the proposed scheme concentrates only on abrupt transitions of videos and not on gradual ones. Abrupt shot boundaries are created by simply attaching one shot to another. On the other hand, gradual transitions result from the editing effects applied to the shots during the attachment operation. Depending on the editing effect, gradual transitions can be further divided into different sub-types, i.e., dissolve, fade (fade in, fade out) and wipes. The limitation of our proposed method is that it does not support gradual shots, yet most videos contain gradual transitions.

Future work can concentrate on further performance improvement of the proposed scheme by selecting an adaptive threshold based on GA and by the combined use of motion, edge and colour features to increase the efficiency of key frame detection, as well as on dealing with gradual transitions such as dissolves, fades (fade in, fade out) and wipes.