SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 1 Texture Retrieval via the Scattering Transform arXiv:1501.02655v1 [cs.IR] 12 Jan 2015 Alexander Sagel, Dominik Meyer, Student Member, IEEE and Hao Shen, Member, IEEE Abstract—This work studies the problem of content-based image retrieval, specifically, texture retrieval. Our approach employs a recently developed method, the so-called Scattering transform, for the process of feature extraction in texture retrieval. It shares a distinctive property of providing a robust representation, which is stable with respect to spatial deformations. Recent work has demonstrated its capability for texture classification, and hence as a promising candidate for the problem of texture retrieval. Moreover, we adopt a common approach of measuring the similarity of textures by comparing the subband histograms of a filterbank transform via the Kullback-Leibler divergence. Despite the popularity of describing histograms using parametrized probability density functions, such as the Generalized Gaussian Distribution (GGD), it is unfortunately not applicable for describing most of the Scattering transform subbands, due to the complex modulus performed on each one of them. In this work, we propose to use the Weibull distribution to model the Scattering subbands of descendant layers. Our numerical experiments demonstrated the effectiveness of the proposes approach, in comparison with several state of the arts. Index Terms—Content-based image retrieval, feature extraction, Kullback-Leibler divergence, Scattering transform, similarity measure, Weibull distribution. I. I NTRODUCTION A. Overview ONTENT-BASED image retrieval (CBIR) is a special case of image classification. It can be viewed as the process of assigning a query image to a set of image classes, where each class represents a database image. Content-based hereby refers to the mode of formulating the search query. As opposed to metadata-based image search, which relies on prelabeling the images beforehand, a CBIR system retrieves the n best matches within the database with respect to the visual similarity to the query image. In CBIR, the concepts of feature extraction (FE) and similarity measure (SM) play crucial roles. FE converts an image into a feature vector of numerical values with the aim to produce a low-dimensional, yet sensible representation of the input image in the context of some particular application. An SM assigns a numerical value to a pair of two feature vectors. In this work, we assume that the SM is nonnegative and that a smaller SM value indicates a higher similarity and vice versa. A typical CBIR system is depicted in Fig. 1. The images to be extracted are processed via an FE algorithm and the extracted feature vectors (signatures) are stored in a database. C The authors are with the Department of Electrical, Electronic and Computer Engineering, Technische Universit¨at M¨unchen, M¨unchen, Germany. e-mail: {a.sagel,dominik.meyer,hao.shen}@tum.de. This work has partially been supported by the Cluster of Excellence CoTeSys – Cognition for Technical Systems, funded by the Deutsche Forschungsgemeinschaft (DFG) and International Graduate School of Science and Engineering (IGSSE), Technische Universit¨at M¨unchen. Signature database Input images FE DB Output images Query image FE Fig. 1. SM A typical architecture of content-based image retrival. The query image is fed to the same FE algorithm and the output is compared with all the feature vectors in the database by means of the defined SM. As a result, the images with the lowest SM values are returned. Consequently, designing a CBIR system boils down to the construction of an FE/SM framework. Nowadays, CBIR is employed to deal with big databases and vast amounts of data. Thus, in additional to high retrieval efficiency, execution speed is of great concern. This requires sufficient dimensionality reduction within the FE and a computationally efficient SM. B. Related Work and Our Contribution Extracting sensible information from images has been a field of research in digital signal processing for roughly half a century now. Naturally, the choice of the FE method depends on the specific application. For instance, when comparing images of landscapes, pictures taken in a forest will have different color distributions from those taken in a desert, so an FE based on the color properties of the image is a reasonable approach. On the other hand, images of objects are strongly characterized by shapes, so in this case the FE could rely on extracted contours. This work focuses on CBIR for textures. The earliest approaches for texture FE, such as the graylevel co-occurrence matrices (GLCM) [1] and the biologically motivated proposed Tamura features [2] have been presented in the 1970s and are often based on numerical measures for how nearby pixels relate to each other. More recent approaches include incorporating spatialfrequency representations of the images [3], such as the Fast Wavelet Transform (FWT) [4], Gabor frames [5] or Curvelet frames [6]. A conventional approach is designing the features based on the energy of the respective transform coefficients, but it can be beneficial to look at their statistical properties SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING [7], [8]. Thus, FE approaches based on the histograms of filterbank transforms have become more prominent, cf. [9]– [14]. Techniques of this kind are referred to as Subband Histogram methods in the work. Linear filterbank transforms feed the input image to a bank of frequency-selective filters, yielding a set of band-pass signals as a representation. One problem of constructing FE based on such a decomposition is that the higher frequency subbands are prone to deformations in the spatial domain. Thereby, even slight deformations of an image will yield significantly different transform coefficients. However, simply neglecting the higher-frequency subbands is not desirable since they carry important information about such distinctive features as edges and corners. As a remedy, Mallat introduced the Scattering transform [15], a non-linear signal representation involving filterbanks and the modulus operation which transforms highfrequency components into low-pass representations. The Scattering transform appears to perform well in texture classification tasks [16]–[18]. The key contribution of this work is to define a statistical model in order to design subband histogram FE/SM for texture images based on their Scattering transform coefficients. The paper is organized as follows. Section II provides necessary mathematical preliminaries of the problem of CBIR, subband histogram methods, and the basics of the Scattering transform. In Section III we develop a statistical framework of CBIR based on the Scattering transform. In Section IV several numerical experiments are presented to investigate the performance of the proposed approach in comparison with the state of the art methods. II. M ATHEMATICAL P RELIMINARIES 2 The likelihood of a set of random realizations {x1 , ..., xN } for a PDF p is defined as L(x1 , ..., xN |p) = N Y p(xi ), (1) i=1 and is a measure for the probability that {x1 , ..., xN } is subjected to p. That is to say, L(x1 , ..., xN |pd ) can be viewed as a measure of similarity between the query image and the database images. Since the natural logarithm ln is a monotone function, this is also true for the log-likelihood ln L. Let A denote the set of the PDFs of a number of database images. The best match for a query {x1 , ..., xN } can thus be determined via p∗d = arg max ln L(x1 , ..., xN |pd ) pd ∈A = arg max pd ∈A N X (2) ln pd (xi ). i=1 Expression (2) is called a Maximum Likelihood (ML) solution. The Kullback-Leibler-Divergence (KLD) between two PDFs p and q with the support S is defined as Z p(x) dx. (3) D(pkq) = p(x) ln q(x) S Suppose that {x1 , ..., xN } are subjected to the PDF pq . Then it can be derived from the law of large numbers, that minimizing for the KLD between pq and pd yields the same results as maximizing for the (log-)likelihood of {x1 , ..., xN } for pd , provided that N is sufficiently large. Thus, (2) can be reformulated as p∗d = arg min D(pq kpd ). (4) pd ∈A A. Notations Unless stated otherwise, a signal denotes an element of the Lebesgue space L2 (R2 ). The terms almost everywhere and almost all refer to conditions which hold for all R2 \N , where N can be any null set. For the sake of simplicity, we refer to L2 (R2 ) as a Hilbert space of functions, even though it would be more precise to call them equivalence classes of functions that are equal all most everywhere. Bold-faced lowercase letters x or xi describe vectors, while regular lowercase letters like x or xi describe scalar values. Depending on the context, uppercase letters stand either for scalar values or for matrices. For a complex value z, z¯ denotes its conjugate, and <(z), =(z) the real and imaginary part, respectively. Angle brackets h·, ·i denote the scalar product of L2 (R2 ) and k · k the canonical norm induced by it. An asterisk denotes the convolution f ∗ g of two signals f, g. B. Subband Histogram Methods The coefficients {x1 , ..., xN } produced by a transform of the query image can be viewed as a set of realizations of a random experiment. Let pd describe the probability density function (PDF) of the transform coefficients of some database image. The authors of [9] propose features based on the statistical moments of the coefficients generated by an FWT of a texture image. To this end, each FWT subband is modeled by a twoparameter GGD. The FE consists of obtaining the two GGD parameters for each subband. Since the KLD of two GGDs can be directly computed from the two GGD parameters, for every pair of images, the KLD for each subband can be estimated directly from the feature vectors. Summing up all the subband KLDs for a pair of images yields an SM. The summation is justified by the fact that the KLD of two joint distributions of statistically independent variables is the sum of the KLDs belonging to the respective marginal distributions. The described SM is hence the estimated KLD between joint distributions for all the subbands produced by an image. In ML based CBIR, the KLD was employed as a measure of probability that a match is correct. However, modeling the subband histograms with a parametrized PDF can be also motivated bilogically. In fact, it has been observed [19] that the gray-value distributions of the filterbank outputs of a texture correspond to perceivable texture features. This gives rise to a variety of similar FE/SM approaches. In particular, replacing the FWT by any other filterbank transform might yield more descriptive features. Furthermore, the KLD could be replaced by any of the vast number [20] of SMs for PDFs. For instance, SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 3 the symmetrized KLD between two distributions p and q is defined as D(qkp) + D(pkq) , 2 and has the advantage that the symmetry property Dsym (pkq) = Dsym (pkq) = Dsym (qkp) (5) Sφ,ψ,J,R [f ; p] = φJ ∗ |ψjm ,Rm ∗ | · · · ∗ |ψj1 ,R1 ∗ f | · · · ||. (13) Let us denote by PM the (ordered) set of all possible paths p of size 0 to M . We can then define the Finite-path WST as (6) always holds. C. The Scattering transform 1) Definition: Let θ ∈ L2 (R2 ) be a rotationally symmetric window function with low-pass characteristics. Let η ∈ R2 \ {0} and J ∈ Z be fixed and R ⊂ SO(2) a finite group | of rotation matrices. With ψ(x) = θ(x)eiη x , we define the wavelet ψj,R as ψj,R (x) = 4−j ψ(2−j Rx), j ∈ {J, J − 1, ...}, R ∈ R. (7) Further, let us assume a low-pass and rotationally symmetric scaling function [21] φ ∈ L2 (R2 ) and define φJ (x) = 4−J φ(2−J x), 2. The WST along the path p = ((j1 , R1 ), ..., (jm , Rm )) of scaling factors and rotations is defined as (8) Sφ,ψ,J,R [f ; PM ] = {Sφ,ψ,J,R [f ; p]}p∈PM . 2) Properties: The output signals of the Finite-path WST are the elements of (14), also referred to as WST subbands in the following, which are all low-pass signals due to the filtering with φJ . With increasing M , Sφ,ψ,J,R [f ; PM ] captures more and more energy of the input signal f and the WST representation becomes more expressive as the maximum path length grows. Let the Scattering norm be defined as X 2 kSφ,ψ,J,R [f ; p]k2 . (15) kSφ,ψ,J,R [f ; PM ]kS = p∈PM As a consequence of the tight frame property (9) of D, the infinite-path WST is unitary, i.e. kSφ,ψ,J,R [f ; P∞ ]kS = lim kSφ,ψ,J,R [f ; PM ]kS M →∞ = kf k. ˆ φ, ˆ such that for the respective Fourier transforms ψ, ˆ J ω)|2 + |φ(2 J X X ˆ J Rω)|2 = 1 |ψ(2 (9) j=−∞ R∈R holds, for almost all ω ∈ R2 . Then the set D = {φ¯J (u − x)}u∈R2 ∪ {ψ¯j,R (u − x)}u∈R2 ,j∈Z∧j≤J,R∈R (10) constitutes a Parseval-tight frame [21]. It will be assumed to span L2 (R2 ) in the following. Note that for two signals f, g, the equation (f ∗ g)(u) = hf (x), g¯(u − x)i (11) holds. At the core of the Windowed Scattering transform (WST) [15] is a dyadic wavelet decomposition Uφ,ψ,J,R [f ; j, R] of the input signal f ∈ L2 (R2 ) with the complex modulus performed on the band-pass components, defined as ( |ψj,R ∗ f |, j ≤ J, Uφ,ψ,J,R [f ; j, R] = (12) φJ ∗ f, j > J. The modulus operation | · | traverses some of the energy of the band-pass signals towards lower frequencies. Therefore, Uφ,ψ,J,R can be applied to the output signals |ψj,R ∗ f | again. This yields a tree structure like the one depicted in Fig. 2. Note that (12) yields an infinite (but countable) number of output signals. However, in practice we deal with bandlimited input signals f . Hence, there exists an integer Jl with Jl < J, such that Eq. (12) needs to be evaluated for j ≥ Jl only, which corresponds to a tree with finite breadth. Basically, the idea of the WST is to apply Uφ,ψ,J,R successively to the input signal and only keep the low-pass signals, i.e. to neglect signals represented by the black nodes in Fig. (14) (16) In practice, maximum path depths of M = 3 are expected to capture all essential information [22], [23]. Actually reconstructing f from its WST involves a phase recovery problem. However, it is known [24] that digital realizations of wavelet representations similar to those employed in the WST are uniquely determined by their complex modulus. The WST is non-expansive [15], i.e. for two signals f1 , f2 , it can be shown [15] that for any positive integer M , kSφ,ψ,J,R [f1 ; PM ] − Sφ,ψ,J,R [f2 ; PM ]k2S X = kSφ,ψ,J,R [f1 ; p] − Sφ,ψ,J,R [f2 ; p]k2 (17) p∈PM ≤kf1 − f2 k2 holds, implying that the WST is robust with respect to additive noise. Since WST is a representation consisting solely of lowpass signals, it is stable with respect to small spatial translations and deformations. Specifically, let f˜ be a deformed or translated version of f . Under certain mild assumptions about the underlying wavelet frame, it has been proven that the error kSφ,ψ,J,R [f ; PM ] − Sφ,ψ,J,R [f˜; PM ]k is bounded asymptotically, cf. [15]. Note, that we are given some freedom of choice in the wavelet frame which allows for considerable flexibility in terms of parameters such as frequency selectivity or directionality. Due to its stability and flexibility, the WST is a convenient signal representation and can be used as the foundation for the FE from images. 3) Normalized WST: In order to reduce redundancy and to increase invariance to distortions, the Normalized WST (NWST) [23] was introduced. Let us define p0 to be the “empty” path corresponding to the layer m = 0 and p˜ ∈ PM to be the predecessor of SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 4 f Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Uφ,ψ,J,R Fig. 2. WST tree produced by successive application of Uφ,ψ,J,R on the input signal f . Lowpass signals (The WST subbands) are depicted as white nodes. The complex moduli of band-pass signals are depicted as black nodes. p ∈ PM , i.e. p = ((j1 , R1 ), ..., (jm , Rm )) implies p˜ = ((j1 , R1 ), ..., (jm−1 , Rm−1 )). The NWST is defined as ( φJ ∗ f if p = p0 , ¯ Sφ,ψ,J,R [f ; p] = Sφ,ψ,J,R [f ;p] (18) otherwise. |Sφ,ψ,J,R [f ;p]| ˜ In words, each subband of the WST is normalized by the respective parent subband. In practice, a small constant is added to the denominator in order to avoid division by zero. III. S TATISTICAL S CATTERING CBIR A. Subband Modeling We propose to model the gray-value distributions of the different WST subbands with parametrized PDFs and describe the images in terms of their respective parameters to obtain a complete FE mechanism on top of the WST. 1) Root layer: We refer to the sets of WST subbands belonging to the same path length as layers. The root layer, i.e. the layer with m = 0 consists solely of the signal Sφ,ψ,J,R [f ; p0 ] = φJ ∗ f. (19) The low-pass filter φJ serves as a blurring filter, implying that φJ ∗ f is a low-resolution version of f . Hence, the gray-value distribution of φJ ∗ f is expected to have a shape similar to the one of f . Typically, histograms of natural images can be modeled by multimodal PDFs. As a less elaborate alternative which is widely used as a model for bandpass components, cf. [25], we propose to utilize the GGD, i.e. pGGD (x|α, β) = β β e−(|x|/α) , 2αΓ(1/β) where Γ denotes the Gamma-Function, Z ∞ Γ(t) = xt−1 e−x dx, (20) (21) 0 and the parameter α ∈ R+ determines the scale of the distribution while β ∈ R+ describes the shape. Despite the simplicity of the GGD, its two parameters provide effective measures of perceivable visual features such as contrast. The GGD parameters α and β can be estimated as a solution of an ML problem. Detailed derivation of a numerical algorithm for estimating the parameters (α, β) is given in Appendix A. 2) Descendant layer: The layers m ≥ 1 contain signals of the form Sφ,ψ,J,R [f ; p] = (22) φJ ∗ |ψjm ,Rm ∗ | · · · ∗ |ψj1 ,R1 ∗ f | · · · ||, p 6= p0 . Clearly, the modulus eliminates all negative values which is why a symmetric model such as the GGD is not appropriate anymore. The histograms of signals from descendant layers suggest a PDF that occupies only positive values and can describe a skew to the left, which makes the Weibull Distribution (WD) a nearby choice. Fig. 3 exemplarily shows histograms of WST subbands of different textures with a fitted WD curve. Despite variations in skewness and spread, the WD fittings describe the actual histograms fairly well. The PDF of the WD is defined for x ≥ 0 as k pWD (x|λ, k) = λk · (λx)k−1 e−(λ·x) . + (23) Analogous to the GGD, λ ∈ R is the scale parameter, whereas k ∈ R+ determines the shape. The above argument is purely empirical. The following discussion aims to justify further the choice of the WD as a model for WST subbands (22) in the layers m ≥ 1: Like for the layer m = 0, we neglect the impact of the low-pass filter φJ . The approximately analytical band-pass filters ψj,R produce signals with similar zero-mean distributions in their real and imaginary parts. Moreover, it is known that bandpass components of a natural image can be well modeled by a GGD. It is therefore natural to assume the GGD with the same parameters for the distribution of the real and imaginary parts of the band-pass signals ψjm ,Rm ∗ | · · · ∗ |ψj1 ,R1 ∗ f | · · · | involved in the WST. By neglecting statistical dependence between real and imaginary components, let us define a “complex SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING bark2, Layer 2 bark2, Layer 1 10 B. Scattering Similarity 50 5 0.05 0.1 0.15 0.2 floor1, Layer 1 0 0 0.01 0.02 0.03 0.04 0.05 floor1, Layer 2 20 0.05 0.1 0.15 0.2 floor2, Layer 1 0.05 0.1 0.15 pebbles, Layer 1 0 0.01 0.02 0.03 0.04 0.05 floor2, Layer 2 5 0 0.01 0.02 0.03 0.04 0.05 pebbles, Layer 2 0.05 0.1 0.15 0.2 0.25 glass2, Layer 1 0 0.02 0.04 0.06 glass2, Layer 2 0.08 50 50 10 0 50 10 50 10 0 water, Layer 2 water, Layer 1 10 50 5 0 5 10 5 0 0.05 0.1 0.15 0.2 carpet1, Layer 1 0 0 0.05 0.02 0.04 0.06 carpet1, Layer 2 0.1 0.15 0.2 curdory, Layer 1 0 0.02 0.04 0.06 curdory, Layer 2 10 5 0 20 0.1 0.2 0.3 0 20 5 0 0.1 0.02 0.04 0.06 0.08 1) PDF Similarity with parametrized KLD: The extracted GGD and WD parameters constitute a feature vector for each image. Following the approach from [9], we utilize the KLD as defined in (3) as a measure of similarity between two PDFs. The equation (3) can be further simplified when evaluating the KLD for two GGDs, p(x) = pGGD (x|α1 , β1 ), q(x) = pGGD (x|α2 , β2 ), or two WDs, p(x) = pWD (x|λ1 , k1 ), q(x) = pWD (x|λ2 , k2 )). With the application of some standard integration rules, the KLD for two GGDs amounts to 0.2 0.3 0 0.02 0.04 0.06 0.08 1 β1 α2 Γ(1/β2 ) − β2 α1 Γ(1/β1 ) β1 (27) β2 α1 Γ((β2 + 1)/β1 ) + , α2 Γ(1/β1 ) DGGD (α1 , β1 kα2 , β2 ) = ln Fig. 3. Histograms (blue) and respective ML Weibull fittings (red) of WST subbands from ayers m = 1 and m = 2 for different texture patches from the UIUC texture database GGD” with statistically independent real and imaginary parts as pCGGD (z|α, β) = pGGD (<(z)|α, β)pGGD (=(z)|α, β). (24) In order to determine the PDF for the modulus x = |z| ≥ 0, we need to fix |z| and integrate pCGGD over the circles centered at 0 in the complex plane: Z pCGGD,abs (x|α, β) = pCGGD (z|α, β)dz |z|=x = β αΓ(1/β) 2 Z x π 2 e−x β (25) ((cos ϕ)β +(sin ϕ)β )/αβ dϕ 0 Evaluating the integral analytically is demanding, if not impossible. However, it is known that the modulus of a complex variable with statistically independent and Gaussian distributed real and imaginary parts abides the Rayleigh distribution, √ which can be easily verified by substituting α = 2σ and β = 2 into Eq. (25), i.e. √ 2 x 2 pCGGD,abs (x| 2σ, 2) = 2 ex /(2σ ) . σ (26) The usage of the GGD for modeling FWT subbands was motivated by the need to model histograms which are more or less peaked in shape than it is the case for the Gaussian distribution. For this purpose, the additional shape parameter β was introduced. This corresponds to introducing an additional shape parameter to the Rayleigh distribution. Altering the peakedness of the real and imaginary part corresponds roughly to altering the skewness of the respective modulus distribution. From Eq. (25), we can observe that pCGGD,abs (0|α, β) = 0, so a good alternative to pCGGD,abs for modeling the WST subbands would be a two-parameter generalization of the Rayleigh distribution which is controllable in its skewness without changing its value at 0. For k > 1, these requirements are fulfilled by the WD. Similarly, we employ a numerical approach to estimate the Weibull parameters via minimizing the corresponding ML function. The derivation of the algorithm can be found in Appendix B. as indicated in [9]. With γ ≈ 0.57722 being the EulerMascheroni constant, it can be derived that the KLD between two WDs is given by ! k1 λk22 −1 DWD (λ1 , k1 kλ2 , k2 ) = ln k2 λk11 k2 λ1 k2 (28) + Γ +1 λ2 k1 γ + (k1 − k2 ) ln λ1 − , k1 see [26] for details. 2) KLD Sum: An SM for two images I1 , I2 can be defined by summing up all the KLDs for the layer m = 0 (GGD) and the remaining layers 1 ≤ m ≤ M (WD). To this end let us assume an indexing i ∈ {0, ..., N } of all the subbands of a WST with i = 0 denoting the subband at the layer m = 0. Let us further denote the probability that the gray value xi appears at a pixel in the subband i as p1,i (xi ) for image I1 and as p2,i (xi ) for image I2 . The joint probabilities that a combination (x1 , ..., xN )> of gray values appears for the whole WST representation shall be given by p1 (x0 , ..., xN ) and p2 (x0 , ..., xN ), respectively. Let us assume statistical independence of the WST subbands, i.e. pn (x0 , ..., xN ) = N Y pn,i (xi ), n ∈ {1, 2}. (29) i=0 Then, the expression DW ST (I1 kI2 ) = D(p1 kp2 ) = D(p1,0 kp2,0 ) + N X D(p1,i kp2,i ) i=1 = DGGD (α1 , β1 kα2 , β2 ) + N X (30) DWD (λ1,i , k1,i kλ2,i , k2,i ) i=1 describes the KLD between p1 and p2 , where the last equation of (30) expresses the KLD in terms of the GGD and WD parameters of the different subbands, extracted from the images I1 , I2 . SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING DWF+GGD 78.12% C. Implementing the Framework The two previous sections provide the theoretical basis for building a complete CBIR based on WST Subband histograms. In order to extract the features, the WST (Section II-C) is performed on each image. Each subband is then subjected to ML to extract the GGD parameters for layer m = 0 and the WD parameters for all the other layers. If the number of subbands is N , this amounts to feature vectors of size 2N . In order to compare the query feature vector to the feature vectors in the database, the parametrized KLDs are evaluated for each subband via (27) and (28). Their sum (30) yields the SM. As most of these steps can be performed independently of each other, the whole system is well suited to be implemented in a parallel manner, for example in applications for Big Data. IV. N UMERICAL E XPERIMENTS A. Experimental Settings 1) Implementation: All experiments were performed in a Matlab 2014a environment. The code reproducing the key results is available online1 . In addition to our approach, the FWT + GGD method was implemented according to [9] with two minor changes in order to further increase performance. First, the low-pass components were included in the FE. Second, the symmetrized version (5) of the KLD was employed. As a further development of the FWT + GGD method, we implemented the Dual-Tree Complex Wavelet Transform (DT-CWT) + GGD approach which is almost identical to the former, with the exception that instead of FWT subbands, the real parts of the DT-CWT [27] subbands, obtained via the DT-CWT Transform Pack2 , version 4.3, were fed to the ML estimator. For our approach, the WST was performed by means of the ScatNet3 toolbox, version 0.2. The Weibull parameters were extracted via wblfit from the Statistical Toolbox. Likewise, the similarity was measured via the symmetrized KLD. 2) Choice of parameters: Along with the regular WST, all experiments were also conducted with its normalized version as defined in (18). The directionality parameter, corresponding to the cardinality of the rotation group R, is fixed to be L = 4. All computations were performed with double precision and the subbands of the WST are critically sampled according to the Shannon-Nyquist sampling theorem. The bandwidth of the low-pass filter φJ is chosen in a way such that it produces signals of size 16 × 16 = 256, in accordance with [9]. The WST is evaluated for the maximum path lengths M = 2 and M = 3. In theory, a linear increase in M leads to an exponential increase in the number of subbands. However, since the modulus operation hardly traverses any energy toward higher frequencies, they become negligible with increasing path length. Apart from the mentioned, default settings of ScatNet were used. In particular, the Morlet Wavelet was used as the band-pass filter ψ on each layer of the WST. For the FWT + GGD experiments, the 4-tap orthogonal Daubechies filter, cf. [21], was employed. The DT-CWT transform was run with the options antonini and qshift_a. In both 1 https://www.ldv.ei.tum.de/uploads/media/texture 6 FWT+GΓD 78.40% WD-HMM 80.05% Rotated Wavelets 82.81% TABLE II P ERFORMANCE OF STATE OF THE ART METHODS ON DATABASE D1 cases, the size of the smallest subband was 16 × 16, which corresponds to three levels of decomposition. 3) Data: The database D1 is the same as used in [9] and is based on the VisTex database4 . It consists of 40 texture images of the size 512 × 512, each divided into 16 non-overlapping 128 × 128 patches. This amounts to 640 image samples of 40 different textures depicted in Fig. 4. The database D2 was generated from images of the following subset of the UIUC texture database5 : Bark1, Bark2, Wood2, Wood3, Water, Marble, Floor1, Floor2, Pebbles, Wall, Brick1, Glass1, Glass2, Carpet1, Carpet2, Wallpaper, Fur, Knit, Curdoroy, Plaid. Two 640 × 480 images from each class are used to create five overlapping 256 × 256 patches which are then scaled down to half the edge size. Hence, we get a database of 20 different texture classes each containing ten 128 × 128 patches. Fig. 5 depicts one patch from each utilized image. All image patches are normalized to zero mean and unit energy, in order to avoid any bias caused by the overall lighting condition of each original texture image. The set of all patches generated from the same texture is considered a class. Its cardinality will be denoted by c in the following. Consequently, c = 16 for D1 and c = 10 for D2. For each image patch, the c−1 most similar patches were retrieved. The retrieval rate for each patch is defined as the ratio of the number of retrieved patches from the same class to c−1. The overall retrieval rate is the average of the retrieval rates for all the images in the database. B. Results 1) Overview: The retrieval rates are summarized in Table I. As a benchmark, the results are compared to those produced by two other subband histogram methods, FWT + GGD and DT-CWT + GGD. All methods have in common that they first obtain a statistical model for each texture image and then measure the similarity by means of the KLD. While WST + WD fails in comparison to DT-CWT + GGD on Database D1, NWST + WD is able to outperform all of the competing frameworks by 1.84% for M = 2 and 2.96% for M = 3. Database D1 is widely used as a benchmark for CBIR retrieval. In order to provide a sense for the state of the art, Table II summarizes the results from recent publications on comparable approaches: Discrete Wavelet Frames (DWF) + GGD [9], DWT + Generalized Gamma Distribution (GΓD) [13], Wavelet Domain Hidden Markov Models (WD-HMM) [10] and Rotated Complex Wavelets [28]. To the authors’ best knowledge, the best published result for this experiment so far was produced by Rotated Complex Wavelets with a retrieval rate of 82.81% which is still outperformed by 0.53% by NWST + WD with M = 3. retrieval scattering 14.zip 2 http://www-sigproc.eng.cam.ac.uk/Main/NGK 4 http://vismod.media.mit.edu/vismod/imagery/VisionTexture/distribution.html 3 http://www.di.ens.fr/data/software/scatnet/ 5 http://www-cvr.ai.uiuc.edu/ponce grp/data/ SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 7 Fig. 4. Database D1 from left to right and top to bottom, with row and column in brackets: Bark0 (1,1), Bark6 (1,2), Bark8 (1,3), Bark9 (1,4), Brick1 (1,5), Brick4 (1,6), Brick5 (1,7), Buildings9 (1,8), Fabric0 (2,1), Fabric4 (2,2), Fabric7 (2,3), Fabric9 (2,4), Fabric11 (2,5), Fabric14 (2,6), Fabric15 (2,7), Fabric17 (2,8), Fabric18 (3,1), Flowers5 (3,2), Food0 (3,3), Food5 (3,4), Food 8 (3,5), Grass1 (3,6), Leaves8 (3,7), Leaves10 (3,8), Leaves11 (4,1), Leaves12 (4,2), Leaves16 (4,3), Metal0 (4,4), Metal2 (4,5), Misc2 (4,6), Sand0 (4,7), Stone1 (4,8), Stone4 (5,1), Terrain10 (5,2), Tile1 (5,3), Tile4 (5,4), Tile7 (5,5), Water5 (5,6), Wood1 (5,7), and Wood2 (5,8). Fig. 5. Database D2 (patches) from left to right and top to bottom: Bark1, Bark2, Wood2, Wood3, Water, Marble, Floor1, Floor2, Pebbles, Wall, Brick1, Glass1, Glass2, Carpet1, Carpet2, Wallpaper, Fur, Knit, Curdoroy, Plaid. SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING Database D1 Database D2 8 FWT + GGD DT-CWT + GGD WST + WD M=2 WST + WD M=3 NWST + WD M=2 NWST + WD M=3 77.40% 53.28% 80.38% 59.89% 78.93% 67.00% 77.92% 65.72% 82.22% 64.56% 83.34% 63.67% TABLE I P ERFORMANCE OF WST + WD AND NWST + WD IN COMPARISON WITH FWT + GGD AND DT-CWT + GGD ON DATABASES 1 AND 2 100 DT-CWT + GGD WST + WD NWST + WD 90 80 Retrieval rate (%) 70 60 50 40 30 20 10 0 0 5 10 15 20 25 30 35 40 Texture # Fig. 6. Retrieval rates for each texture class of D1 individually. From left to right, with index in brackets: Bark0 (1), Bark6 (2), Bark8 (3), Bark9 (4), Brick1 (5), Brick4 (6), Brick5 (7), Buildings9 (8), Fabric0 (9), Fabric4 (10), Fabric7 (11), Fabric9 (12), Fabric11 (13), Fabric14 (14), Fabric15 (15), Fabric17 (16), Fabric18 (17), Flowers5 (18), Food0 (19), Food5 (20), Food8 (21), Grass1 (22), Leaves8 (23), Leaves10 (24), Leaves11 (25), Leaves12 (26), Leaves16 (27), Metal0 (28), Metal2 (29), Misc2 (30), Sand0 (31), Stone1 (32), Stone4 (33), Terrain10 (34), Tile1 (35), Tile4 (36), Tile7 (37), Water5 (38), Wood1 (39), and Wood2 (40). 68 DT−CWT + GGD WST + WD NWST + WD 66 64 62 Retrieval rate (%) Database D2 is considerably smaller than D1, but involves more variation within the classes, for instance, in terms of camera angle and deformation. Again, the WST greatly improves the retrieval performance. However, this time the regular WST along with shorter path lengths produces the best results with a significant margin of 7.11% in comparison with DT-CWT + GGD. Nevertheless, in both experiments, NWST + WD consistently outperforms comparable approaches not involving Scattering. 2) Impact of Directionality: In general, natural textures contain strongly directional features. Hence, it is often beneficial to incorporate a frame of higher directionality as it is done for the Rotated Wavelets approach, for example. The DTCWT + GGD method which mostly draws its benefits from increased directional selectivity is consistently and significantly outperformed by NWST + WD by at least 1.84% for D1 and 3.78% for D2, despite the same directionality properties. That is to say, there are reasons to believe that the success of the WST based methods is not exclusively due to increased directionality. 3) Impact of Textural Structure: To further analyze the capabilities of the involved methods, consider Fig. 6. It shows the retrieval rates for each class in Database D1 produced by DT-CWT + GGD as well as by WST + WD and NWST + WD with M = 2. The results produced by FWT + GGD were omitted in order to preserve comparability in terms of 60 58 56 54 52 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 Standard deviation Fig. 7. Retrieval rates for D2, blurred with a Gaussian filter of different bandwidths. The standard deviation refers to the respective Gaussian distribution. A larger deviation is equivalent to a smaller bandwidth. directionality characteristics. DT-CWT + GGD performed well in comparison to the WST based techniques for the classes Brick5, Buildings9, Metal0 and Wood2 and badly for Bark9, Fabric4, Food0 and Leaves10. In general, it appears that DTCWT + GGD is most suitable to flat and (piecewise) smooth surfaces, while the two methods involving Scattering seem to be preferable for surfaces that are rather uneven, be it SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 9 because they are rough or because they are bent or dented. This correlates with the premise that Scattering coefficients are less sensitive to spatial deformations, since on the one hand, images of rough surfaces carry significant fractions of high-frequency components and on the other hand spatial deformation in the image is often caused by locally bending or denting the depicted material. More specifically, NWST + WD performed comparably well for the classes Bark0, Bark6, Flowers5, Terrain10 and regular WST for the classes Bark8, Fabric11, Grass1, Stone4. This suggests that WST + WD works better on fine structures while NWST + WD yields better results for coarse structures. This assumption is corroborated by the following experiment. Fig. 7 depicts the retrieval rates for the Database D2 after blurring it with a Gaussian filter of different bandwidths. Unsurprisingly, the retrieval performance declines with stronger blurring for all methods. However, it is evident that the retrieval rate for WST + WD declines significantly faster than the other two. This can be interpreted as a consequence of the distortion invariance of normalized Scattering coefficients. More importantly, this confirms that the regular WST method relies on fine features since they are most heavily affected by blurring. On the other hand, images of coarse patterns and flat and piecewise smooth surfaces are less affected by blurring such that the other two methods appear to be relatively robust. A PPENDIX A. Estimation of GGD parameters In this section, we describe an optimization algorithm for estimating the GGD parameters. Recall the log-likelihood of a GGD p(x|α, β) for a set of realizations {x1 , . . . , xN } in the following equation V. C ONCLUSION AND O UTLOOK Setting the right hand side of Eq. (33) be equal to 0 uniquely determines v u N uβ X β ∗ |xi |β (35) α = t N i=1 In this work, we propose a subband histogram FE method based on the statistics of the WST of a texture image. Our derivation and analysis demonstrate how to extract statistical features from a WST representation of a texture image by means of matching a Weibull model via ML and how to measure the similarity of the respective feature vectors via the KLD. The proposed techniques come in handy when it comes to reducing the enormous degree of redundancy introduced by the WST since they represent each subband by as much as two real numbered parameters. The experiments show that the proposed method can outperform comparable algorithms based on filter bank transforms in terms of retrieval accuracy when designed as a CBIR system. Regular WST coefficients seem to work better on fine structures and Normalized WST coefficients on coarse structures. Since approaches for rotation invariant Scattering representations are already available [16], [17], it appears feasible to extend the presented ideas towards a rotation invariant CBIR framework which would allow for more general problem settings. The summation (30) of the KLDs was partly motivated by assuming statistical independence of the subbands. However, neither does statistical independence hold for the WST subbands, nor for the subbands of other filterbank transforms, in practice. Accounting for statistical dependence in the model can lead to significant improvements in efficiency [10], [12]. Another approach could be based on recalling that analyzing the subband histograms is also motivated biologically. Thus, weight factors could be introduced to the summands of (30) in order to account for the different significance of each subband. fGGD (α, β) = ln LGGD (x1 , . . . , xN |α, β) X β N (31) 1 |xi | β . − lnΓ − =N ln 2α β α i=1 Then the parameters (α∗ , β ∗ ) for approximating the distribution of the realizations can be identified by solving the following maximization problem (α∗ , β ∗ ) = argmax fGGD (α, β). (32) (α,β) In this work, we employ a numerical optimization approach to solve the maximization problem given in Eq. (32). We firstly take the first derivative of fGGD as N X N ∂fGGD = α−β−1 β |xi |β − ∂α α i=1 (33) and β N ∂fGGD N N Γ0 (1/β) X |xi | |xi | = + 2 − ln . (34) ∂β β β Γ(1/β) i=1 α α for β > 0. Let us denote the Digamma function by Ψ(z) = Γ0 (z)/Γ(z), similarly, setting the right hand side of Eq. (34) to be equal to 0 leads to β N β X |xi | |xi | Ψ(1/β) − ln = 0. (36) 1+ β N i=1 α α By substituting Eq. (35) into Eq. (36), we have N N P β P |xi |β |xi |β ln |xi | ln N Ψ(1/β) i=1 i=1 1+ − + = 0, (37) N P β β |xi |β i=1 which characterizes the critical points of the function fGGD in terms of β for α = α∗ . Iterative numerical algorithms, such as the Newton-Raphson method, can be used to solve the above equation for β, cf. [9]. Due to the summand −N ln Γ (1/β), however, we can not guarantee concavity of fGGD . Nevertheless, a choice of good initial point β0 , as proposed in [9], [25], can increase the probability of finding the global maximum of the function fGGD , i.e. via solving the following equation for β0 via interpolation, N 2 P N |xi | 2 (Γ(2/β0 )) i=1 = . (38) N P Γ(1/β0 )Γ(3/β0 ) 2 |xi | i=1 SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 10 Once the solution β ∗ is found, it can be substituted into Eq. (35) for α∗ . B. Estimation of Weibull parameters In this section, we employ a numerical optimization approach to estimate the Weibull parameter. The log-likelihood amounts to fW D (k, λ) = ln LW D (x1 , . . . , xN |k, λ) = N k ln λ + N ln k + (k − 1) N X ln xi − i=1 ∗ N X (λxi )k . (39) i=1 ∗ The optimal parameters (k , λ ) are identified by solving (k ∗ , λ∗ ) = argmax fW D (k, λ), (40) (k,λ) by employing a similar technique as demonstrated in Appendix A. The first derivatives of fW D with respect to k and λ are computed as ! N N X N X ∂fW D k k = + ln xi + N − λ xi ln λ ∂k k i=1 i=1 (41) N X k k −λ (ln xi )xi i=1 and ∂fW D =k ∂λ N X N − λk−1 xki λ i=1 ! . (42) Setting Eq. (42) to be equal to 0 uniquely determines λ∗ for k > 0. Solving for λ yields !− k1 N X 1 λ∗ = xk . (43) N i=1 i Substituting (43) in Eq. (41) and equating it with 0 characterize the critical points of fW D in terms of k for λ = λ∗ , i.e. PN N N i=1 (ln xi )xki N X + ln xi − = 0. (44) PN k k i=1 xi i=1 Since the function fW D as defined in Eq. (39) is concave with respect to k > 0, a numerical algorithm such as the NewtonRaphson method can be used to obtain k ∗ . Substituting this solution into Eq. (43) yields λ∗ . R EFERENCES [1] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” Systems, Man and Cybernetics, IEEE Transactions on, no. 6, pp. 610–621, 1973. [2] H. Tamura, S. Mori, and T. Yamawaki, “Textural features corresponding to visual perception,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 8, no. 6, pp. 460–473, 1978. [3] T. Randen and J. Husoy, “Filtering for texture classification: a comparative study,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 4, pp. 291–310, Apr 1999. [4] J. Smith and S.-F. Chang, “Transform features for texture classification and discrimination in large image databases,” in Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, vol. 3, Nov 1994, pp. 407–411 vol.3. [5] B. S. Manjunath and W.-Y. Ma, “Texture features for browsing and retrieval of image data,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, no. 8, pp. 837–842, 1996. [6] I. Sumana, M. Islam, D. Zhang, and G. Lu, “Content based image retrieval using curvelet transform,” in Multimedia Signal Processing, 2008 IEEE 10th Workshop on, Oct 2008, pp. 11–16. [7] G. Van de Wouwer, P. Scheunders, and D. Van Dyck, “Statistical texture characterization from discrete wavelet representations,” Image Processing, IEEE Transactions on, vol. 8, no. 4, pp. 592–598, 1999. [8] N. Vasconcelos, “On the efficient evaluation of probabilistic similarity functions for image retrieval,” Information Theory, IEEE Transactions on, vol. 50, no. 7, pp. 1482–1496, July 2004. [9] M. N. Do and M. Vetterli, “Wavelet-based texture retrieval using generalized gaussian density and kullback-leibler distance,” IEEE Trans. Image Processing, vol. 11, no. 2, pp. 146–158, 2002. [10] ——, “Rotation invariant texture characterization and retrieval using steerable wavelet-domain hidden markov models,” Multimedia, IEEE Transactions on, vol. 4, no. 4, pp. 517–527, 2002. [11] D.-Y. Po and M. Do, “Directional multiscale modeling of images using the contourlet transform,” in Statistical Signal Processing, 2003 IEEE Workshop on, Sept 2003, pp. 262–265. [12] G. Tzagkarakis, B. Beferull-Lozano, and P. Tsakalides, “Rotationinvariant texture retrieval with gaussianized steerable pyramids,” Image Processing, IEEE Transactions on, vol. 15, no. 9, pp. 2702–2718, Sept 2006. [13] S. Choy and C. Tong, “Statistical wavelet subband characterization based on generalized gamma density and its application in texture retrieval,” Image Processing, IEEE Transactions on, vol. 19, no. 2, pp. 281–289, Feb 2010. [14] M. Allili, “Wavelet modeling using finite mixtures of generalized gaussian distributions: Application to texture discrimination and retrieval,” Image Processing, IEEE Transactions on, vol. 21, no. 4, pp. 1452–1464, April 2012. [15] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012. [16] L. Sifre and S. Mallat, “Combined scattering for rotation invariant texture analysis,” in European Symposium on Artificial Neural Networks, 2012. [17] ——, “Rotation, scaling and deformation invariant scattering for texture discrimination,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 1233–1240. [18] J. Bruna, “Scattering representations for recognition,” Ph.D. dissertation, CMAP, Ecole Polytechnique, 2012. [19] J. R. Bergen and M. S. Landy., “Computational modeling of visual texture segregation,” in Computational Models of Visual Processing, ser. A Bradford book, M. Landy and J. Movshon, Eds. MIT Press, 1991. [20] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300–307, 2007. [21] S. Mallat, A Wavelet Tour of Signal Processing - The Sparse Way. Elsevier Science, 2008, pp. 89–376. [22] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872–1886, 2013. [23] J. And`en and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, 2013. [24] I. Waldspurger, A. d’Aspremont, and S. Mallat, “Phase recovery, maxcut and complex semidefinite programming,” Mathematical Programming, pp. 1–35, 2012. [25] S. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 11, no. 7, pp. 674–693, Jul 1989. [26] C. Bauckhage, “Computing the Kullback-Leibler Divergence between two Weibull Distributions,” ArXiv e-prints, 2013. [27] I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury, “The dual-tree complex wavelet transform,” Signal Processing Magazine, IEEE, vol. 22, no. 6, pp. 123–151, 2005. [28] M. Kokare, P. K. Biswas, and B. N. Chatterji, “Texture image retrieval using new rotated complex wavelet filters,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 35, no. 6, pp. 1168–1178, 2005.

1/--pages