
Image Representation with Bag of bi-SIFT

Ignazio Infantino, Filippo Vella
Consiglio Nazionale delle Ricerche, ICAR
Viale delle Scienze ed. 11, Palermo, ITALY
Email: [email protected], [email protected]

Giovanni Spoto, Salvatore Gaglio
Università degli Studi di Palermo, DINFO
Viale delle Scienze ed. 9, Palermo, ITALY
Email: [email protected], [email protected]

Abstract—Local features are widely adopted to describe visual information in tasks such as image registration; nowadays the most used and studied feature is SIFT (Scale Invariant Feature Transform) [1], thanks to its great local description power and its reliability under different acquisition conditions. We propose a feature that is based on SIFT features and tends to capture larger image areas, so that it can be used for semantic-based tasks. These features are called bi-SIFT for their resemblance to textual bigrams. We tested the capability of the proposed representation on the Corel dataset. In particular, we computed the most representative features through a clustering process and used these values according to the visual terms paradigm. Experiments on the representation of sets of images with the proposed feature are shown. Although preliminary, the results appear encouraging.

Keywords—Image Retrieval, Image Analysis, Image Representation, SIFT

I. INTRODUCTION

Local features are usually adopted to describe visual information [2][3][4][5]. Among them, SIFT (Scale Invariant Feature Transform) features [1] are widely adopted to represent visual information for tasks involving registration [6][7] and image matching. These features are invariant to variations of scale, image rotation and affine variations of viewpoint. Being local representations, these features describe points according to the neighboring pixels in a Gaussian pyramid, but they fail to capture the global information bound to the higher-level semantics of a visual scene. A descriptor aimed at capturing semantic values, related to objects and "large scale" patterns, should represent characteristics that are invariant under a wide range of transformations and should capture values that are less affected by different views and capture parameters. SIFT features are a good starting point, since they are invariant to rotation and scale variation, and to some extent these properties "guarantee" that the same object maintains the same representation in different scenes. Changes of luminance are compensated by the SIFT representation as a gradient histogram, which is invariant to variations in luminance; changes of scale are compensated by the point selection; and changes of rotation are compensated by the orientation normalization. Unfortunately, local features such as SIFT are bound to local regions and do not cover relevant portions of images. We propose a technique to compose SIFT features to create more abstract features related to wider areas in images: from related local feature values, a more global piece of information is captured.

The idea is similar to what is done in natural language processing, where single words are composed into bigrams to form a new symbolic representation able to cover a larger part of the language semantics than a single word.

Works employing local descriptors to describe image content have been proposed by Csurka et al. [8], Sivic and Zisserman [9] and Hare et al. [10]. Csurka et al. [8] evaluate all the SIFT features for the dataset, build a cluster distribution with k-means and represent new images by counting how many features fall in the chosen clusters. These new features are used as a representation to classify test objects; the authors show promising results for the classification of 7 objects. Similarly, Sivic and Zisserman [9] cluster SIFT descriptors and use them as words to apply text retrieval techniques to matching objects in keyframes of video sequences. To validate the matching among SIFT features, a property of spatial consistency is checked: a search area formed by the 15 nearest spatial neighbors is defined, and a match between two regions centered on a keypoint is validated if a keypoint among the 15 nearest to the starting point matches a keypoint among the 15 nearest to the target one. Matches that fall outside this frame do not contribute to the matching score. Hare et al. [10] applied a SIFT-based representation to employ cross-language latent semantic indexing. Ke and Sukthankar [11] adopt SIFT local descriptors to create a more general descriptor, built by applying Principal Component Analysis to linearly project the high-dimensional data given by the SIFT descriptors to low-dimensional data in the principal component space. Each representation deals with a single image keypoint.


As further reference, attempts to describe objects with affine invariant descriptors can be found in [12][13]. Lazebnik et al. [12] propose descriptors that show invariance to affine transformations, identifying matches among couples of images depicting the same subject (the authors concentrated their tests on a set of six hundred butterflies); the initialized descriptors are matched against a larger validation set. Affine regions are spotted with a Laplacian blob detector based on Lindeberg's descriptors [4]. The found regions are represented with multiple types of image information, such as spin images [14] and RIFT descriptors [15]. Brown and Lowe [13] propose a family of features which use groups of interest points to form geometrically invariant descriptors. Interest points in the scenes are located at the extrema of the Laplacian of the image in scale-space. Points are described with a family of 2D transformation-invariant features based on groups of interest points, in particular groups which have between 2 and 4 interest points among the nearest neighbors. The 2×n parameters of the transformation to a canonical frame are computed, and the descriptor is formed by resampling the region local to the interest points in the canonical frame. The Hough Transform is used to find clusters of features in the 2D transformation space, and RANSAC is adopted to improve the 2D transformation estimate.

We considered that SIFT is a transformation very useful for capturing local information, generating hundreds of keypoints in a generic image, and that it is particularly suitable for image registration or image matching. In this work we propose a novel representation achieved by composing SIFT descriptors, creating a reduced number of points in an image while, at the same time, allowing more abstract descriptors. The new descriptors cover a larger region instead of a single point, and this turns out to be semantically more relevant than the original patch covered by a SIFT keypoint descriptor. The representation generates features that tend to be semantically nearer to scene objects and image tags. The new feature maintains the keypoint properties of robustness against variations in illumination and changes of scale and, for the way SIFT features are composed, also maintains the invariance against rotations. We show in the experiments that this composite feature, called bi-SIFT, allows a reliable representation that improves on solutions based on plain SIFT.

The paper is organized as follows: Section II describes the identification of keypoints and the creation of SIFT features; Section III shows how the proposed features are computed and how images are represented with them; in Section IV the results of the experimental setup are shown; finally, in Section V conclusions are drawn.

II. SIFT FEATURES

SIFT features have been proposed by Lowe [1] for detecting points that are invariant against changes in illumination, image scaling and image rotation. These features follow the research line of local features such as the corner detectors by Harris [2] and Lucas [3], the scale-space theory by Lindeberg [4] and the edge density by Shi [5]. In particular, Lowe created local features aimed at being less affected by geometric variations (rotation, translation, scale change) and intensity variations (linear variations). To automatically detect points that are more robust against these deformations, SIFT features are extracted by creating a pyramid of Gaussian image transformations at different scales, finding in the pyramids the peaks that are independent of variations in scale, and normalizing the features according to image dimension and rotation.

In particular, the Gaussian pyramid is evaluated for the sample image, and the difference between adjacent layers of the pyramid is calculated to obtain the Difference of Gaussians (DoG) pyramid. To get the local extrema in the images, the neighboring points at the same scale and at different scales are considered, and a point is retained if it is the greatest among the neighboring points in the same image and in all the other scales; similarly, points that are the smallest are retained. This step is called keypoint localization. For each candidate point the location and orientation are evaluated. Points with low contrast, which are difficult to detect, and points along edges, which are very unstable in the presence of noise, are discarded. Once a point is selected, an orientation histogram is formed with the gradient orientations of the sample points within a region around the keypoint. The orientation histogram has 36 bins covering the 360 degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times that of the scale of the keypoint. The modes of the histogram are considered as the dominant orientations for the given keypoint, and all the orientations within 80% of the maximum value are retained. The detected keypoints are represented by forming a descriptor with orientation histograms computed on a 4×4 grid of subregions around the keypoint. Since the histograms are populated considering orientations referred to the largest gradient magnitude, the feature is invariant against rotations. Each histogram contains 8 bins, so the SIFT feature is composed of 4×4×8 = 128 elements. The normalization of the histograms provides robustness against illumination changes.

Figure 1. SIFT feature descriptor
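As a concrete illustration of the pipeline above, the snippet below extracts SIFT keypoints and their 128-element descriptors with OpenCV, whose implementation follows Lowe's scheme; this is only a minimal sketch, not the code used by the authors, and the file name is a placeholder.

import cv2

# Load a sample image and discard the color information
# ("example.jpg" is a hypothetical file name used only for illustration).
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# OpenCV's SIFT follows Lowe's pipeline: DoG pyramid, keypoint
# localization, orientation assignment, 4x4x8 = 128-element descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint exposes its position (pt), scale (size), dominant
# orientation (angle) and a stability measure (response);
# `descriptors` is an (N, 128) float32 array.
print(len(keypoints), descriptors.shape)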

III. PROPOSED FEATURE AND IMAGE REPRESENTATION

SIFT features are reliable and robust against typical variations in picture viewpoint. Nevertheless, they are local features and cover local properties of objects and scenes. To create a feature as robust as SIFT but able to describe wider areas of images and scenes, we consider the composition of a set of keypoints in an image region. The new feature is composed taking into account the keypoints falling in a circular region centered on a keypoint and delimited by a fixed radius. The new feature represents, in a more abstract way, a larger piece of image or a complex pattern, allowing the capture of large portions of objects or scene-invariant characteristics. The size of the regions described by this novel feature (called bi-SIFT) is driven by an empirically fixed parameter. This parameter is called spatial bandwidth, coherently with the spatial clustering of Mean Shift theory [16].

The feature is built considering the keypoints falling in a region centered on a keypoint. Inside this image portion one or more keypoints can be found. If only the central point falls in the region, meaning that the selected part of the image captures a region with few relevant points, the bi-SIFT feature is not generated. Otherwise, when more points fall in the region around the keypoint, a composite description of the region is considered. Not all the keypoints are taken into account since, in that case, a variable-size descriptor would be created. Instead, a selection is made preserving the most relevant and stable information in the covered area. The property of SIFT descriptors that best matches the stability and robustness of SIFT features against image transformations is the highest gradient magnitude, which can be evaluated as the module of the main orientation in the represented image patch. The SIFT descriptors are then ordered according to their highest gradient magnitude, and the descriptors with the highest values are retained (in this case we set this number to two). The selected points are the most stable against variations in capture conditions and better characterize scene-invariant values. The new feature is formed by the juxtaposition of the SIFT representations of the selected points. A schematic representation of the new feature is shown in figure 2.

Figure 2. Example of bi-SIFT composition

This feature represents a wider area than SIFT descriptors, maintaining invariance against changes of viewpoint. Single SIFT features are invariant against variations in scale: if a change of scale occurs, the same region will be described by similar SIFT descriptors in the scaled images. If two keypoints are selected to form a bi-SIFT feature in the original image, given that the change of scale keeps the two points inside the spatial bandwidth, the same two keypoints will be selected to form the bi-SIFT feature for the same region in the scaled image, providing invariance of bi-SIFT against changes of scale. If the image is rotated or the viewpoint is changed, a given region will produce an approximation of the original bi-SIFT descriptor. Single points described with SIFT features are, by the properties of SIFT, invariant to rotation. A couple of keypoints forming a bi-SIFT feature is mapped by a rotation to two different positions, but since the gradient magnitude is not affected by rotation, the sorting of features will select the same keypoints inside the given region, forming the same bi-SIFT as before the rotation and assuring invariance against rotations. For these properties, bi-SIFT descriptors are reliable in describing portions of objects and relevant patterns in scenes. SIFT descriptors are related to relatively small areas, and if a couple of points are accidentally near in an image (e.g. a point from an object and a point from the background) they will be greatly affected by a change of viewpoint, and it is unlikely that they will be retrieved in images depicting the same object in a different scene. The recurrence of a bi-SIFT feature therefore asserts the presence of a given object or of a characteristic pattern for a given scene. Some examples of bi-SIFT are shown in figure 3 for two images of the indoor class of the Corel dataset.

Figure 3. Example of bi-SIFT in indoor scenes

The proposed features, for their properties, can also be profitably used to find reliable points for matching between images. An example of matching between images is shown in figure 4. The images have been acquired from different viewpoints, and the quality of the matched points shows the good performance of bi-SIFT also for image registration tasks.¹

¹ Images are available at http://www.robots.ox.ac.uk/~vgg/data/data-aff.html

Figure 4. Matching example with bi-SIFT
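As a concrete sketch of the composition just described (not the authors' code), the following Python function builds bi-SIFT vectors from OpenCV keypoints and descriptors; using the detector response as a proxy for the highest gradient magnitude is an assumption of this sketch.

import numpy as np

def bi_sift(keypoints, descriptors, spatial_bandwidth=7.0):
    # Keypoint positions and a stability measure; the OpenCV `response`
    # field stands in here for the highest gradient magnitude.
    pts = np.array([kp.pt for kp in keypoints])
    mag = np.array([kp.response for kp in keypoints])
    features = []
    for center in pts:
        # All keypoints falling in the circular region of radius
        # `spatial_bandwidth` centered on the current keypoint.
        inside = np.where(np.linalg.norm(pts - center, axis=1)
                          <= spatial_bandwidth)[0]
        if len(inside) < 2:
            continue  # only the central point: no bi-SIFT is generated
        # Retain the two most stable keypoints in the region and
        # juxtapose their 128-d SIFT descriptors into a 256-d vector.
        top2 = inside[np.argsort(mag[inside])[::-1][:2]]
        features.append(np.concatenate([descriptors[top2[0]],
                                        descriptors[top2[1]]]))
    return np.array(features)  # shape (M, 256)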

A. Bag of bi-SIFT

For the properties of the proposed feature, it can be successfully used to represent images in large datasets. A problem can arise when the number of images grows and the number of features extracted from each image makes the problem cumbersome. For this reason we chose to adopt the representation based on visual terms [17]. The main advantage of this technique is that it considers the most relevant features and represents values according to the more frequent and intrinsically more significant points in the representation space. Here we propose the application of a technique based on visual terms where the feature to cluster is the bi-SIFT feature described above. The underlying hypothesis is that couples, or bigrams, of SIFT features are suitable to describe areas larger than the single local features expressed by SIFT, and that these features, collecting information at a level between objects and pixels, are good candidates for the reduction of the semantic gap.

The conceived features are used to find reliable points in images for matching local parts of images and, furthermore, to represent the visual content of images. The set of local features is used to create a dictionary of features (called visual terms in the literature) for the description of a generic visual content. The set of values used as symbolic descriptors is evaluated by clustering the bi-SIFT features, represented as vectors of 256 real numbers, and extracting the centroids as fundamental values. For each category the corresponding set of bi-SIFT features is considered, and the visual terms for each category are added to the global set of visual terms. A generic image is therefore represented as a bag of visual terms, and different images will be represented by vectors with different distributions in their components. The process that extracts visual terms as centroids of feature clusters reduces the presence of noise in the features, discarding irrelevant information from the representation. The extracted visual terms are collected, forming a dictionary used to represent any visual content. For each image in the dataset the set of SIFT points is extracted, and the points are coupled to form bigrams of SIFT as described above. For each bi-SIFT the nearest feature in the dictionary is found, and for each image a vector with a cardinality equal to the size of the visual dictionary is filled with the visual terms found in it.

Figure 5. Example of Visual Terms for an image depicting a computer
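A minimal sketch of the dictionary construction follows, with scikit-learn's MeanShift standing in for the Adaptive Mean Shift variant used in the paper; the bandwidth value and the assumption that the bi-SIFT vectors are normalized to a unit range are placeholders.

import numpy as np
from sklearn.cluster import MeanShift

def build_dictionary(bisift_by_category, bandwidth=0.065):
    # `bisift_by_category` maps a category name to an (M, 256) array of
    # bi-SIFT vectors, assumed normalized so that the small bandwidth
    # values quoted in the paper are meaningful.
    dictionary = []
    for category, feats in bisift_by_category.items():
        ms = MeanShift(bandwidth=bandwidth).fit(feats)
        # The centroids of each category become its visual terms.
        dictionary.append(ms.cluster_centers_)
    # Global dictionary: visual terms of all categories, shape (T, 256).
    return np.vstack(dictionary)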

The visual dictionary is evaluated by considering all the bi-SIFT features inside a given category and clustering the bi-SIFT values elaborated with the approach described above. In particular, the clustering of features is accomplished through a technique based on feature density estimation. This technique is called Mean Shift, since the algorithm iteratively follows the mean vector along the gradient of the density [16]. The algorithm is based on a bandwidth parameter that drives the clustering process and creates a number of clusters according to the density estimation. For higher data dimensions an improved version of Mean Shift, called Adaptive Mean Shift, has been proposed by the same authors. According to this second algorithm, the data are partitioned with Locality Sensitive Hashing (LSH) to get a reduced-dimension representation, and the Mean Shift algorithm is applied to the hashed data [18]. The clustering parameters are picked empirically. The set of all the visual terms obtained from each category through the clustering process is added to the global dictionary and is used to describe the images in the dataset, as well as generic images. The representation of an image is a vector filled with the values t_{ij} defined below:

\[
t_{ij} = \frac{n_{ij}}{n_i} \log \frac{N}{N_j} \tag{1}
\]

where $n_{ij}$ is the number of occurrences of the $j$-th visual term in the $i$-th image, $n_i$ is the number of terms in the $i$-th image, $N$ is the number of images in the dataset and $N_j$ is the number of occurrences of the $j$-th visual term. This processing multiplies the term frequency $n_{ij}/n_i$ by the inverse document frequency $\log(N/N_j)$, giving a higher value to terms that are present in shorter visual documents and to less frequent visual terms.

An advantage of using bi-SIFT is the possibility to store stable compositions of SIFT keypoints. For example, if two SIFT keypoints are near and related to the same object or the same pattern in the scene, the bi-SIFT will not change considerably when the view changes or the object is slightly moved in the scene. On the other side, if the couple of SIFT keypoints forming the bi-SIFT comes from different objects, or from an object and the background, there is a low probability that they will be near in another scene. In particular, during the clustering process, if no similar feature is found, such features will not contribute to forming clusters, since they are rare in the image representations, and they will definitely not affect the global representation.
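The weighting of eq. (1) can be sketched as follows, assuming a precomputed matrix of visual-term counts; the function and variable names are illustrative only.

import numpy as np

def tf_idf(counts):
    # `counts[i, j]` is n_ij, the number of occurrences of the j-th
    # visual term in the i-th image; it is assumed that every term
    # occurs at least once in the dataset and every image has terms.
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                      # number of images
    n_i = counts.sum(axis=1, keepdims=True)  # terms per image
    N_j = counts.sum(axis=0)                 # occurrences of each term
    # Term frequency times inverse document frequency, as in eq. (1).
    return (counts / n_i) * np.log(N / N_j)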

IV. EXPERIMENTAL RESULTS

The representation with bags of bi-SIFT has been tested with images from the Corel dataset, which has been widely used as a benchmark in image classification and annotation tasks (e.g. [17][19]). It consists of 5000 images divided in a set of CDs, each containing images with homogeneous categories. Some images from the categories "computer" and "tiger" are shown in the following figures.

Figure 6. Example of Corel images from category Computer and category Tiger

To reduce the experiment time and produce a subsampled test, a subset of 765 images has been extracted from the Corel dataset. Each image in the dataset has been processed to discard the color information and extract the SIFT features, which have then been coupled to form the corresponding bi-SIFT features. Although the color channels carry a lot of information, in this case we consider features that involve just luminance, with the possibility of adding chroma information at a later stage. To evaluate the value of the spatial bandwidth (see Section III), which is related to the image area covered by a bi-SIFT feature, the clustering process has been tested with different values of the spatial bandwidth. The results are shown in figure 7, where the number of clusters is plotted versus the value of the spatial bandwidth. The three plots in the graph correspond to different values of the multivariate kernel bandwidth that drives the clustering process of the Adaptive Mean Shift algorithm: 0.060, 0.065 and 0.070. The graph shows that the larger the spatial bandwidth, the smaller the number of clusters. With lower values of the spatial bandwidth the number of clusters is roughly constant, and it shows a maximum when the spatial bandwidth is around 7.
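A sweep of this kind can be sketched by reusing the bi_sift and MeanShift pieces introduced above; the helper below is hypothetical and only illustrates the experimental loop, not the authors' code.

import numpy as np
from sklearn.cluster import MeanShift

KERNEL_BANDWIDTH = 0.065  # one of the three kernel bandwidths tried

def count_clusters(all_keypoints, all_descriptors, spatial_bandwidths):
    # For each spatial bandwidth, build the bi-SIFT features of every
    # image (bi_sift is the sketch from Section III), cluster them and
    # record how many clusters are produced.
    n_clusters = []
    for sb in spatial_bandwidths:
        feats = np.vstack([bi_sift(kps, descs, spatial_bandwidth=sb)
                           for kps, descs in zip(all_keypoints,
                                                 all_descriptors)])
        ms = MeanShift(bandwidth=KERNEL_BANDWIDTH).fit(feats)
        n_clusters.append(len(ms.cluster_centers_))
    return n_clusters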

The extracted features are clustered with the Adaptive Mean Shift algorithm using a kernel bandwidth of 0.065 and varying the spatial bandwidth. The centroids obtained from the clustering are used as visual terms, allowing the description of images with a set of symbolic features. The process, as described in Section III-A, creates a representation in which each row corresponds to an image represented according to the extracted set of visual terms. The representation of images with tf-idf is a function of the number of visual terms, of the chosen features and of the distance used to match features with visual terms.

Figure 7. Number of clusters versus the spatial bandwidth with Adaptive Mean Shift clustering

Figure 8. Number of non-zero values versus the distance

Figure 8 shows how the population of the matrix, having images bound to rows and visual terms bound to columns, depends on the value of the distance threshold used to match a given feature with a visual term. If the distance threshold is too low, there are few matches between the image bi-SIFT features and the visual terms. On the other side, if the threshold is too high, many matches are found and all these values are cut by the entropic filtering (see eq. (1)). The resulting plot is a Gaussian-like curve, as shown in figure 8. For different values of the bi-SIFT spatial bandwidth the curve is slightly shifted. When the spatial bandwidth is set to 7, the distance threshold that yields the smallest number of non-zero values is 0.6.

In figure 9 the recall versus precision is shown for two values of the spatial bandwidth, compared with the SIFT curve. The bi-SIFT curves correspond to spatial bandwidth values of 2 and 12. The curves for the other bandwidth values lie between these two, and the SIFT curve is very near to the curve with spatial bandwidth equal to 2.

Figure 9. Recall versus Precision
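A retrieval-style evaluation of the kind plotted in figure 9 can be sketched as follows; the cosine similarity and the ranking protocol are assumptions of this sketch, since the paper does not detail them.

import numpy as np

def precision_recall(query_vec, database, labels, query_label):
    # Rank database images by cosine similarity of their tf-idf vectors
    # to the query and compute precision and recall at every rank.
    sims = database @ query_vec / (
        np.linalg.norm(database, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(sims)[::-1]          # best matches first
    relevant = labels[order] == query_label
    hits = np.cumsum(relevant)
    precision = hits / np.arange(1, len(order) + 1)
    recall = hits / relevant.sum()
    return precision, recall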

V. CONCLUSIONS

An approach aiming at creating an object-level feature has been presented. The feature, called bi-SIFT, is based on the well-known SIFT feature. To create a more abstract representation, SIFT keypoint descriptors are composed to form a feature that resembles the textual bigram, tends to capture a larger part of the image, and can be used for semantic-oriented tasks. Experiments show promising results, and future work will include the application to tasks of scene understanding and automatic image annotation.

REFERENCES

[1] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[2] C. Harris and M. Stephens, "A combined corner and edge detector," in Proceedings of The Fourth Alvey Vision Conference, 1988, pp. 147–151.

[3] Bruce D. Lucas and Takeo Kanade, "An iterative image registration technique with an application to stereo vision," 1981.

[4] T. Lindeberg, "Effective scale: A natural unit for measuring scale-space lifetime," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1068–1074, 1993.

[5] J. Shi and C. Tomasi, "Good features to track," in Proceedings of the Conference on Computer Vision and Pattern Recognition, June 1994, pp. 593–600.

[6] Samuel Cheng, Vladimir Stankovic, and Lina Stankovic, "Improved SIFT-based image registration using belief propagation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2009, pp. 2909–2912.

[7] Y. Fan, M. Ding, Z. Liu, and D. Wang, "Novel remote sensing image registration method based on an improved SIFT descriptor," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Nov. 2007, vol. 6790.

[8] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in ECCV International Workshop on Statistical Learning in Computer Vision, 2004.

[9] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proceedings of the International Conference on Computer Vision, 2003, vol. 2, pp. 1470–1477.

[10] J. Hare, P. Lewis, P. Enser, and C. Sandom, "A linear-algebraic technique with an application in semantic image retrieval," in Proceedings of the International Conference on Image and Video Retrieval, 2006, pp. 31–40.

[11] Yan Ke and Rahul Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in Proc. of Computer Vision and Pattern Recognition (CVPR) 04, 2004, pp. 506–513.

[12] S. Lazebnik, C. Schmid, and J. Ponce, "Semi-local affine parts for object recognition," in Proceedings of BMVC, 2004.

[13] M. Brown and D. Lowe, "Invariant features from interest point groups," in Proceedings of BMVC 2002, 2002.

[14] A. Johnson, "Using spin images for efficient object recognition in cluttered 3D scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, 1999.

[15] S. Lazebnik, C. Schmid, and J. Ponce, "A sparse texture representation using local affine regions," Technical Report CVR-TR-2004-01, Beckman Institute, University of Illinois, 2004.

[16] Dorin Comaniciu and Peter Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 603–619, 2002.

[17] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, "Matching words and pictures," Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.

[18] B. Georgescu, I. Shimshoni, and P. Meer, "Mean shift based clustering in high dimensions: A texture classification example," in Proceedings of the IEEE International Conference on Computer Vision, 2003.

[19] F. Vella, C.-H. Lee, and S. Gaglio, "Boosting of maximal figure of merit classifiers for automatic image annotation," in Proc. of International Conference on Image Processing, 2007.
