Hierarchical Modeling of Multidimensional Data in Regularly Decomposed Spaces: Synthesis and Perspective

2020·Arxiv

Abstract

Abstract

This fourth and last tome is focusing on describing the envisioned works for a project that has been presented in the preceding tome. It is about a new approach dedicated to the coding of still and moving pictures, trying to bridge the MPEG-4 and MPEG-7 standard bodies.

The aim of this project is to de(ine the principles of self-descriptive video coding. In order to establish them, the document is composed in (ive chapters that describe the various envisioned techniques for

Based on the techniques of multiresolution computing, it is proposed to develop an image segmentation made from piecewise regular components, to compute attributes on the frame and the rendering of so produced shapes, independently to the geometric transforms that can occur in the image plane, and to gather them into perceptual groupings so as to be able in performing recognition of partially hidden patterns.

Due to vector quantization of shapes frame and rendering, it will appear that simple shapes may be compared to a visual alphabet and that complex shapes then become words written using this alphabet and be recorded into a dictionary. With the help of a nearest neighbour scanning applied on the picture shapes, the self-descriptive coding will then generate a sentence made from words written using the simple shape alphabet.

Keywords : perception models for arti(icial vision, mono- and multispectral image analysis, multiresolution analysis, surface manifolds, localization, calculus of attributes invariant to geometric transformations, indexing, similarity measure, perceptual groupings, self-descriptive coding, visual primitive dictionaries, statistical pattern recognition, structural pattern recognition, generalized images.

Fields : Modeling and simulation, Algorithm and data structure, Image processing, Computer-based vision and pattern recognition, Learning

Table of contents

Introduction........................................................................................................................................................................5 1. Description of the work.............................................................................................................................................7 2. Image segmentation....................................................................................................................................................9

2.1. Introduction...........................................................................................................................................................9 2.2. Experimental foundations of a regular piecewise decomposition...................................................9

2.2.1. Taking in account the irregularities of a model...............................................................................9 2.2.2. Adjacent linear model (itting................................................................................................................10 2.2.3. Generation of continuous models of any order.............................................................................10

2.3.1. Digital representation of an image.....................................................................................................11 2.5.2. Image least squares linear estimation..............................................................................................14 2.3.3. Irregularities detection and data set dividing...............................................................................18 2.3.4. Image piecewise linear decomposition............................................................................................19 2.3.5. Two image pieces aggregating at the next upper order............................................................22

3. Calculus of descriptors............................................................................................................................................29

3.1. Describing visual shapes................................................................................................................................29 3.2. Descriptors associated to the shape domains........................................................................................29

3.2.1. Generalized moments of a shape domain.......................................................................................29 3.2.2. Moments calculus over a cellular representation........................................................................30 3.2.3. Translation of moments to the domain gravity center..............................................................30 3.2.4. Rotation of moments in the domain own coordinate system.................................................31 3.2.5. Invariant characteristics of the surface pieces domains...........................................................33

3.3.2. Translation of the analytic expressions towards the piece gravity centers.......................34 3.3.3. Description of a shape in the coordinate system of its tangent plane.................................35 3.3.4. Shape own coordinate system and reduction of its analytic expression............................36 3.3.5. Invariant characteristics of the shapes associated to the surface pieces............................37

4. Building perceptual groupings.............................................................................................................................39

4.1. Introduction........................................................................................................................................................39 4.2. Shape aggregation.............................................................................................................................................40 4.3. Facial recognition example............................................................................................................................41 4.4. Shape aggregation algorithms......................................................................................................................43

4.4.1. Aggregation of shape rendering descriptors.................................................................................43 4.4.2. Aggregation of shape domain descriptors......................................................................................45

5. Building visual dictionaries...................................................................................................................................47

5.1. Vector quantization of shape descriptors................................................................................................47 5.2. Visual alphabet of simple shapes................................................................................................................50 5.3. Visual dictionary of complex shapes.........................................................................................................51 5.4. Synthesis of simple and complex shapes.................................................................................................52

6. Image and video coding..........................................................................................................................................53

6.1. Optimal path along the blocks of a regularly divided image............................................................53 6.2. Visiting path for the objects of an image..................................................................................................54 6.3. Visual syntax and phraseology.....................................................................................................................56 6.4.Multidimensional sorting of a shape base................................................................................................56

Conclusion.........................................................................................................................................................................57 Bibliographic references..............................................................................................................................................59

Introduction

This fourth and last tome is focusing on describing the envisioned works for a project that has been presented in the preceding tome. It is a project about a new approach dedicated to the coding of still and moving pictures, trying to bridge the MPEG-4 and MPEG-7 standard bodies. The aim of this project is to de(ine the principles of self-descriptive video coding.

The envisioned paradigm relies on a layered representation model that gives structure to the information belonging to still and moving pictures, that is the information made from pixel rows, geometric structures, relations shared by regions and visual objects. These different kinds of information are complementary to each other and can be retrieved one from the another one according to a progressive approach naturally leading to a pyramidal organization of visual information.

The (irst aim is then to develop an intermediate level representation model that allows to handle the picture and video content starting from their homogeneous components. A geometrically representation is also de(ined for these various components.

These components are visual primitives established by learning from the captured content, then recorded into a dictionary. They can be managed independently from each other or gathered in order to make easier editing and content-based searching.

Shape descriptors are used for coding, recording and indexing various visual primitives and eventually the full content of still and moving pictures.

1. Description of the work

The work plan is divided into several steps enabling to develop the necessary technologies for achieving the aims of the research project.

It includes the following steps:

The (irst step will be focused on the image structure interpreted as piece-wise regular surfaces and will try to (ind the order of these surface pieces in distinguishing the singularities from the encountered irregularities. It will enable to develop an image segmentation, not driven by the search of adjacent points according to a metric topology, but rather by favoring the ultra-metric relations between regions coming from the piece-wise segmentation: the notion of visual object is then replaced by the viseme one which is a smaller homogeneous and compact unit.

The second step is looking for measuring these new visual shapes by using generalized moments, that enable to localize and to provide measures invariant to geometric transformations that the support of these simple shapes may suffer inside the image plane, and the reduction of the analytic expressions of the shape rendering in the tridimensional space linked to the corresponding surface piece, knowing that a single viseme will include as many tridimensional shapes as there are spectral bands in the image.

The following step will be interested in the compound shapes obtained by hierarchical aggregation of simple shapes in the image in the form of perceptual groupings ([15],[16]) that will be nearer to the classical notion of visual objects and that will enable to develop a new technique for shape recognition in hidden parts at halfway from the statistical ([9]) and structural ([10]) shape recognition methods. The computation of descriptors linked to simple shapes will be extended to complex shapes in order to provide a single way for calculating descriptors whatever is the nature of the shape.

Then our attention will be brought on the quantization of their descriptors that will be used as keys for retrieving the visual information from a data base and the synthesis techniques necessary for rebuilding an image with the help of shapes registered in the data base. It will appear that as soon as they are quantized the simple shapes make up a visual alphabet and that data bases in which complex shapes are registered are then dictionaries of words written with this visual alphabet.

At last in order to encode shapes belonging to an image, we will be interested in curves that ful(ill plane and more especially in the Peano-Hilbert curve ([8]) that has the property to provide a nearest neighbor path in a space of any dimension. It will then appear that the corresponding path makes up a sentence from words belonging to the previous described dictionary and that its syntax may allow to de(ine more globally the nature of a visual scene and the kind of picture scene that has been captured during shooting.

2. Image segmentation

2.1. Introduction

The project is relying on the works led in the past in ADERSA in order to develop a new modeling technique based on a multiple piecewise regression for continuous process control ([28]).

It is a regression technique based on the recursive dividing of a data set applied in an orthogonal way to its main inertia axis using the hyperplane crossing the gravity center of its associated points cloud, until obtaining a minimal approximation error.

The result of this decomposition led in structuring data along a binary tree where data is gathered into subsets of neighboring data (itting a same linear model according to a given approximation error.

This approach is belonging to the (ield of classi(ication and regression trees also named CARTs .

2.2. Experimental foundations of a regular piecewise decomposition

This modeling process has been used for developing parametric command laws in the (ield of continuous process control.

In order to get the linearization of a higher order process, data is re-sampled by inserting as many delays on the output as necessary in order to be able to take in account the physical system order when this one is known. This allows to set up a (inite difference scheme inside the initial data and to estimate the system dynamic behavior, for instance according to its speed and its acceleration.

In use this approximation method showed continuity issues when getting away from the gravity centers of data clouds for crossing from one subset to another neighbor one according to the recursive dividing of modeling data.

In order to reduce the generation of irregularities when using a piecewise linear model, it has been developed a continuous (itting technique relying on the barycentric interpolation of estimated data over two connected pieces inside the modeling tree at the neighborhood of their separating hyperplane.

The continuous (itting is computed as the composition of forms at the straight higher order according to the following way:

where M is the point in the space where it is expected to know an estimated value,

and are the gravity centers of point clouds linked to the left and right children of a node in the modeling tree,

It is similar to a multidimensional (itting scheme of a B-Spline function where the support points are the gravity centers of the point clouds connected according to their separating hyperplane. It produces a new model of directly upper order to the two previous models, that are aggregated by regularizing the crossing from one to another one at the neighborhood of their common boundary, and this according to all the variables of the two initial models, this meaning in a multiple way.

If the modeling data are corresponding to in(initely continuous and differentiable forms the aggregation scheme can be applied in a recursive manner up to the root of the modeling tree : the number of levels of the tree will de(ine the continuous model order so generated.

In practice, it has been implemented using a tuning parameter enabling to reduce its action to the neighborhood of the separating hyperplanes of the piecewise linear model and it has given users satisfaction for reducing bounces when applying control command laws.

The provision of a piecewise linear model regularized up to the order allowed by the hierarchical decomposition of a digitized data set, is monitored by an error threshold given at the generation of a linear model before its regularization. It corresponds to an approximation method trying to (ind a parametric model satisfying this threshold.

The local (it of a model over a digitized data set allows to know until which precision the data is following a linear continuous behavior by smoothing the noise belonging to the modeling data set.

Let us forget for a while the presence of this noise and let us put our attention on the only interpolation scheme provided by the B-Spline (itting.

By relying on the multidimensional data set hierarchically modeled in a space regularly decomposed ([28]), let us now directly apply this interpolating scheme on the data belonging to the nodes of the deepest level in the data set modeling tree and merge two by two nodes while climbing progressively the tree up to reach its root.

At the deepest level in the tree, these data are corresponding to a step function from which it will be possible to compute upper order interpolating functions by successively applying in each space variable the following scheme:

where i is the index number of this space direction and the corresponding coordinate of point M ,

and are the same index coordinates of the gravity centers of the point clouds linked to left and right children of a father node,

and are the forms recursively computed from the functionals recorded in terminal nodes with the help of this interpolation scheme.

By aggregating in such a way the forms met in climbing the tree of the interpolating pyramid, if the forms linked to two children nodes are the same the recursive interpolation scheme will generate the same form at father node level. It will be then locally reached the maximum interpolation order for this data set. If the step function corresponds to the digitization of a continuously differentiable function, we will get its maximum order polynomial expansion according to each variable of the data digitization coordinate system.

2.3. Piecewise regular decomposition of an image

, where is the number of spectral bands of a multispectral image.

For instance, here is shown hereinafter a color picture from which have been extracted on one hand the luminescence image, on the other hand the three colored components red, green and blue. The luminescence image is made from a single monochromatic image when the three colored component set constitutes a multi-chromatic image. These ones are displayed in the form of gray-level pictures, then as a surface drawing that each time described the patch of luminescent intensities in each bandwidth so as to well feel the globally continuous character of data at rough or mid-scale.

Figure 1 : Digital picture representation

In order to get closer to the result of the application of a linear approximation in the least squares approach for an image using multiple resolution levels, it has been chosen to show it using a multi-resolution averaging of this same image, but restricted to the single luminescence image. At each different resolution level is drawn the corresponding map in addition to the averaged image.

Figure 2 : Multi-resolution image approximation

The aim is to localize the irregularities of present order and to extract a separating line inside the image planar domain enabling to divide the present data set into two more regular subsets.

After that a threshold is determined so as to separate the singularities from the other approximation errors in the following way:

or the histogram is bi- or multi-modal, then the threshold will be de(ined by the highest index valley so as to distinguish the singularities from the other errors that will be progressively reduced.

With the help of this equation, the V domain can be divided into two subsets made from more regular data than that of the initial set by evaluating the sign of the following expression for each of its points:

With the help of the previous results, a recursive process of image dichotomous decomposition can be performed in the following way:

initially, the data set to process is composed of all the image data, that is meaning of where with and

the possible irregularities present in the building data set can be detected in the following manner :

Due to this process, a set V has been decomposed into two subsets and so as to gather closely to their boundary the strongest irregularities for reducing their number in each new subset.

The dividing process can be repeated on each subset provided by the decomposition until succeeding in globally ful(illing the same approximation error over the initial image set : it will be the (itting precision of the so produced image piece-wise linear model.

In order to perceive the kind of results that would provide such an approach, the previous multi-resolution display of the different averaging levels of an image is now complemented with the images of the differences between the average of a given level and its value before averaging. Then these difference images are thresholded according to a similar mechanism with the one that has been described in this paragraph so as to only keep the irregularities observed at the current level and drawn them as a surface form. At nearby a scale factor, the irregular areas can be retrieved from one level to another one, that should enable to progressively settle the separating straight lines over the ridges of these ones by successively analyzing every level until reaching the full resolution of the initial image. It would correspond to move from a conventional wavelet scheme to a geometrical wavelet scheme by trying to localize the areas where the image gets a regular behavior so as to be able to (it on them (inite functional expansions.

Figure 3 : Irregularities detection in an image

By starting this iterative process from a full image, it will be got a piecewise linear hierarchical decomposition in the form of a binary tree of (irst order continuously differentiable subsets.

For distinguishing upper order continuities from singularities found in the image, the uncertainty concerning the origin of these irregularities can be removed by successively aggregating models of the tree structure into next upper order models by climbing up the tree. By applying this principle to the two models linked to two terminal nodes children of a same tree node by combining them into a new upper order model that can be associated to their father node. It is necessary to assess the approximation error of the new model over the union of all the data belonging to the two pieces associated to the nodes to be aggregated. If the associated model error ful(ills the expected (itting error, then it was truly a higher order irregularity and not a singularity locally dividing the data into two subsets : it can be envisioned to replace the two children nodes by the father node to which will be linked the new model coming from the aggregating of the two previous models. If it is not the case, then the aggregating process must be stopped and the two children nodes must be kept unchanged with their own models.

There is a third eventuality after having started to aggregate models: it may be that each model linked to two brother nodes can also apply themselves to the union of the data associated to these two nodes : it means that the model order do not temporally increase and that the two children nodes are equivalent at the expected (itting precision. The two children node models can be aggregated into a single one by associating to him the model of the node giving the best (itting error over the two node sets or also the average of the two children models. The phenomena may occur several successive times while climbing up to the root in the modeling tree.

The idea of aggregating two pieces of a given order in order to provide a new one at the consecutive upper order over the union of two adjacent domains is relying on the example of Bernstein polynomials that allow to progressively set up a polynomial interpolation scheme over a mesh of points representing a curve from which it is expected to get an analytic expression ([7]).

This scheme is at the foundation of the building of Bézier curves and their composite version the B-splines.

Due to the creation of intermediate points by barycentric averaging, the Bézier polynomials are built over a succession of convex combinations which implies that the corresponding curve stands inside the convex hull of the points of this curve carrier. This enables to tune the variations of such a polynomial curve and provides it a good regularity.

The proposed way for performing the model aggregation in the aim to get an upper order model takes the use of barycentric averaging by resting on the gravity centers of every data subsets coming from the image hierarchical decomposition. It then leads towards a polynomial approximation of order n no more in each variable of the surface manifold domain, but in all the variables of this one.

then the barycentric interpolation scheme over these two subsets and generates

the straight upper order estimate in all the variables as it follows:

That is by developing in a matrix manner :

As in the building scheme of Bézier polynomials, the process of model aggregation can be recursively applied by climbing the model tree up to stop when the approximation error is no more controlled. If the process can carry on up to the tree root then the image is only made from a single surface manifold which differentiability and continuousness order is equal to the modeling tree depth before anyone aggregation operation.

The (irst aggregation pass will generate the expression of a quadric, that is a polynomial of order 2 in all the variables x and y of the domain of each functional . The second aggregation pass the expression of a cubic, a polynomial of order 3 for every same functionals. In order to provide rendering descriptors by using these pieces of information, it will be not necessary to go beyond the order 3, as it will be shown afterwards.

Working with locally continuous linear estimates allows to free ourselves in some way from the in(luence of capture and digitization noise and not to have to preprocess data before its use. The control is performed by the expected (itting precision of the estimate mean quadratic error over its domain.

The hierarchical aggregation using increasing order polynomials does not enable to provide an image segmentation according to the neighborhood systems classically associated to metric distances, but rather an image partition according to the ultra-metric distance induced by the image treelike division. It is a topology rougher than a metric topology that provides a more broken up segmentation than a classical segmentation. This characteristic should not hamper the processes of image coding and pattern recognition.

At the opposite the predicate used for segmenting images is not an iso-color predicate, but an iso-model one.

In place of a model barycentric combination, another method can be also envisioned for replacing the aggregation of two tree brother models into a new upper order model, it is the linearization of the problem at the order just upper from the initial models and by (inding it using the least squares approach , that is to (ind:

If the linearized expressions are similar to those provided by barycentric combination, then it can be considered that the aggregation by a barycentric combination is the fast transform of increasing order linearized models. As the two methods provide (inite expansions at a given order in all the variables, it can be enable to rely on the properties of Taylor expansions for insuring the uniqueness of these expansions whatever are their orders, when they are existing ([6]).

Whatever is the used approach, it should be better to priory check if lower order models can directly apply on the union of the two domains while ful(illing the expected modeling precision. If it is the case the best lower order model is propagated up to the tree higher level. If it can be possible to aggregate children models by climbing up in the tree, it means that the maximum modeling order has been reached.

After aggregating models up to the p order, as well as their associated tree nodes, the set of

resulting parts constitute an image partition into data subsets that can be approached by

piecewise models of p order. Note the image partition where n( p) is the

number of class C

2.4. Main algorithm for image any order regular decomposition

On the basis of the previous developments, it can be build an algorithm that should perform an image piecewise regular decomposition of any order:

BEGIN

END

FUNCTION Piecewise regular decomposition ( V , order, precision)

BEGIN

In this algorithm version, the maximum aggregating order for models is bounded, but it can be envisioned not to restrict it so and to keep only analytic coef(icients up to the order 3, knowing that the upper order coef(icients give a less important contribution to the modeling if it is referred to the formulation of Taylor residuals.

Before model aggregation, it is not checked if there is one child model can also apply over the data of the neighbor node, because it is probably the case and the other model should also apply on the neighbor data. In that case the aggregation would provide the equivalent of a simple averaging of the two models even if higher order coef(icients are excluded.

At the end in order to validate the approach, it should be assessed in addition that if the barycentric averaging can give a similar result to a least squares linear estimation on data linearized at the corresponding order to the one provided by the aggregation used in the previous algorithm.

If it is referred to the representation in the wavelet form used for illustrating what kind of results could be awaited in applying this approach in image modeling, it should be possible to simplify the algorithm that have just been presented, not in trying to make an image piecewise linear decomposition but by starting aggregation over the piecewise constant representation delivered by the image multi-resolution averaging, which is a step function more or less at a mean quadratic error.

More generally, it can be noted that the progressive decomposition of a global linear model into piecewise linear models stops when the noise in(luence has been monitored by the least squares method over all the pieces and that the distinction between irregularities and singularities is made by aggregating models at upper interpolation order up to that the approximation error no more ful(ills the expected precision : in this case, there is locally a singularity that the method does not succeed to handle.

3. Calculus of descriptors

3.1. Describing visual shapes

At the end of the segmentation, an image appears in the form of a visual element decomposition that has the following properties :

each element can be represented as a functional or a collection of functionals leaning on a connected subset of the image domain ;

each of these functionals can be approached by a (inite order polynomial more or less an error ;

those that are proper to the functionals that are leaning on these pieces that can be approached by a (inite order polynomial.

3.2. Descriptors associated to the shape domains

The moments of the domain V at the order 0, 1, 2 and 3 are given in a continuous way by the following formulas:

From the order 2 moments are deduced the main inertia axis and its angle relatively to the observation coordinate system. This one is computed more or less 180° and the uncertainty around 360° is solved by examining the sign o the asymmetry according to the main axis after having moved in the own coordinate system of the domain.

The translation invariance of moments can be deduced by their correction according to the domain gravity center coordinates.

Actually, it can be deduced from the domain inertia ellipsoid for the major axis only a direction but not an orientation.

The crossed moment gets null and the order 3 moments in the directions and become:

The domain main axis asymmetry is a characteristics that belongs to it and which is stable once determined the orientations of inertia axes.

one’s disposal spatial characteristics about this domain that are translation and rotation invariant in the shooting plane.

These characteristics become homothety invariant by dropping the domain area and by normalizing the remaining moments by the major inertia axis value. It will remain:

It should be added to the shape localization information the scale factor constituted by the major inertia axis in order to correctly replace this one in the image plane.

3.3. Descriptors of shape rendering

In the aim of providing measures enabling to describe the surface pieces coming from the piecewise regular decomposition of a digitized image, it will be focused on models of at most order 3 for computing these measures or their expansion up to order 3.

For each subset , of the image partition got by regular decomposition up to order 3, each spectral band can be approximated by the cubic:

For doing it, place ourselves in the coordinate system of the tangent plane of the spectral band estimate centered at the gravity center of the piece .

So moving to the new coordinate system associated to the tangent plane of the surface centered at the

domain gravity center of the piece returns in applying the following linear transform on every

points of the concerned surface :

that is also:

in order to keep unchanged the tangent sign when the rotation matrix will

be built.

In the coordinate system linked to the tangent plane centered at the piece gravity center, the surface

equation of the local image estimate takes the form of the reduced expression :

Because it corresponds also to the (inite expansion of the surface equation at the point M that can be

written by using the Taylor formula up to the order 3 :

To express the local image estimate in the coordinate system associated to the tangent plane centered at the piece gravity center turns then in canceling the constant term and the gradient of the function in its polynomial expression.

Carry on the reduction of this equation by calculating its expression in the own coordinate system of the quadric

4. Building perceptual groupings

4.1. Introduction

At the beginning of XXth century, psychologists have built a holistic theory for perception named Gestalttheorie. It was about looking for what are the laws that are ruling human visual perception. Psycho-sensory experiments have allowed to (ind six main laws:

the good shape law: a shapeless part set tends towards being (irst perceived as a simple shape, symmetric and stable;

the law of continuity: gathered points are perceived the expansion of each others as a single shape;

the law of proximity: the points the nearest from each others are (irst and foremost gathered;

the similarity law: if they are not enough near, the most similar between them are gathered into a single shape;

the law of common destiny: moving parts having the same trajectory are perceived as belong to the same shape;

the law of familiarity: the most familiar shapes are perceived as being the most signi(icant ones.

This perception approach relies on the spatial structuring of visual shapes and enables to envision and develop partially hidden pattern recognition methods by identifying wholly or partly an object in one’s view (ield.

The computation of digital descriptors over visual shapes (good shape law) provided by an image segmentation (law of continuity) allows to recognize only fully visible objects in the view (ield. Which is hardly possible knowing that human vision only provides a planar view on a tridimensional physical universe by using a projective transformation.

In order to solve this issue, it is proposed to adapt the attribute calculus so as to include some laws of shape perception by aggregating the closest shape attributes (law of proximity) along the treelike path of the shapes that are belonging to an image.

This approach is holding simultaneously the advantages of the statistical pattern recognition (similarity law) as well as those of structural pattern recognition (law of familiarity). It has been already tested in the framework of a past application for facial recognition in planar color images.

The monitoring of moving groupings image by image in a video should illustrate he law of common destiny.

The attribute calculus implies that the object of interest does not include some hidden part and may appear in the form of a single connected component. These conditions are generally not satis(ied: only one part of a tridimensional object can be viewed by a single observer and image segmentation may provide a decomposition of the observed object into multiple connected components. In order to solve this issue, it is proposed to insert an intermediate computing step based on the following holistic scheme: neighbor regions are successively aggregated in the form of compound regions until being able to rebuild as wholly as possible the interesting object. The calculus of descriptors is so generalized to compound objects and an object will then appear as a series of components nested the ones into the others, that is to say a series of complex shapes. Each one among them will have at one’s disposal its own attribute vector and should be individually identi(ied: so it will be enabled to recognize an object without seeing it entirely by identifying part of the parts that are composing it.

4.2. Shape aggregation

Moment-based attribute calculus assumes that objects to recognize are viewed as a whole. It is not actually ful(illed with shape recognition applications. Statistical learning made from the recording of several views of a same object observed with distinct positions and attitudes, allows to create a suf(icient dense set of feature data for expecting to recognize any incoming new view standing near some previously recorded ones.

But the moment-based attribute calculus using low-order formulas allows only to identify compact shapes. On another side, the image piecewise regular segmentation would provide a richer object decomposition in more numerous regions. To take bene(it of this property, it is proposed to reuse the shape aggregative process used by the past in facial recognition ([30]). Usually region merging proposes to aggregate adjacent regions. To avoid building up a regional adjacency graph, an alternate aggregative approach based on region nearness has been developed. During region aggregation, attribute calculus is performed without centering, normalizing and scaling. As moments are integral formulas, attributes can be then straightforwardly summed up before being post-processed. Using gravity center information, a regional gravity center quad-tree can easily be built and allows to merge nearest regions using a bottomup tree traversal. As the merging process is recursively repeated up to tree root, the various objects of interest belonging to the image can then be successively decomposed in a nested set of regions.

After attribute summation, the features are centered, normalized and scaled at every decomposition level of an image model. So an image object can be then described as a variable length series of feature vectors. At training time, the labels are applied on vector series allowing to provide a recognition step running on wholly or partially visible objects.

4.3. Facial recognition example

The statistical learning step made from the recording of several views of a same person, observed according to different positions and attitudes, allows to create a dense set of characteristics data in the aim to recognize a person viewed in a situation close to those that have been already registered.

The example here presented is based on color image segmentation. The shape aggregation method takes in account most of the region groupings that can appear in the decomposition of a face. As the aggregation process is recursively repeated up to the top of the tree of the region gravity centers, the face of a person can be progressively recomposed as a set of nested regions.

In this example, only shape domain descriptors are taken in account. After having summed up the attributes of the whole regions, the characteristics can be centered, normalized and scaled at each decomposition level of a face model. So a face can be described by a variable length series of characteristics vectors. During learning, labels are assigned to these vector series in order to set up a facial recognition stage.

An example of shape progressive aggregation is shown underneath. The attribute list of the various objects provided by aggregation of such images is next after displayed. They are consisting in localization and shape attributes. The gravity center coordinates, the angle and the scale factor are localization attributes of the corresponding object in the shooting coordinate system. The surface, the eccentricity and the asymmetries compose the shape attributes of the object. They are centered, normalized and scaled values, which are independent from any translation, rotation and homothety in the capturing coordinate system.

In this speci(ic application, the decision of authenticating or identifying one person among several is taken according to the label gathering the maximum number of recognized simple or compound shapes.

Figure 4 : Hierarchical aggregation of regions

4.4. Shape aggregation algorithms

In the concerned case, it is unnecessary to build the gravity center tree of all the regions that make up the image, because it is the same that is provided by the image piecewise regular decomposition.

It will be just enough to modify the piecewise regular decomposition algorithm used for aggregating shape descriptors by deleting the test about the order not to overcome concerning them and to restrict the algorithm in only keeping the order 3 coef(icients generated by barycentric combination. Here is the piecewise regular decomposition algorithm revised in this intent while introducing the explicit handling of the data treelike structure:

BEGIN

END

PROCEDURE Piecewise regular decomposition ( V , root, order, precision)

BEGIN

END

At the difference of the previous computation of shape domain descriptors, this computing step must not being stopped at the level of terminal nodes, but must be carried on up to the decomposition tree root. It will be only after having reached it that the shape domain descriptors could be entered and normalized up to the tree root. It is the same with the shape rendering descriptors where the barycentric combination truncated at order 3 will be used for computing the coef(icients of the aggregated up to the third expansion order.

4.5. Similarity in multidimensional hierarchical modeling

For implementing a shape recognition process based on visual descriptors, it is proposed to focus our attention on multidimensional hierarchical modeling techniques.

The modeling of multidimensional data sets by the means of hierarchical treelike structures implies that the sets are managed using an ultra-metric distance. Concerning regularly divided spaces, the structuring distance is the Hausdorff distance: the distance between two points in the space is the size of the smallest subset in the space that is holding together these two points. It is not a distance in the conventional meaning that allows to measure the distance between two points from the space but a pseudo-distance used for comparing subsets between them.

As the characteristics describing objects are normalized, it might be possible to use an unweighted euclidean distance for de(ining a similarity measure. But it would be rather considered each characteristic vector as a point belonging to a multidimensional space of dimension equal to the number of characteristics computed over the image objects. Consequently the characteristics coordinates are straightly managed by the treelike structure representing the learning set and this enables to compare directly each new vector with anyone vector registered during learning. So conditions are more favorable for assessing similarity between known and unknown objects. This skill will be especially useful for detecting and identifying moving objects in image series and eventually for enriching a learning base with the characteristics computed on these new objects. Moreover, this similarity measure induced by a treelike structure enables to sort all the objects belonging to a data base according their similarity with a sample vector (query by example) or according their self-similarity (data base sort).

5. Building visual dictionaries

5.1. Vector quantization of shape descriptors

For numerically describing simple and compound shapes provided by an image piecewise regular segmentation, we have at our disposal for each shape:

an attribute vector associated to its shape domain, based on the computation of generalized moments, and consisting in values invariant to the geometrical transformations that can apply on an object in the image plane;

and for each spectral component of the image, a complementary vector describing the shape rendering as well as invariant but in the functional space of the image.

In order to get a description up to a given precision for this numerical information, it is proposed to use a treelike structure for handling these numerical data vectors ([13],[28]), enabling to both represent images and visual objects attributes so as to perform shape recognition processes using statistical learning.

As a reminder, it is an indexing scheme based on binary tree structures enabling to represent spaces of any dimension regularly divided and used for registering numerical data coming from a learning set that has been labeled or not with the names of objects to be recognized ([11],[12]). For representing a set of numerical data belonging to a k dimension space, the data must be registered into a binary tree where the attribute vectors are used as geographical coordinate keys for retrieving the information from the tree. For localizing a data vector, it must be followed a traversal similar to the one made for searching in a quaternary or octernary tree:

according to the divide and conquer paradigm, the multidimensional space is divided half by half successively along each space direction until reaching the modeling precision given as a search parameter;

at each dividing step of the space, the vector coordinates are assessed and the decision to go towards such or such left or right half-space is taken until reaching the required analysis precision in the tree.

Let us take the example of the addition of a normalized vector, as it is the case of shape descriptors, to a binary tree modeling the unitary hyper-cube of dimension k at the precision r :

BEGIN

END

PROCEDURE Addition of a normalized vector to a binary tree (vector, root, minroot, maxroot, level, dimension, depth)

BEGIN

END

This recursive procedure enables to record the presence of a vector of any dimension in the unitary

space of the same dimension by handling this space in the form of a binary tree, that is equal to map

This one is created empty using the instruction root ← WHITE, that generates a white tree-node representing an empty space of any dimension at any precision. The tree branches grow progressively at each vector addition with the help of the operator FISSION and two iso-colored children nodes representing two neighbor vectors at precision r are merged into a single node at father level with the help of the operator MERGE. The nodes BLACK and WHITE are tree terminal nodes that tell if some data is present or not in the branch that have been traversed.

Mapped into a binary in such a way, a multidimensional space can be seen as a series of quadrants, octants or hexadecants successively nested the ones into the other ones according to the space dimension. Their union gives back the whole initial space in which they are mapped and a simple binary searching enables to retrieve any point or subset belonging to this space modeled in such a way. The precision is a parameter used during the building or the search of information in this space and corresponds to the traversal depth of the representation tree.

Then a learning data set can be represented as a collection of labeled trees built at a given precision where the labels are the identi(iers of the objects to be recognized, indexed by their descriptor vectors computed during image analysis, at the rate of one tree speci(ically for the domain attributes and as many trees as they are spectral components. More the precision is low, more the quantization will be rough and more it is high, more it will be detailed.

5.2. Visual alphabet of simple shapes

The simple shapes provided by an image piecewise regular segmentation consist in:

an attribute vector describing the domain of a shape, based on the computation of generalized moments;

and for each image spectral component, a complementary vector describing the rendering of a shape.

The attributes describing the domain of each shape are the values:

the normalized asymmetries.

Concerning each image spectral component, the rendering of a shape is described by the series of

values: ,

From a practical viewpoint, it may restrict a shape to the convex hull of its domain, that is standing in the

enough for describing a quadric surface.

All these values are independent from the linear transformations that may apply on shapes in the visual space in which they have been captured. After quantization, they can be reduced to (inite subsets of normalized values that can be seen as visual alphabets enabling to encode images in the form of series of ideographic characters as it was the case with the (irst writing systems developed by humanity. According to the precision used for quantifying simple shapes, the alphabet so produced can be more or less rich.

Let they be the descriptors of shape domains as well as their rendering ones, the quantization is performed by building their trees in their own coordinate systems at a precision wished for recognizing them, then building them back during their synthesis.

As the space is unitary, the algorithm presented in the previous paragraph can apply not only for encoding the values of these descriptors, but also the points sampling the shape of each domain so as to be reused during image synthesis. So each shape domain can be discretized in the unitary plane and can be represented by a quaternary tree, for which it has been shown in the past that is less cumbersome than a run length code ([28]). As the algorithm is additive, several occurrences of a same domain can be accumulated at learning step so as to get a visual description rich and stable as it has already been told about moving objects in a video sequence.

Concerning the descriptors of shape rendering, it is not necessary to go beyond the quantization of their coef(icients, it will be enough to discretized the expression of their coef(icients over the shape domain for rebuilding the shape as it has been captured before any processing.

The description of simple shapes as it was just described may also be compared to the phonetic systems used by the present alphabets made from consonants and vowels whose conjunction allows the development of a syllabic writing: each simple shape is the conjunction of the domain of a geometric shape and an analytic expression detailing the visual rendering to be performed for rebuilding the simple shape initially captured and extracted from an image. The domain can be then assimilated to a consonant and the rendering to one or several vowels. That is why this primary visual set has be called viseme by analogy with the wording of phoneme that the syllabic writing attempt to represent: the phoneme is described as the smallest discrete unit that can be distinguished by segmentation of the speech ([3]).

5.3. Visual dictionary of complex shapes

Grouping simple shapes into compound shapes is at the foundation of the technique of partially hidden shape recognition proposed right here and enabling to identify tridimensional real objects observed in projective geometry by a planar sensor. According to the position and the attitude of the observer in front of the observed object, it will be produced various gathering series of simple shapes whose tree views will describe the organization and will enable to identify series of common shapes as identical branches or sub-branches. Each tree view of simple shapes describing a given compound shape consists in a word built using the previously described alphabet of visemes. So a same tridimensional object observed according to various viewpoints will generate as many words using this alphabet as they will be many occurrences for describing it: from a lexicographic viewpoint, they are each others synonymous.

So in the example of face recognition, the attribute table establishes according to this way one of the words enabling the indexing of the face that has been segmented in the aim to be labeled with the name of the person to which it is belonging. More precisely, it is only describing the consonants of its composition, the vowels being hidden by the segmentation performed in this particular application (rendering not taking in account).

The set of compound shapes found in an image then enables to build the dictionary of image shapes: it initially owns an organization similar to the one provided by a piecewise regular segmentation of the same image, but it can be distinguished by the fact that it is indexed by an alphabet of simple shapes in place of surface pieces of a given order. This dictionary may be enriched not only with the shapes belonging to a single image, but also with a video sequence by identifying the groupings that are following a common movement.

Images and video sequences can be multiplied for consolidating its building and recognizing similar sequences and identifying visual synonyms.

Therefore it is dealing with the building recognition bases by using learning:

supervised learning by naming the observed objects in the (irst image of a sequence corresponding to a still video frame, then by enriching the dictionary by following still and moving objects image by image;

unsupervised learning by leaving the enrichment process identifying by itself similar objects and by naming them afterwards the processing end.

So by labeling the similar objects of the learning base, the dictionary gains an ontological dimension in being able to name the observed objects. The richness of the vocabulary recorder into a visual dictionary varies with the depth of the indexing tree of the compound shapes that it is handling.

5.4. Synthesis of simple and complex shapes

With the help of the dictionary of complex shapes located in an image, each complex shape must be decomposed into a series of simple shapes that will be put in location and place, as it has been registered:

by placing the shapes domains there where they have been localized, that is at in the image plane after scaling with the ratio ;

For (illing the holes and avoiding to have some value duplication, it should be preferable to apply a median (iltering on the image of the shape domain labels before digitizing the analytic expressions of each spectral component.

6. Image and video coding

6.1. Optimal path along the blocks of a regularly divided image

A planar image can be regularly divided into blocks of same size using treelike representations as for instance quad-trees or quaternary trees. The Hilbert or Peano-Hilbert curve enables to go all over these blocks by moving towards the nearest neighbor by crossing once and only once the same block. Having got a quad-tree model of an image, it is easy to perform a recursive traversal that provides such a curve for visiting each point of a given image ([8]).

it is a recursive algorithm visiting the nodes of a quad-tree according to a depth-(irst traversal;

at each tree-node it is linked an orientation that will de(ine in which order its children nodes must be visited;

when they are visited, the orientation of each child node is computed back from the one linked to its father so as to make the visit of its grandsons homothetic to this one used for its children more or less a given rotation;

for different orientations can be provided each one corresponding to a building pattern of the Hilbert curve that can be deduced one from another one according to a rotation of 90°.

Figure 5: Four basic patterns of a planar Hilbert curve

These four patterns are shown above ([27]) including their developments at the following resolution.

6.2. Visiting path for the objects of an image

It is proposed to use such a path for visiting the different pieces and their groupings in an image for performing its encoding. As there is an identity of structure between the image piecewise regular decomposition and the tree of the gravity centers of the same pieces, it can be attempted to visit all the pieces by moving towards the nearest neighbor according to the location of their gravity centers. That will be also equal to minimize the eye motion of an observer that would rather examine successively each shape belonging to the image by viewing them once and only once.

Here is for instance one of the possible path enabling to visit several towns in some country, by minimizing the movements from one town to another one ([29]). Their coordinates are given in a cartographic way (latitude and longitude on the terrestrial planisphere).

Figure 6 : Map of the main towns of France in polar coordinates

Figure 7 : Search of a visiting path by a Hilbert scanning

Figure 8 : Nearest neighbor visiting path of the main towns in France

6.3. Visual syntax and phraseology

Such a visiting path is envisioned for encoding the simple shapes as well as the complex shapes found in an image. It enables to list the various shapes making an image by moving from shape to shape at the nearest neighbor. From a syntactical point of view, it enables to represent an image as the sentence of a series of words registered in a dictionary of complex shapes that are themselves written using the simple shapes alphabet whose consonants are the domains of these shapes and vowels the analytic expressions of their rendering.

As the path will constitute an image scan starting nearly always at the same place and following the same motion rules, it can be envisioned that these sentences may be categorized so as to describe :

the composition of the visual scene by referring to conventional scenes (see, land, mountain landscapes, group overview, portrait, etc.);

6.4.Multidimensional sorting of a shape base

The modeling of regularly decomposed spaces of dimension k by trees of order embedded into binary trees shows that this representation model owns the cardinality of the continuum, that is to say that it can be found a continuous transformation that maps subsets from straight in R .

This means that it can be found paths enabling to visit in a continuous way once and only once the whole set of a base: in the case of -trees by moving towards the nearest neighbor according to the Hausdorff distance.

For a multidimensional data base, it is equal to (ind a visiting path maximizing step by step the similarity index of registered data, that is sorting the data of the base according to their similarity.

In the research project quoted in introduction, it was envisioned to provide an image editing module where shape libraries may be built by learning and indexed using their descriptors so as to be able in reaching them in a sorted way according to their similarity in the multidimensional modeling space, moving towards the nearest neighbor at variable precision.

Conclusion

the segmented image enables to de(ine homogeneous objects (segmentation of the image domain into components sharing common properties in the functional image space);

the geometrical structure: decomposition of the segmented domains over a base of geometric primitives, description level of an image equal to graphic information;

the relational structure: grouping of subsets of the initial image partition by discovering spatial relation or similarities with known organization models.

It echoes to the article that publishes at the same while J. Mariani in a special issue of the periodical « La Recherche » focusing on « L’Intelligence Arti(icielle » ([3]) and in which he describes the speech recognition as a process decomposed into four levels plus one more proper to the aimed application:

the acoustic level: level where the phonic signal is sampled and from which are extracted the acoustic features;

and at last the pragmatic level: level linked to the application, usable in a (inite universe de(ined in advance, without that the previous steps cannot be validated.

Figure 9 : In formation pyramidal nesting in a perceptual system

As it is shown by the developments proposed in the research program that has just been described, it cannot prevent oneself to establish a link between these two approaches of arti(icial perception in order to state a parallelism as it is shown above in the form of a pyramidal organized diagram.

By taking back the expected results described in the research proposal submitted to the initiative OpenFET of the VIth PCRD, the project SDVC (“Self-Descriptive Video Coding”) tries to build the foundations of a solution proposing an innovative representation for describing the visual information that should allow to de(ine visual ontologies and to handle visual content similarly to textual content. With the help of this modeling scheme, SDVC should provide an ef(icient tool for querying quickly mixed content (textual and visual) and might open a similar manner for taking in account audio contents in the future.

In this objective, the project SDVC aims to settle innovative research activities and should enable to improve the state of art about this subject:

by proposing a description of the information at an intermediate level enabling to build more ef(icient retrieval systems for the image and video information;

by taking in account the perception laws so as to bridge the gap between digitized objects and the perspective views of objects belonging to the real world;

by managing visual objects independently to their localization in the view plan and to a set of given geometric transformations;

by proposing an indexing scheme from which visual ontologies can be inferred and by enabling to de(ine a kind of visual alphabet for simply connected shapes and a visual dictionary including perceptual groupings;

and consequently by providing a means for bridging the semantic gap existing at the heart of visual information ([26]).

The project shows a strong exploratory perspective and open the path for future investigations about the representation of contents and technologies of information self-descriptive coding.

The project has been selected during the (irst step of the project call Open-FET (open project for future and emerging technologies) under the name of GROOVIES, but has not unfortunately be funded. It has been submitted at a while where research on multi-scale representations and geometrical wavelets was intensive ([17]-[23]). The same representations are presently mobilized in order to try and explain the ef(iciency of learning using multi-stage convolutional networks ([24]). The works that have just been presented perhaps might unveil parts of some questions, notably for knowing how many stages are necessary for insuring the good operation of such a network.

Bibliographic references

1. D. Lecomte, D. Cohen, Ph. De Bellefonds, J. Barda, Les normes et les standards du multimédia –

2. D. H. Ballard, C. M. Brown – Computer Vision - Prentice Hall ,1985

3. J. Mariani - La Reconnaissance de la Parole in "L'Intelligence Arti icielle", Numéro spécial La Recherche, Octobre 1985

4. Th.. Pavlidis - Algorithms For Graphic and Image Processing. Computer Science Press, 1982

5. F. P. Preparata, M .I. Shamos - Computational Geometry: An introduction. Texts and Monographs in Computer Science - Springer Verlag, 1985

6. A. Avez – Calcul différentiel – Maîtrise de Mathématiques Pures, Masson 1983

7. J.-J. Risler - Méthodes Mathématiques pour la CAO – Collection Recherches en Mathématiques Appliquées - Masson 1991

8. J.-C. Simon – La reconnaissance des formes par algorithmesInformatique, Masson 1984

9. G. Gaillat – Méthodes statistiques de reconnaissance des formes automatique – E.N.S.T.A. 1983

10. L. Miclet – Méthodes structurelles pour la reconnaissance des formes – Collection technique et scientique des télécommunications, Eyrolles 1984

11. J. L. Bentley - Multidimensional Binary Search Trees Used for Associative Searching - CACM, Vol.

12. J. L. Bentley - Multidimensional Divide-and-Conquer

13. P. Fiche, V. Ricordel, C. Labit - Etude d'algorithmes de quanti(ication vectorielle arborescente pour la compression d'images (ixes - Programme 4 Robotique, image et vision - Projet Temis -

14. B. S. Manjunath, P. Salembier, T. Sikora (Eds.), Introduction to the MPEG-7 Multimedia Content Description Language, Wiley & Son, 2002

15. J. M.Buhmann,J. Malik, P. Perona, Image recognition: Visual grouping, recognition, and learning, 5th annual German-American Frontiers of Science symposium, Postdam, June 10-13, 1999

16. J. Luo, C.-E. Guo, Perceptual Grouping of Segmented Regions in Color Images. Pattern Recognition 36(2003) 2781-2792

17. Ch. S. Sastry, A. K. Pujari, B. L. Deekshatulu, C. Bhagvati, A Wavelet Based Multiresolution Algorithm for Rotation Invariant Feature Extraction, Pattern Recognition Letters, Vol. 25, Issue 16 (Dec. 2004)

18. T. Lindeberg, Scale-Space Theory in Computer Vision, Kluwer Academic Publishers, Dordrecht, Netherlands, 1994

19. F. Mokhtarian, S. Abbasi, J. Kittler, Ef(icient and Robust Retrieval by Shape Content through Curvature Scale Space, Image and Databases and Multi-Media Search, pp 51-58, World Scienti ic Publishing, 1997

20. D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, vol. 60, no. 2, 2004

21. J. Romberg, M. Wakin, and R. G. Baraniuk, "Multiscale Geometric Image Processing," presented at SPIE Visual Communications and Image Processing, Lugano, Switzerland, 2003

22. S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11,

28. http://arxiv.org/abs/1605.00961 O. Guye, Hierarchical Modeling of Multidimensional Data in Regularly Decomposed Spaces: Main Principles

29. http://arxiv.org/abs/1605.00967O. Guye, Hierarchical Modeling of Multidimensional Data in Regularly Decomposed Spaces: Implementation on Computer

30. http://arxiv.org/abs/1605.01242O. Guye, Hierarchical Modeling of Multidimensional Data in Regularly Decomposed Spaces: Applications in Image Analysis