Package gmfd (Generalized Mahalanobis Functional Distance) is an R package that gathers statistical methods for both inference and clustering of functional data, based on a generalization of the Mahalanobis distance. The package supports both univariate and multivariate functional data.
Let us consider a process $X(t) \in L^2(I)$, with $I \subset \mathbb{R}$, with mean function $m(t) = \mathbb{E}[X(t)]$ and covariance operator $V$, i.e. $V$ is a linear compact integral operator from $L^2(I)$ to $L^2(I)$ such that

$$(Va)(s) = \int_I v(s, t)\, a(t)\, dt \quad \forall a \in L^2(I),$$

where $v$ is the covariance function, defined as $v(s, t) = \mathbb{E}[(X(s) - m(s))(X(t) - m(t))]$.
Then denote by $\{\rho_k,\, k \geq 1\}$ and $\{\theta_k,\, k \geq 1\}$ the sequences of eigenvalues and eigenfunctions of $v$, respectively. The Karhunen-Loève expansion decomposes the process $X(t)$ into the sum of its mean $m(t)$ and a series of the orthonormal functions $\theta_k(t)$, each multiplied by the zero-mean, uncorrelated random variables $\sqrt{\rho_k}\, Z_k$ (with $\rho_k > 0$ and $\mathrm{Var}(Z_k) = 1$).
Then we can write

$$X(t) = m(t) + \sum_{k=1}^{\infty} \sqrt{\rho_k}\, Z_k\, \theta_k(t).$$

The function gmfd_simulate( size, mean, covariance, rho, theta ) simulates a univariate sample of functional data, where size is the number of curves to be generated and mean is the center of the distribution. The user can either supply the covariance of the data through the argument covariance or, alternatively, provide the sequences of eigenvalues (rho) and eigenfunctions (theta) of the covariance matrix.
library( gmfd )
# Define the parameters
n <- 50
P <- 100
K <- 150
# Grid of the functional dataset
t <- seq( 0, 1, length.out = P )
# Define the means and the parameters to use in the simulation
m1 <- t^2 * ( 1 - t )
rho <- rep( 0, K )
theta <- matrix( 0, K, P )
for ( k in 1:K ) {
  rho[k] <- 1 / ( k + 1 )^2
  if ( k %% 2 == 0 )
    theta[k, ] <- sqrt( 2 ) * sin( k * pi * t )
  else if ( k %% 2 != 0 && k != 1 )
    theta[k, ] <- sqrt( 2 ) * cos( ( k - 1 ) * pi * t )
  else
    theta[k, ] <- rep( 1, P )
}
# Simulate the functional data
X <- gmfd_simulate( size = n, mean = m1, rho = rho, theta = theta )
S3 class funData for functional data

For ease of manipulation and visualization, an S3 class for both univariate and multivariate functional data has been created. A funData object represents a functional dataset and is defined by the function funData( grid, F ), where grid is the grid of evenly spaced points over which the functional data are defined, and F is a matrix (for a univariate functional dataset) or a list of matrices (for a multivariate functional dataset). A funData object can then be plotted using the function plot.funData, which accepts all the usual customisable graphical parameters.
# Create a functional data object
FD1 <- funData( t, X )
# Plot the funData object
plot( FD1, col = "blue", xlab = "grid", ylab = "data" )
The empirical version of the generalized Mahalanobis distance, based on the covariance estimator $\hat{v}$, can be written as follows:

$$\hat{d}_p(x, y) = \left\{ \sum_{k=1}^{P} \hat{d}_{M,k}^{\,2}(x, y)\, \hat{h}_k(p) \right\}^{1/2},$$

where $P$ is the length of the grid of the independent variable, while $\hat{d}_{M,k}^{\,2}(\cdot, \cdot)$ and $\hat{h}_k(p)$ represent the estimates of the square of the contribution to this distance along the $k$-th component and of the regularizing function, respectively. The function funDist( FD1, FD2, metric, p, lambda, phi ) computes the distance between two functional data objects FD1 and FD2 using the chosen metric. The last three parameters are used only for the generalized Mahalanobis distance.
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of the data
eig <- eigen( cov( FD1$data[[1]] ) )
lambda <- eig$values
phi <- eig$vectors
# Extract two curves from the samples to compute the distance between them
x <- funData( t, FD1$data[[1]][1, ] )
y <- funData( t, FD1$data[[1]][2, ] )
distance <- funDist( x, y, metric = "mahalanobis", p = 10^5, lambda, phi )
distance
## [1] 0.9153422
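For comparison, the same pair of curves can also be compared using a different metric. As a sketch, this assumes that metric accepts an "L2" option (a hypothetical value; check the package documentation) and that, as noted above, the last three arguments are needed only for the generalized Mahalanobis distance:
# Hypothetical example: the same two curves compared with the L2 metric
distance_L2 <- funDist( x, y, metric = "L2" )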
It is also possible to compute the dissimilarity matrix of a given sample by using the function gmfd_diss( FD, metric, p ).
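As a minimal sketch using the signature just shown (assuming the function returns the matrix of pairwise distances between the curves of the sample):
# Dissimilarity matrix of the first sample (sketch)
D <- gmfd_diss( FD1, metric = "mahalanobis", p = 10^5 )
dim( D )  # expected: one row and one column per curve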
We now want to compare the means of two samples of functional data. We simulate two samples $X_1(t), \dots, X_{n_1}(t)$ and $Y_1(t), \dots, Y_{n_2}(t)$ and consider the following asymptotic hypothesis test:

$$H_0: m_1 = m_2 \quad \text{vs} \quad H_1: m_1 \neq m_2,$$

where $m_1$ and $m_2$ are the true means of the two simulated samples. We can perform inference on the means of the two samples by using the function gmfd_test( FD1, FD2, conf.level, stat_test, p ), where FD1 and FD2 are the two samples, conf.level is the confidence level of the test, stat_test is a string selecting the test statistic to use, and p is the parameter of the regularizing function. The function returns the value of the test statistic, the value of the corresponding quantile, and the p-value of the test.
# Simulate another functional sample
s <- 0
for ( k in 4:K ) {
  s <- s + sqrt( rho[k] ) * theta[k, ]
}
m2 <- m1 + s
Y <- gmfd_simulate( n, m2, rho = rho, theta = theta )
FD2 <- funData( t, Y )
test_output <- gmfd_test(FD1, FD2, conf.level = 0.95, stat_test = "mahalanobis", p = 10^5)
test_output
## $T0
## [1] 95.34776
##
## $quantile
## [1] 12.60675
##
## $pval
## [1] 0
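Since the observed statistic T0 = 95.35 far exceeds the 95% quantile 12.61, with a p-value of essentially 0, the null hypothesis of equality of the two means is rejected, consistently with the shift introduced in the second sample.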
The functional k-means clustering algorithm is an iterative procedure that alternates a cluster assignment step, in which every curve is assigned to a cluster, with a centroid calculation step, in which a functional representative (the centroid) of each cluster is identified. More precisely, the algorithm is initialized by fixing the number $k$ of clusters and by randomly selecting a set of $k$ initial centroids $\{\chi_1^{(0)}(t), \dots, \chi_k^{(0)}(t)\}$ among the curves of the dataset. Given this initial choice, the algorithm iteratively repeats the two basic steps mentioned above. Formally, at the $m$-th iteration of the algorithm, $m \geq 1$, the two following steps are performed:
Step 1 (cluster assignment step): each curve is assigned to the cluster with the nearest centroid at the $(m-1)$-th iteration, according to the distance $\hat{d}_p$. Formally, the $m$-th cluster assignment $C_i^{(m)}$ of the $i$-th statistical unit, for $i = 1, \dots, n$, can be written as follows:

$$C_i^{(m)} = \underset{l \in \{1, \dots, k\}}{\operatorname{arg\,min}}\; \hat{d}_p\bigl( X_i(t), \chi_l^{(m-1)}(t) \bigr).$$
Step 2 (centroid calculation step): the centroids at the $m$-th iteration are computed by solving, for each $l = 1, \dots, k$, the optimization problem:

$$\chi_l^{(m)}(t) = \underset{\chi \in L^2(I)}{\operatorname{arg\,min}} \sum_{i\,:\, C_i^{(m)} = l} \hat{d}_p^{\,2}\bigl( X_i(t), \chi(t) \bigr),$$

where $C_i^{(m)}$ is the cluster assignment of the $i$-th statistical unit at the $m$-th iteration. The algorithm stops when the same cluster assignments are obtained at two subsequent iterations; that is, the set of cluster assignments $\{C_1^{(\bar{m})}, \dots, C_n^{(\bar{m})}\}$ and the set of centroids $\{\chi_1^{(\bar{m})}(t), \dots, \chi_k^{(\bar{m})}(t)\}$ are considered the final solution of the algorithm if $\bar{m}$ is the minimum integer such that $C_i^{(\bar{m}+1)} \equiv C_i^{(\bar{m})}$ for all $i = 1, \dots, n$.
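To make the two steps concrete, the following minimal base-R sketch mirrors the iteration described above. It is for illustration only: simple_fkmeans, dist_fun and l2_dist are hypothetical names, and the pointwise-mean centroid update is the exact solution of the Step 2 problem for the L2 distance, not for the generalized Mahalanobis distance.
# Minimal sketch of functional k-means on curves stored as rows of a matrix
simple_fkmeans <- function( X, k, dist_fun, max.iter = 100 ) {
  n <- nrow( X )
  # Random initial centroids chosen among the curves of the dataset
  centroids <- X[sample( n, k ), , drop = FALSE]
  cl <- rep( 0, n )
  for ( m in 1:max.iter ) {
    # Step 1: assign each curve to the cluster of the nearest centroid
    new_cl <- apply( X, 1, function( x )
      which.min( sapply( 1:k, function( l ) dist_fun( x, centroids[l, ] ) ) ) )
    # Stop when two subsequent iterations give the same assignments
    if ( all( new_cl == cl ) ) break
    cl <- new_cl
    # Step 2: recompute the centroid of each non-empty cluster
    for ( l in 1:k )
      if ( any( cl == l ) )
        centroids[l, ] <- colMeans( X[cl == l, , drop = FALSE] )
  }
  list( cluster = cl, centers = centroids )
}
# Example with the (grid-approximated) L2 distance
l2_dist <- function( x, y ) sqrt( sum( ( x - y )^2 ) / length( x ) )
cl_sketch <- simple_fkmeans( rbind( X, Y ), k = 2, dist_fun = l2_dist )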
We apply the procedure to the merged samples $X_1(t), \dots, X_{n_1}(t)$ and $Y_1(t), \dots, Y_{n_2}(t)$ simulated previously, using the function gmfd_kmeans( FD, n.cl, metric, p ), where n.cl is the number of clusters. It returns a vector of cluster assignments and a vector (or a list of vectors) of centers, in addition to a plot of the clusters along with their empirical means.
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of the merged data
eig <- eigen( cov( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )
lambda <- eig$values
phi <- eig$vectors
# Functional data sample containing the merged samples
FD <- funData( t, rbind( X, Y ) )
kmeans_output <- gmfd_kmeans( FD, n.cl = 2, metric = "mahalanobis", p = 10^5 )
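The output list can then be inspected; as a sketch, assuming the returned components are named cluster and centers (hypothetical names; check the package documentation):
# Hypothetical inspection of the clustering result
table( kmeans_output$cluster )  # cluster sizes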