
A New Decision Making Method for Selection of Optimal Data Using the Von Neumann-Morgenstern Theorem

Abstract

The quality of the input data is amongst the decisive factors affecting the speed and effectiveness of recurrent neural network (RNN) learning. We present here a novel methodology to select optimal training data (those with the highest learning capacity) by approaching the problem from a decision making point of view. The key idea, which underpins the design of the mathematical structure that supports the selection, is to define first a binary relation that gives preference to inputs with higher estimator abilities. The Von Neumann-Morgenstern theorem (VNM), a cornerstone of decision theory, is then applied to determine the level of efficiency of the training dataset based on the probability of success derived from a purpose-designed framework based on Markov networks. To the best of the author’s knowledge, this is the first time that this result has been applied to data selection tasks. Hence, it is shown that Markov Networks, mainly known as generative models, can successfully participate in discriminative tasks when used in conjunction with the VNM theorem.

The simplicity of our design allows the selection to be carried out alongside the training. Hence, since learning progresses with only the optimal inputs, the data noise gradually disappears: the result is an improvement in the performance while minimising the likelihood of overfitting.

1 Introduction

The superiority of artificial neural networks (ANNs) in various tasks (classification, pattern identification, prediction, etc.) has led researchers to focus much of their efforts on studying the functioning of their components from a theoretical perspective, see Higham and Higham (2019), Smale et al. (2010). It is well known that ANNs have a high capacity for learning, the effectiveness of which depends on many factors. Amongst them, problem complexity strongly influences ANN performance, which depends not only on the ANN architecture, but also on accurate and sufficient training data and on the efficiency that datasets show throughout the process. Training in recurrent neural networks (RNNs) – ANNs that stand out for their high capacity for learning and recognition of temporal patterns – depends to a large extent on the size, type and structure of the selected training sets (Chen, 2006; Zhang and Suganthan, 2016; Zapf and Wallek, 2021): this is a central point that decisively influences both the speed and the ability to learn. During the training phase, where the unknown parameters are to be determined, the quality and learning capacity of the selected training datasets are of key importance (Mirjalili et al., 2012).

The main objective of this paper is to provide a robust methodology to select optimal training datasets (those with the highest learning capacity) that can be used in any context to maximise the performance of the trained models. This methodology has been designed to run in parallel with RNN learning so that, while the RNN learning evolves progressively with only the optimal training inputs, the data noise gradually disappears. This has a positive impact on the quality of the RNN results while minimising the likelihood of occurrence of overfitting.¹ The key idea in the design of the mathematical structure that supports this selection is to define a binary relation that gives preference to those datasets with higher estimator abilities by using Utility theory. A second contribution of our work is to have designed our methodology based on tools that have not been used previously for this purpose. This novelty lies in showing Markov Networks (MNs), widely known as generative models (Gordon and Hernandez-Lobato, 2020), as models with a real discriminative capacity when used in conjunction with the Von Neumann-Morgenstern theorem (VNM theorem), a cornerstone of Game Theory with an extensive background also in Decision Theory (Machina, 1982; Delbaen et al., 2011). In order to faithfully model the RNN reality, we have used the dynamic version of the MNs (TD-MRFs). It is worth noting the versatility of our proposal, which can also be applied to other data-driven methodologies provided that they are regulated by dynamical systems.

Markov Random Fields (MRFs) are also known as Markov Networks (MNs) in those contexts that require highlighting the undirected graph condition (Dynkin, 1984). MRF-type graphical models have experienced a resurgence in recent years. In their origins, they exclusively performed functions related to image processing, such as restoration or reconstruction. Later works such as García Cabello (2021) or Wang et al. (2022) have acknowledged their high predictive capability due to the equivalence between MRFs and Gibbs distributions, which provides an explicit expression of the prior likelihood after an appropriate choice of the energy functions. MRF solutions are widely regarded as generative models, as opposed to discriminative approaches, which are more related to tasks involving classification.

Regarding the literature review, the selection of optimal training sets has not been studied in a general framework so far. To this author’s knowledge, this is the first analysis that aims to provide guidance for a general context. Published papers have studied this issue either only in contexts of ANN classification tasks or in very precise scenarios (electrical, financial or chemical engineering), taking advantage of their specific techniques. Within the first category, the genetic algorithm (GA) is widely used as a tool to create high-quality training sets as the first step in designing robust ANN classifiers, see Reeves and Taylor (1998), Reeves and Bush (2001) or, more recently, the paper (Nalepa et al., 2018). Disadvantages of using GA, apart from slowness, include that it is computationally expensive and too sensitive to the initial conditions. In our proposal, however, the calculation of the probabilities associated with the utility (i.e. efficiency as estimators) of the inputs is very simple and therefore does not add computational cost.

Within the second category, in the paper (Zapf and Wallek, 2021), the authors made a comparison between existing methods in the area of chemical process modelling in order to split a training set from a given data set. In Wong et al. (2016), the authors proposed a data selection for statistical machine translation, based on recursive neural networks which can learn representations of bilingual sentences. The paper (Fernandez Anitzine et al., 2012) analyses, through a very context-specific instrument (ray-tracing), the optimal selection of the ANN training set in the context of predicting the received power/path loss in both outdoor and indoor links. In Kim (2006), the author proposes a GA approach for ANN instance selection in financial data mining.

As for the use of MNs/MRFs (prior probability) for problems which involve probability a posteriori, in the literature the terms “MNs/MRFs” and “discriminative” appear together only and exclusively to refer to discriminative random fields (DRFs) or, equivalently, conditional random fields (CRFs), both types of random fields which provide a posterior probability by definition.

The rest of the paper is structured as follows: preliminaries of Section 2 include basic knowledge on preference relations and VNM theorem, MNs and RNN functioning. Section 3 structures the steps to be followed to reach a solution to the proposed problem. The design of an abstract TD-MRF-based framework is performed in Section 4 which will subsequently allow the computation of prior probabilities associated with the VNM theorem. A TD-MRF structure for the input sets is also provided here. In Section 5, the expected utility theorem is applied after proving that the conditions for doing so are met. Section 6 highlights (and proves) the main results of our work. In Section 7, an example of the method application is developed. Section 8 finally concludes the paper.

2 Preliminaries

2.1 The Von Neumann-Morgenstern Theorem

When facing a situation of uncertainty (known as a lottery), there is a set X which contains all possible outcomes (results) after the process has been completed. Each of these outcomes has an associated probability p of occurrence. The tools for managing the idea of “preferring” one outcome over another and the “benefit associated with a preference” are the preference relation (see Jiang and Liao, 2022) and the utility function, respectively.

Mathematically, a preference relation is a binary relation $\succeq$ on a set X of possible outcomes which is rational, i.e. it satisfies the following properties:

  • completeness: for all $x_i, x_j \in X$, either $x_i \succeq x_j$, or $x_j \succeq x_i$, or $x_i \sim x_j$ (indifference), and

  • transitivity: for all $x_i, x_j, x_l \in X$, if $x_i \succeq x_j$ and $x_j \succeq x_l$, then $x_i \succeq x_l$.

The instrument that quantifies the benefit of each possible scenario is the utility function u: it assigns a numerical label to each outcome so that outcomes can be compared to make a decision.

The Von Neumann-Morgenstern expected utility theorem (VNM theorem) (Yang and Qiu, 2005; Pollak, 1967) is a simple and very efficient result in Decision Theory which allows one to compare numerically (through a utility function) the possible outcomes resulting from a process under uncertainty (Van Den Brink and Rusinowska, 2022). Under some axioms, the ordinal preference relation is representable by a cardinal (expected) utility function, known as the VNM utility function. Moreover, the VNM theorem shows that the expected utility of a lottery can be computed as a linear combination of the corresponding utilities by using the probabilities as linear coefficients:

Theorem 2.1 (VNM Expected Utility).

Let X be a set of outcomes and $\succeq$ a preference relation on X that satisfies the hypotheses of

  • Continuity. The following formulations of continuity are equivalent:

    • if each element $x_n$ of a sequence of outcomes satisfies $x_n \preceq x$, then $\lim_{n\to\infty} x_n \preceq x$;

    • $\forall x_1, x_2, x_3 \in X$ with $x_1 \succeq x_2 \succeq x_3$, $\exists p \in [0,1]$ such that $x_2 \sim [p : x_1;\ 1-p : x_3]$;

    • $\forall x_1, x_2, x_3 \in X$ with $x_1 \succeq x_2 \succeq x_3$, $\exists p \in [0,1]$ such that $x_2 \sim p\,x_1 + (1-p)\,x_3$;

  • Independence (convex combination): $x_i \succeq x_j \Leftrightarrow \alpha x_i + (1-\alpha)x_l \succeq \alpha x_j + (1-\alpha)x_l$, $\forall \alpha \in (0,1]$ and $\forall x_l \in X$.

Then there exists a continuous (utility) function $u : X \to [0,1]$ with the following properties:

  • 1. $x_1 \succeq x_2$ iff $u(x_1) \geq u(x_2)$;

  • 2. $u([p_1 : x_1;\ p_2 : x_2;\ \dots;\ p_m : x_m]) = \sum_{j=1}^{m} p_j\, u(x_j)$.
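Property 2 is directly computable: the expected utility of a lottery is the probability-weighted sum of the outcome utilities. A minimal sketch (the lottery probabilities and utility values below are illustrative, not taken from the paper):

```python
# Expected utility of a lottery [p1:x1; ...; pm:xm] as in Theorem 2.1:
# u(lottery) = sum_j p_j * u(x_j). All numbers are illustrative.

def expected_utility(probs, utilities):
    """VNM expected utility: linear combination of outcome utilities."""
    assert abs(sum(probs) - 1.0) < 1e-9, "lottery probabilities must sum to 1"
    return sum(p * u for p, u in zip(probs, utilities))

# A lottery over three outcomes with u(x1)=1.0, u(x2)=0.5, u(x3)=0.0
probs = [0.2, 0.5, 0.3]
utils = [1.0, 0.5, 0.0]
print(expected_utility(probs, utils))  # 0.2*1.0 + 0.5*0.5 + 0.3*0.0 = 0.45
```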

Many authors have shown, however, that in practice the axiom of independence is not fulfilled (the seminal paper (Machina, 1982) speaks of a “systematic violation in practice” of the axiom of independence, with the famous “Allais Paradox” as an example).

In the paper (Machina, 1982), it is also shown that there are weaker conditions that lead to the same results as those stated in the VNM theorem. There, continuity is replaced by the weak convergence topology, which is the weakest topology for which the expected utility functional is continuous (see also Delbaen et al., 2011). On the other hand, the axiom of independence is replaced by a Fréchet differentiability condition on the functional form which defines the preferences (Machina, 1982).

2.2 Basic Knowledge of Markov Networks (MRFs)

Let $X = \{X_{s_i} \mid s_i \in S,\ i \in \mathbb{N}\}$ be a set of random variables, where $X_{s_i}$ is the variable attached to site² $s_i \in S$. X is known as a stochastic process or a graphical model (GM) with underlying set of sites S. Both X and S are used interchangeably to represent a GM.

Graphical Models are commonly used to visually describe the probabilistic relationships amongst stochastic variables. Basic knowledge on GMs comprises the concepts of neighbourhood of a site and clique: sites $s_i$ and $s_j$ are adjacent, $s_i \sim s_j$, if there is at least one edge that links them. GMs are called connected if for any two sites there is a path – a sequence of edges – that connects them. The neighbourhood of a site $s_i$, denoted by $N(s_i)$, is the set of sites which are adjacent to $s_i$: $N(s_i) = \{s_j \in S \mid s_j \sim s_i\}$. Cliques are maximally connected subgraphs of the underlying graph S in the usual topological sense: they are connected and no more sites can be added while remaining connected. Markov random fields (MRFs) are GMs whose underlying graph is undirected.

Dynamic graphical models (DGMs) are the time-varying version of GMs. The set of dynamic stochastic variables will be denoted by $X^t = \{X_{s_i}^t \mid s_i \in S,\ i \in \mathbb{N}\}$: these are node-dynamic graphs (the ones considered here), although edge-dynamic graphs could also be taken into account, as could the possibility of both S and E varying over time. TD-MRFs are defined from a generalization of the usual Markovian property. It states that the global probability of occurrence may be deduced from a local probability, where “local” refers to the neighbouring system: $X^t$ is said to be a TD-MRF if

$$P[X_{s_i}^t = x_{s_i} \mid X_{S\setminus\{s_i\}}^t = x] = P[X_{s_i}^t = x_{s_i} \mid X_{N(s_i)}^t = x],$$

where $P[X^t] = P[\{X_{s_i}^t \mid s_i \in S,\ i \in \mathbb{N}\}] = \{P[X_{s_i}^t = x_{s_i}] \mid s_i \in S,\ i \in \mathbb{N}\}$ denotes the joint distribution of $X^t$. The Hammersley-Clifford theorem, under the “positivity condition”, establishes the equivalence between an MRF and a Gibbs distribution, i.e. a joint distribution function which may be expressed in terms of functions ψ of $X^t = \{X_{s_i}^t \mid s_i \in S,\ i \in \mathbb{N}\}$ that take values only on the cliques C, written as $\psi_C(X^t)$. The energy functions $\psi_C(X^t)$ determine the joint distribution in a clear-cut manner:

$$P[X^t] = \frac{1}{Z}\exp\Big[-\sum_{\text{cliques }C}\psi_C(X^t)\Big] = \exp\Big[-\sum_{\text{cliques }C}\psi_C(X^t) - \ln Z\Big],$$

such that all $\psi_C$ depend on the clique C but have a common domain. When the former expression is transformed into

(1)
$$P[X^t] = \frac{1}{Z}\prod_{\text{cliques }C}\exp[-\psi_C(X^t)],$$

the energy functions $\psi_C(X^t)$ are called clique potentials when viewed as factors ($\exp[-\psi_C(X^t)]$). The function Z is known as “the partition function”, which acts as a normalizing constant to ensure that the distribution sums to 1. We shall refer to (1) as a Gibbs distribution.
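As a concrete illustration of (1), the following sketch evaluates a Gibbs distribution over a small discrete state space from clique energy functions. The two-clique factorisation and the quadratic energies are assumptions of this example, not the paper's model:

```python
import math
from itertools import product

# Gibbs distribution P[x] = (1/Z) * prod_C exp(-psi_C(x)) over a toy
# state space of three binary variables and two (assumed) cliques.
states = list(product([0, 1], repeat=3))

def psi_c1(x):           # energy on clique {x0, x1} (illustrative)
    return (x[0] - x[1]) ** 2

def psi_c2(x):           # energy on clique {x1, x2} (illustrative)
    return (x[1] - x[2]) ** 2

def unnormalised(x):
    return math.exp(-psi_c1(x)) * math.exp(-psi_c2(x))

Z = sum(unnormalised(x) for x in states)   # the partition function
P = {x: unnormalised(x) / Z for x in states}

print(sum(P.values()))   # sums to 1 by construction
```

Low-energy configurations (here, those with equal components) receive the highest probability, which is exactly the role of the clique potentials as factors.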

2.3 Functioning of Recurrent Neural Networks (RNNs)

Neural networks (NNs) have a general functional definition as a composition of parametric functions which disaggregates the linear component from the non-linear activation function (see García Cabello, 2023). Recurrent Neural Networks, RNNs (Chou et al., 2022; Zhang et al., 2014), are a particular case of NNs which operate on time sequences and exhibit a special ability for learning long-term dependencies. Their functioning relies on an intermediary layer $h = (h_0, \dots, h_T)$ of hidden states $h_t$, in such a way that the data cycle through a loop to this layer. The output is reached by recursively applying the following functions:

(2)
$$h_1 = \sigma(W_{xh}x_1 + W_{hh}h_0 + b_h), \qquad h_t = \sigma(W_{xh}x_t + W_{hh}h_{t-1} + b_h), \qquad z_t = \tilde{\sigma}(W_{hz}h_t + b_z),$$
for weight matrices $W_{xh}, W_{hh}, W_{hz}$ and bias vectors $b_h, b_z$, where $z_t$ is the final output and $\sigma, \tilde{\sigma}$ are point-wise nonlinearities (activation functions which are applied component-wise). In RNNs, the activation $\tilde{\sigma}$ is the sigmoid, monotonically increasing with range $(0,1)$. For any RNN input $x_t$, its corresponding output is $z[x_t] \in (0,1)$, denoted as $z_t$ for simplicity.

The loss function loss is chosen depending on the RNN task: usually it is the Mean Square Error, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(z_i - y_i)^2$, often used as $\mathrm{SE} = \frac{1}{2n}\sum_{i=1}^{n}(z_i - y_i)^2$ for cancelling the constant when computing the gradient, where $z_i$ and $y_i$ represent the RNN output and the target value, respectively. RNN functioning is shown in Fig. 1.
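The recursion (2) can be sketched as follows. The dimensions, random weights and dummy targets are illustrative stand-ins; a real RNN would learn the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative dimensions: d input features, m hidden units, scalar output.
d, m = 4, 3
W_xh = rng.normal(size=(m, d))   # input-to-hidden weights
W_hh = rng.normal(size=(m, m))   # hidden-to-hidden weights
W_hz = rng.normal(size=(1, m))   # hidden-to-output weights
b_h, b_z = np.zeros(m), np.zeros(1)

def rnn_forward(xs, h0=None):
    """Equation (2): h_t = sigma(W_xh x_t + W_hh h_{t-1} + b_h),
    z_t = sigmoid(W_hz h_t + b_z). Returns the sequence of outputs z_t."""
    h = np.zeros(m) if h0 is None else h0
    zs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # sigma: tanh here
        zs.append(sigmoid(W_hz @ h + b_z)[0])    # sigma-tilde: sigmoid
    return zs

xs = [rng.normal(size=d) for _ in range(5)]       # a length-5 input sequence
zs = rnn_forward(xs)
loss = sum((z - y) ** 2 for z, y in zip(zs, [0.5] * 5)) / len(zs)  # MSE vs. dummy targets
print(all(0.0 < z < 1.0 for z in zs))  # True: sigmoid outputs lie in (0, 1)
```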

Fig. 1

RNN learning process.

3 Problem Formulation

In this section we will develop the theoretical framework that will enable Markov-network methodology to perform discriminative tasks based on prior probability and the VNM Theorem 2.1. Specifically, our aim is to design a mathematical model which enables MNs to identify the optimal RNN training sets, i.e. those that produce better estimates in forecasting tasks. For a better understanding, we will make reference to a real example of an RNN forecasting process:

(Figure: an example RNN forecasting process.)

Recall that, among neural networks, Recurrent Neural Networks (RNNs) stand out for their predictive abilities, based on their potential in processing temporal data. Thus, we face a time-dependent learning process using RNNs, where the superscript t denotes time in all cases (no distinction will be made between matrices and their transposes, in order to avoid confusion between transpose t and time t).

As is well known, in RNN prediction tasks, training sets are fed with temporal sequences composed of pairs of previous data of the form (input, labels), with the objective of predicting the future value, referred to as the future target value, y. Thus, $X^t$ is the time-varying variable which needs to be estimated, $Z^t$ is the set of RNN outputs and $Y^t$ the set of labels (i.e. past target values for training). In terms of the illustrative example: to predict the electricity price one month ahead (target value y, $t=+1$), the training sets are made up of temporal sequences $(x_i^t, y_i^t)$ whose first component contains different factors that influence electricity prices (fuel prices, transmission and distribution costs, weather conditions, etc.) and whose second component is the electricity price, both $x_i^t$, $y_i^t$ at site i and prior to the present time ($t = -9, \dots, -2, -1, 0$).

Let $I_n$ be the (time-dependent) input RNN dataset formed by n vectors:

$$I_n = \{x_i^t = (x_i^{1,t}, x_i^{2,t}, \dots, x_i^{d,t}),\ i = 1, \dots, n\},$$
each of which contains d features: $x_i^{j,t} \in \mathbb{R}$, $i = 1,\dots,n$; $j = 1,\dots,d$. $Z^t[I_n]$ will stand for the set of outputs generated after the completed RNN process corresponding to the set of inputs $I_n$.

The objective is to select, from the n inputs in $I_n$, the $D<n$ vectors that will make up the training set: the selection criterion is to choose those which produce the best estimates in the RNN learning process. The choice will be conducted by using only prior probability – instead of posterior – and the VNM Theorem 2.1.

We will list the steps to be taken:

  • Main lottery (in In). The key idea is to adapt the scenario of the RNN learning process (inputs, outputs, labels, target value) to the situation described in the VNM-Theorem 2.1, and to rely on the TD-MRF methodology for computing the probabilities. To do so, we consider the process of selecting D<n vectors as a lottery whose outcomes are the D vectors we have chosen for the training set (main lottery). Each of these D vectors has its own likelihood of being chosen.

  • Consider the RNN process as composed of several uncertain processes, $RNN_i$, one for each of the input vectors $x_i^t$ that make up the set $I_n$, $i=1,\dots,n$ (Fig. 1). After the RNN learning is completed, a set of n outputs $Z^t[I_n]$ is generated. Note that, since the RNN is deterministic (the same set of inputs will produce the same set of outputs), input $x_i^t$ has a uniquely associated output $z_i^t$.

  • Secondary lottery (for each $z_i^t \in Z^t[I_n]$). A second lottery arises for each output $z_i^t \in Z^t[I_n]$ when the selection criterion is applied: how good the estimate $z_i^t$ is. If M denotes the threshold below which the loss function is considered acceptable, the secondary lottery is whether $loss(z_i^t) < M$ or not.

    To ensure that our selection rule is applied, we define a preference relation on $I_n = \{x_i^t,\ i=1,\dots,n\}$. Due to the unique existing direction from input $x_i^t$ to RNN output $z_i^t$, the preference relation can be considered as defined on both the set of inputs $I_n$ and the set of outputs $Z^t[I_n]$. Moreover, given the evident bijection between outcome $z_i^t$ and its associated loss $loss(z_i^t)$, the distinction that many authors make between defining a preference relation on the set of outcomes X or defining it on the set of lotteries over X, named ΔX, does not apply in our case:

    Definition 3.1.

    For $z_i^t, z_j^t \in Z^t[I_n]$, $z_i^t \prec z_j^t \Leftrightarrow loss(z_j^t) < loss(z_i^t)$, i.e. $z_j^t$ is preferred to $z_i^t$ if $z_j^t$ is a better estimate (smaller associated loss). Indifference comes with equality: for $z_i^t, z_j^t \in Z^t[I_n]$, $z_i^t \sim z_j^t \Leftrightarrow loss(z_i^t) = loss(z_j^t)$.

    In later sections we will show that this preference relation verifies the properties necessary for the application of the VNM theorem.

  • We thus apply Theorem 2.1 to the main lottery: the expected utility of the training set Tr is given by

    (3)
    $$u[Tr] = u[\{x_1^t,\dots,x_D^t\}] = \sum_{i=1}^{D} p_o[X^t = x_i^t]\cdot u(x_i^t).$$

  • In order to compute the (prior) probabilities $p_o[X^t = x_i^t]$, we shall equip the initial RNN data set with an undirected graph structure – with appropriate node and edge definitions – which will later be shown to be an MRF by application of the central Theorem 4.4. This will be the development of Section 4.

  • In order to compute the utility $u(x_i^t)$, we shall apply Theorem 2.1 again to the secondary lottery. This will be performed in Sections 5 and 6.

4 The Abstract TD-MRF-Based Framework

Here, we first design an abstract graph-based framework that shall provide a model for a dynamic context of k sites and r filters for data discrimination (corresponding to r cliques), which will be shown to be a TD-MRF under certain mild conditions. Such a TD-MRF will be the core in the computation of the prior probabilities $p_o[X^t = x_i^t]$ of the VNM Theorem 2.1.

Let S be a set of k sites $s_i$, $S = \{s_i \mid i = 1, 2, \dots, k\}$, such that the variable may take different values depending on the site, $X_{s_i}^t$. For each site $s_i$, the variable $X_{s_i}^t$ can be disaggregated into a set of d characteristics which fully describe $X_{s_i}^t$ at the instant of time t: $X_{s_i}^t = (X_{s_i}^{1,t}, X_{s_i}^{2,t}, \dots, X_{s_i}^{d,t})$, with lower case $x_{s_i}^t = (x_{s_i}^{1,t}, x_{s_i}^{2,t}, \dots, x_{s_i}^{d,t})$ for the feature vector (i.e. a numerical vector corresponding to a realisation of the variable). We shall use $x_{s_i}^t = (x_{s_i}^{1,t}, \dots, x_{s_i}^{d,t})$ and $x_i^t = (x_i^{1,t}, \dots, x_i^{d,t})$ interchangeably, for both upper and lower case. In compact form, $x_i^{j,t} \in \mathbb{R}$, $i = 1, \dots, k$; $j = 1, \dots, d$.

Remember that two random variables are equivalent if they have identical distributions. Then, we will define the edges by equivalently defining the neighbourhood N of a site: $N(X_{s_i}^t) = \{X_{s_j}^t \mid X_{s_j}^t, X_{s_i}^t \text{ are equivalent}\} = \{X_{s_j}^t \mid P[X_{s_j}^t \leq x] = P[X_{s_i}^t \leq x]\ \forall x\}$.

Definition 4.1 (TD-DGM).

A DGM $(S^t, N^t)$ is defined as follows: sites are the elements of the set $S^t$ through the identification $s_i^t \equiv X_{s_i}^t$. Edges are defined through the neighbourhood N of a site $s_i^t$ as $N(X_{s_i}^t) = \{X_{s_j}^t \mid P[X_{s_j}^t \leq x] = P[X_{s_i}^t \leq x]\ \forall x\}$.
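Under Definition 4.1, edges join variables with identical distributions, so neighbourhoods can be built by simple grouping. A sketch over empirical distributions (representing each site variable by the multiset of its observed values is an assumption of this illustration):

```python
from collections import defaultdict

# Group site variables into neighbourhoods by identical (empirical)
# distributions, in the spirit of Definition 4.1. Sample data is illustrative.

def neighbourhoods(samples_by_site):
    groups = defaultdict(list)
    for site, samples in samples_by_site.items():
        # Two sites are neighbours iff their observed value multisets match.
        groups[tuple(sorted(samples))].append(site)
    return list(groups.values())

samples = {
    "s1": [0.1, 0.2, 0.2],
    "s2": [0.2, 0.1, 0.2],   # same empirical distribution as s1 -> same group
    "s3": [0.7, 0.8, 0.9],
}
print(neighbourhoods(samples))  # [['s1', 's2'], ['s3']]
```

This also shows why the definition removes duplicates (Remark 4.2): any two identical inputs necessarily fall into the same group.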

Remark 4.2 (Clean data).

It is worth highlighting that Definition 4.1 (which identifies all random variables having identical probability distributions) avoids duplicates. This is particularly important when applied to the graphical model resulting from an RNN input dataset (clean data).

Recall that marginal distribution is also known as prior probability in contrast with the posterior distribution (the conditional one). From the former definition, sites in the same neighbourhood have the same prior probability. Moreover,

Proposition 4.3.

Sites which belong to the same neighbourhood have identical probability a posteriori.

Proof.

Let $s_i$, $s_j$ be two sites which belong to the same neighbourhood. By Bayes’s theorem, the posterior distribution of each site is determined by its prior distribution and the likelihood; since, by Definition 4.1, sites in the same neighbourhood have identical prior distributions, their posterior distributions coincide.  □

The following theorem then proves that the DGM defined in Definition 4.1 is a TD-MRF by showing that the Markov condition holds:

Theorem 4.4 (The TD-MRF model).

DGMs as in Definition 4.1 are TD-MRFs.

Proof.

We shall prove that the local dynamic Markov property is verified, i.e. the probability of $X_{s_i}$ conditioned on the remaining random variables, $X_{s_j}$, $j \neq i$, is equal to the probability of $X_{s_i}$ conditioned on the random variables of the same neighbouring system, $X_{N(s_i)}$. The following development is considered at a specific time instant $t_0$. By assuming that the $s_i$-site estimation takes value $x_{s_i}^{t_0}$ while the rest do not (thus, their estimation is some $x \neq x_{s_i}^{t_0}$), we partition the set S as $S\setminus\{s_i\} = N(s_i)\setminus\{s_i\} \cup \mathrm{Not}(s_i)$, where $\mathrm{Not}(s_i)$ denotes those sites which are not in $N(s_i)$. Thus,

$$P[X_{s_i}^{t_0} = x_{s_i}^{t_0} \mid X_{S\setminus\{s_i\}}^{t_0} = x \neq x_{s_i}^{t_0}]
= \frac{P[(X_{s_i}^{t_0} = x_{s_i}^{t_0}) \cap (X_{S\setminus\{s_i\}}^{t_0} = x \neq x_{s_i}^{t_0})]}{P[X_{S\setminus\{s_i\}}^{t_0} = x \neq x_{s_i}^{t_0}]}
= \frac{P[(X_{s_i}^{t_0} = x_{s_i}^{t_0}) \cap (X_{N(s_i)}^{t_0} = x \neq x_{s_i}^{t_0})]}{P[X_{N(s_i)}^{t_0} = x \neq x_{s_i}^{t_0}]}
= P[X_{s_i}^{t_0} = x_{s_i}^{t_0} \mid X_{N(s_i)}^{t_0} = x \neq x_{s_i}^{t_0}].$$
 □

Insofar as the distribution function is the tool used to make estimates, previous Theorem 4.4 provides a joint measure of how close the variable is to taking a particular value.

Corollary 4.5.

The corresponding Gibbs (joint) probability distribution at a time instant t provides that the likelihood of reaching a concrete value x is $P[X^t = x] = \frac{1}{Z}\prod_{j=1}^{r} e^{-\psi_{C_j}(X^t = x)}$.

Remark 4.6.

Under Definition 4.1, neighbours and cliques are essentially the same and equal to the set of random variables with identical prior distribution. Moreover, according to Proposition 4.3, variables in a clique have also the same posterior distribution.

Each clique has its own common estimation function:

Proposition 4.7 (The cliques).

Each clique of the TD-MRF has its own estimation function given by the clique potentials $e^{-\psi_{C_j}([\,\cdot\,])}$, $j = 1, \dots, r$: $P_{C_j}[X^t = x] = \frac{1}{Z}e^{-\psi_{C_j}(X^t = x)}$.

Proof.

It is straightforward from definition of clique.  □

In discrimination/classification works, the commonly used probability is the conditional or posterior probability $P[X^t = x \mid y] = \frac{1}{Z}\prod_{j=1}^{r} e^{-\psi_{C_j}(X^t = x \mid y)}$, where x stands for the RNN input, y represents the corresponding target value and $\psi_{C_j}(X^t = x \mid y)$ means that the energy function $\psi_{C_j}$ only assigns to those inputs x that correspond to y.

4.1 Visual Flowchart of the TD-MRF Operational Process

Inputs for a TD-MRF are specific values of the stochastic variable $X^t$ at a particular time instant t and a site $s_i$, $x_{s_i}^t$. Depending on the context, $X^t$ can be considered as disaggregated into a set of d features for each site $s_i$. Thus, each input is $x_i^t = (x_i^{1,t}, x_i^{2,t}, \dots, x_i^{d,t})$, $x_i^{j,t} \in \mathbb{R}$, $i = 1, \dots, k$, $j = 1, \dots, d$, such that each univariate component $x_i^{j,t} \in \mathbb{R}$ can be regarded as the input of a single-variable process, giving d processes. The TD-MRF operational process is as follows: by assuming there are r cliques, $C_1, \dots, C_r$, each input is processed in a clique $C_j$, $j = 1, \dots, r$, through the corresponding clique potential $e^{-\psi_{C_j}([\,\cdot\,])}$, such that the output is the estimate made by the corresponding clique (according to Proposition 4.7). These are then aggregated into a final output by means of the expression given in Corollary 4.5. If, on the other hand, the variable is not considered disaggregated, multivariate energy functions will process the information in a single multivariate execution.

Fig. 2

TD-MRF operational process.

Figure 2 provides a visual representation of a TD-MRF operation in the univariate scenario, where the energy functions ψ are of Gaussian type, i.e. $\psi_{C_j} = \frac{([\,\cdot\,] - \mu_j)^2}{\sigma_j^2}$, $j = 1, \dots, r$. Note that for $x_i^t = (x_i^{1,t}, x_i^{2,t}, \dots, x_i^{d,t})$, the corresponding probability $p_o[x_i^t]$ is $p_o[x_i^t] = (p_o[x_i^{1,t}], p_o[x_i^{2,t}], \dots, p_o[x_i^{d,t}])$.
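A univariate sketch of this operation with Gaussian-type energies, in the spirit of Fig. 2. The clique parameters and the discretised state space are assumptions of the example, not values from the paper:

```python
import math

# Prior p_o[x] from r Gaussian-type clique potentials
# psi_j(x) = (x - mu_j)^2 / sigma_j^2, aggregated multiplicatively as in
# Corollary 4.5 over an (illustrative) discretised state space.

clique_params = [(0.0, 1.0), (0.5, 2.0)]          # (mu_j, sigma_j), assumed

def energy(x):
    return sum((x - mu) ** 2 / sigma ** 2 for mu, sigma in clique_params)

grid = [i / 10 for i in range(-30, 31)]           # discretised states
Z = sum(math.exp(-energy(x)) for x in grid)       # partition function

def p_o(x):
    return math.exp(-energy(x)) / Z

print(abs(sum(p_o(x) for x in grid) - 1.0) < 1e-9)  # normalised prior
```

States close to the clique means receive a higher prior, which is how each clique acts as a "filter" on the inputs.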

Moreover, the operation of an MRF applies equally to the calculation of the following probabilities:

$$P_{C_j}[X^t \leq x^t] = \sum_{x \leq x^t} P[X^t = x], \qquad
P_{C_j}[x^{pr} < X^t \leq x^{im}] = \sum_{x \leq x^{im}} P[X^t = x] - \sum_{x \leq x^{pr}} P[X^t = x], \qquad
P_{C_j}[X^t > x^t] = 1 - P_{C_j}[X^t \leq x^t] = 1 - \sum_{x \leq x^t} P[X^t = x].$$

4.2 TD-MRF Structure for the RNN Input Set

Our goal here is to provide a graph structure (which will become a TD-MRF structure according to Theorem 4.4) for the set $I_n$ of RNN inputs, $I_n = \{x_i^t = (x_i^{1,t}, x_i^{2,t}, \dots, x_i^{d,t}),\ i = 1, \dots, n\}$, each $x_i^{j,t} \in \mathbb{R}$, $i = 1, \dots, n$; $j = 1, \dots, d$.

To achieve this, the steps to follow are listed below:

  • 1. The pre-processing phase. Before training/testing an RNN, data must undergo some preprocessing steps to remove duplicates and unnecessary information and to handle missing data. Depending on the context, data must also be normalised with feature scaling. We shall assume that RNN input datasets have completed this phase.

  • 2. The input set graph structure. We endow the RNN data set with an undirected graph structure, according to Definition 4.1, where k=D:

    • sites $s_i^t$ are the local version of the random variable $X^t$, i.e. the random variable $X_i^t$ of which the inputs $x_i^t$ are a realisation: $s_i^t \equiv X_{s_i}^t$.

    • The neighbourhood of a site is defined as

      $N(X_{s_i}^t) = \{X_{s_j}^t \mid X_{s_j}^t, X_{s_i}^t \text{ are equivalent}\} = \{X_{s_j}^t \mid P[X_{s_j}^t \leq x] = P[X_{s_i}^t \leq x]\ \forall x\}$,
      in the sense of Definition 4.1. According to Remark 4.2, this definition is intended to discard duplicates in the datasets.

  • 3. The corresponding Gibbs distribution. According to Theorem 4.4, the graphical model formed by the RNN input data, viewed as a time-varying random variable $X^t$, is a TD-MRF, with Gibbs distribution $P[X^t] = \frac{1}{Z}\prod_{\text{cliques }C}\exp[-\psi_C(X^t)]$.

  • 4. Likelihood of occurrence of an output. Recall that, from Definition 4.1, neighbours are equal to cliques and both consist of those random variables which have equal prior distributions (and, in consequence, identical posterior distributions). Hence, a probability $p_o[X^t = x_i^t]$ is biunivocally determined for each input $x_i^t$. We also refer to Fig. 2 to reproduce the operation of a TD-MRF in a single multivariate execution.

    Note that the deterministic nature of RNN, which always assigns the same zit to the same xit, allows also defining the probability of occurrence of an RNN output po[zit] as

    $$p_o[z_i^t] := p_o[X^t = x_i^t].$$

5 Application of the VNM Utility Theorem

In this section, we will further develop equation (3),

$$u[Tr] = u[\{x_1^t,\dots,x_D^t\}] = \sum_{i=1}^{D} p_o[X^t = x_i^t]\cdot u(x_i^t),$$
by giving the explicit calculation of $u(x_i^t)$, $\forall i, t$. To do so, we will again apply the VNM theorem to the secondary lottery.

Recall that in the preceding sections we have considered a second lottery which arises naturally when we test how good the output $z_i^t \in Z^t[I_n]$ is as an estimate. If M is the threshold below which the loss function is considered acceptable, the secondary lottery is whether $loss(z_i^t) < M$ or not. With the intuitive notation of the paper (Machina, 1982), our secondary lottery is a probability distribution function over the interval $[0,M]$ (the set of all lotteries over $[0,M]$ is denoted in Machina (1982) by $D[0,M]$) and the preference functional $V(\cdot)$ on $D[0,M]$ in our case is $loss(\cdot)$. Note that Definition 3.1 of preference over the set $Z^t[I_n]$ can additionally be considered over $I_n$, via the deterministic assignment $x_i^t \mapsto z_i^t$: for $x_i^t, x_j^t \in I_n$,

$$x_i^t \prec x_j^t \Leftrightarrow z_i^t \prec z_j^t \Leftrightarrow loss(z_j^t) < loss(z_i^t),$$
where the loss function is considered in squared form, $loss(z_i^t) = (z_i^t - y_i)^2$, which in the context of real positive numbers is equivalent to $loss(z_i^t) = |z_i^t - y_i|$. The objective here is to prove that this binary relation satisfies the necessary conditions that enable the application of the Von Neumann-Morgenstern (VNM) theorem.

First of all, it has to be shown that it is a preference relation:

Proposition 5.1.

The relation defined in Definition 3.1 is a preference relation.

Proof.

The standard axiom for a preference relation is rationality which includes both completeness and transitivity.

  • Completeness: for all $z_i^t, z_j^t \in Z^t[I_n]$, either $z_i^t \succeq z_j^t$ or $z_j^t \succeq z_i^t$. This property comes from the fact that $\mathbb{R}$ has a total order, applied to $|z_i^t - y_i|, |z_j^t - y_i| \in \mathbb{R}$.

  • Transitivity: for all $z_i^t, z_j^t, z_l^t \in Z^t[I_n]$, if $z_i^t \succeq z_j^t$ and $z_j^t \succeq z_l^t$, then $z_i^t \succeq z_l^t$. This property holds by the transitivity of the order of $\mathbb{R}$: $|z_i^t - y_i| \leq |z_j^t - y_i| \leq |z_l^t - y_i| \Rightarrow |z_i^t - y_i| \leq |z_l^t - y_i|$.

 □

Moreover, the former preference relation satisfies the following conditions:

Proposition 5.2.

The preference relation in Definition 3.1 satisfies:

  • Continuity: if each element $z_n^t$ of a sequence of outcomes satisfies $z_n^t \preceq z$, then $\lim_{n\to\infty} z_n^t \preceq z$.

  • Fréchet differentiability: the functional form which defines the preferences (the loss function) must be Fréchet differentiable.

Proof.

Continuity: let us prove that if each element $z_n^t$ of a sequence of outcomes satisfies $z_n^t \preceq z$, then it must be that $\lim_{n\to\infty} z_n^t \preceq z$. First, we have that

$$(z_n^t - y_n)^2 = (z_n^t)^2 - 2 z_n^t y_n + y_n^2 \;\Rightarrow\; \lim_{n\to\infty}(z_n^t - y_n)^2 = \lim_{n\to\infty}(z_n^t)^2 - 2\lim_{n\to\infty} z_n^t \cdot \lim_{n\to\infty} y_n + \lim_{n\to\infty} y_n^2 = \Big(\lim_{n\to\infty} z_n^t - \lim_{n\to\infty} y_n\Big)^2.$$

Let us suppose that $z_n^t$ is a sequence of outcomes such that $z_n^t \preceq z$, that is, output z is preferred to the outputs $z_n^t$. This means that $(z - y)^2 < (z_n^t - y_n)^2$, where y, $y_n$ are the corresponding target values for the inputs that correspond to z and $z_n^t$, respectively.

By assuming that both limits exist, the inequality $(z - y)^2 < (z_n^t - y_n)^2$ is preserved in “less than or equal” form: $(z - y)^2 = \lim_{n\to\infty}(z - y)^2 \leq \lim_{n\to\infty}(z_n^t - y_n)^2$. Hence,

$$(z - y)^2 \leq \lim_{n\to\infty}(z_n^t - y_n)^2 = \Big(\lim_{n\to\infty} z_n^t - \lim_{n\to\infty} y_n\Big)^2 \;\Rightarrow\; \lim_{n\to\infty} z_n^t \preceq z.$$

Fréchet differentiability: recall that the Fréchet derivative in finite-dimensional spaces is the usual derivative. Thus, the loss function $loss(z_i^t) = (z_i^t - y)^2$ verifies this hypothesis.  □

Therefore, the preference relation stated in Definition 3.1 verifies the conditions required for the application of the VNM Theorem 2.1. Thus, there exists an utility function u:In[0,1] which quantifies the preferences: x1tx2t iff u(x1t)u(x2t). Moreover, the Von Neumann-Morgenstern Expected Utility theorem also provides guidance for computing the utility of the set of inputs through the explicit formula given in equation (3):

$$u[\mathrm{In}]=u[\{x_1^t,\dots,x_D^t\}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t).$$
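In code, this expected-utility computation is a plain weighted sum. The sketch below is a minimal illustration; the function and argument names (`expected_utility`, `probs`, `utils`) are our own, not notation from the theorem.

```python
def expected_utility(probs, utils):
    """Von Neumann-Morgenstern expected utility of a lottery: sum_i p_i * u_i.

    probs -- occurrence probabilities p_o[X^t = x_i^t] of the D inputs
    utils -- utilities u(x_i^t), each in [0, 1]
    """
    if len(probs) != len(utils):
        raise ValueError("probs and utils must have the same length")
    return sum(p * u for p, u in zip(probs, utils))

# Utility of a small input set: only the inputs with non-zero utility contribute.
u_in = expected_utility([0.2, 0.3, 0.5], [1.0, 0.0, 1.0])  # 0.2*1 + 0.5*1 = 0.7
```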

6Main Results

The objective of this section is to state and prove the main results of our proposal. First, we define the level of efficiency $\mathrm{Eff}[\mathrm{In}]$ of an input set $\mathrm{In}$. Theorem 6.2 then proves the existence of a formula for determining the efficiency of a training set, while Theorem 6.3 gives an explicit expression for computing $\mathrm{Eff}[\mathrm{In}]$.

Definition 6.1.

We define the level of efficiency of an input set $\mathrm{In}$, $\mathrm{Eff}[\mathrm{In}]$, as the expected utility of the training set $\mathrm{Tr}$: $\mathrm{Eff}[\mathrm{In}]:=u[\{x_1^t,\dots,x_D^t\}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t)$.

Theorem 6.2 (Existence).

For any input set $\mathrm{In}$, there exists a utility function $u\colon \mathrm{In}\to[0,1]$ which allows its level of efficiency $\mathrm{Eff}[\mathrm{In}]$ to be quantified, and this can be computed as

$$\mathrm{Eff}[\mathrm{In}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t).$$

Proof.

By Proposition 5.2, the preference relation stated in Definition 3.1 verifies the necessary conditions for applying the VNM Theorem 2.1. According to this, there exists a utility function $u\colon \mathrm{In}\to[0,1]$ which quantifies the preferences over input vectors: $x_1^t \succeq x_2^t$ iff $u(x_1^t)\geq u(x_2^t)$.  □

Theorem 6.3 (How to compute the level of efficiency of $\mathrm{In}$).

We assume that all inputs are equally distributed. Then, for any RNN input set $\mathrm{In}$, its level of efficiency can be explicitly computed as

$$\mathrm{Eff}[\mathrm{In}]=p\cdot\sum_{i=1}^{D}p_o[X^t=x_i^t],\quad\text{where } p \text{ is the probability that } |z_i^t-y|<M\ \ \forall z_i^t.$$

Proof.

We start from the expression $\mathrm{Eff}[\mathrm{In}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t)$ derived from Theorem 6.2. On the one hand, the explicit computation of the probabilities $\{p_o[X^t=x_i^t]\}_{i=1}^{D}$ is given by the TD-MRF operational process defined in Section 4 (see Fig. 2, which describes this visually). On the other hand, in order to compute the utilities $\{u(x_i^t)\}_{i=1}^{D}$, we apply the VNM Theorem 2.1 again, this time to the secondary lottery. Recall that the secondary lottery for each $z_i^t$ is whether $|z_i^t-y|<M$ or not, for a given threshold $M$ below which the loss function is considered acceptable. Recall also that the probability of occurrence of an RNN output $p_o[z_i^t]$ is $p_o[z_i^t]=p_o[X^t=x_i^t]$. To overcome the continuous space of outcomes $\{|z_i^t-y|\in\mathbb{R}\}_{i=1}^{D}$ for the lottery associated with each $z_i^t$, we represent it in binary form:

$$\text{outcome }1 \text{ if } |z_i^t-y|<M, \qquad \text{outcome }0 \text{ if } |z_i^t-y|\geq M.$$
Since the utility function $u$ represents a set of outcomes in the sense of the VNM Theorem 2.1 ($x_1\succeq x_2$ iff $u(x_1)\geq u(x_2)$), we can conclude that
$$\text{outcome }1 \text{ if } |z_i^t-y|<M \;\Rightarrow\; u(1)=1, \qquad \text{outcome }0 \text{ if } |z_i^t-y|\geq M \;\Rightarrow\; u(0)=0.$$
Then, by applying the VNM Theorem 2.1 to each secondary lottery, one has that
$$u(x_i^t)=u(z_i^t)=p[\text{outcome }1]\cdot u(1)+p[\text{outcome }0]\cdot u(0)=p[\text{outcome }1]\cdot 1=p[\text{outcome }1],\qquad i=1,\dots,D.$$
The formula then follows by substitution:
$$\mathrm{Eff}[\mathrm{In}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t)=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot p[\text{outcome }1]=p[\text{outcome }1]\cdot\sum_{i=1}^{D}p_o[X^t=x_i^t]=p\cdot\sum_{i=1}^{D}p_o[X^t=x_i^t].$$
 □
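The secondary lottery in the proof reduces to a Bernoulli experiment whose utility equals its success probability. A minimal sketch, with hypothetical function names (`outcome`, `secondary_utility`) of our own:

```python
def outcome(z, y, M):
    # Binary encoding of the secondary lottery: 1 if the loss |z - y| is acceptable.
    return 1 if abs(z - y) < M else 0

def secondary_utility(p_success):
    # u(x_i^t) = p[outcome 1] * u(1) + p[outcome 0] * u(0) = p[outcome 1]
    return p_success * 1 + (1 - p_success) * 0
```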

7A Case Application

This section develops an example of the application of the method. In order to make a choice between two real data sets, their levels of efficiency will be computed. As stated in Definition 6.1, the level of efficiency of an input set, $\mathrm{Eff}[\mathrm{In}]$, measures the learning capacity of its inputs on the basis of the preference relation of Definition 3.1, which prioritises those inputs whose outputs have a lower associated loss (better estimates). To compute $\mathrm{Eff}[\mathrm{In}]$ we apply the formula proved in Theorem 6.3, which follows from Theorem 6.2, where the existence of a utility function $u$ quantifying the preference relation was proved. This is

$$\mathrm{Eff}[\mathrm{In}]=\sum_{i=1}^{D}p_o[X^t=x_i^t]\cdot u(x_i^t)=p\cdot\sum_{i=1}^{D}p_o[X^t=x_i^t],$$
where $p$ is the probability that $|z_i^t-y|<M$ for all $z_i^t$, for a given threshold $M$ below which the loss function is considered acceptable.

The data sets we shall use here contain prices (€/1 kg) over time for the most common olive oil varieties. These data are available on the website of the Government of Spain:

https://www.mapa.gob.es/es/agricultura/temas/producciones-agricolas/aceite-oliva-y-aceituna-mesa/Evolucion_precios_AO_vegetales.aspx,

where prices are published on a weekly basis. Specifically, we will consider the data sets corresponding to weeks 28/2023 and 39/2022 (the reasons for this choice will be explained later). We shall thus apply the above formula, taking into account the following considerations. On the one hand, the choice of the threshold $M$ depends on the context: in the olive oil market scenario, a deviation of 10% from the olive oil price is acceptable, hence $M=0.5$. On the other hand, the value of the parameter $p$ depends on several market features which vary over time according to physical (rainfall, pollen levels) and socio-economic (government regulations) circumstances. When real prices are subject to unusual fluctuations (due to changes in the aforementioned circumstances), the prices used in training tasks deviate more from the real ones. In consequence, the probability $p$ that $|z_i^t-y|<M$ for all $z_i^t$ decreases. Either way, it should be noted that $p$ is known as soon as the value of $M$ is fixed, but it is not the same for all input sets.

From the above formula, since $M$ is known (and therefore so is $p$), we must focus on computing $p_o[X^t=x_i^t]$ for each input $x_i^t$. To that end, we follow Fig. 2 of the TD-MRF operational process given in Section 4.1. Since each input $x_i^t$ may be disaggregated into $d$ features $x_i^{k,t}$ ($k=1,\dots,d$), $x_i^t=(x_i^{1,t},x_i^{2,t},\dots,x_i^{d,t})$, the same is true for the corresponding probability $p_o[X^t=x_i^t]=p_o[x_i^t]$, which is $p_o[x_i^t]=(p_o[x_i^{1,t}],p_o[x_i^{2,t}],\dots,p_o[x_i^{d,t}])$.

We assume that the energy functions $\psi$ corresponding to the $r$ cliques are of Gaussian type, i.e. $\psi_{C_j}=\frac{([\,\cdot\,]-\mu_j)^2}{\sigma_j^2}$, $j=1,\dots,r$, since Gaussian distributions are suitable in the olive oil scenario. Indeed, given that Gaussian distributions portray data sets in which the majority of elements revolves around the centre, energy functions of Gaussian type are particularly suited for goods whose prices take values in an interval of small length and do not suffer very sharp variations.

In order to achieve our goal, each input $x_i^t$ must first be processed in each clique $C_j$, $j=1,\dots,r$, through the corresponding clique potential, whose result is

$$P_{C_j}[X^t=x_i^t]=\frac{1}{Z}\,e^{-\frac{(x_i^t-\mu_j)^2}{\sigma_j^2}} \;\Longrightarrow\; P_{C_j}[X^t=x_i^t]=e^{-\frac{(x_i^t-\mu_j)^2}{\sigma_j^2}}, \quad\text{when } Z=1.$$
Once the processing in the cliques has been completed, the required probability is obtained as
$$p_o[x_i^t]=P[X^t=x_i^t]=\frac{1}{Z}\prod_{j=1}^{r}P_{C_j}[X^t=x_i^t] \;\Longrightarrow\; P[X^t=x_i^t]=\prod_{j=1}^{r}e^{-\frac{(x_i^t-\mu_j)^2}{\sigma_j^2}}, \quad\text{when } Z=1.$$
The computation of the partition function $Z$ entails certain difficulties in practice. For this reason, we adopt the view commonly taken in the literature and set $Z=1$.
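Under the convention $Z=1$, the two formulas above reduce to an exponential and a product. The following sketch assumes the Gaussian energies of this section; the function names are ours:

```python
import math

def clique_potential(x, mu, sigma2):
    # Gibbs factor e^{-psi_Cj(x)} for the Gaussian energy psi_Cj = (x - mu_j)^2 / sigma_j^2.
    return math.exp(-(x - mu) ** 2 / sigma2)

def p_o(x, cliques):
    # Unnormalised probability (Z = 1): product of the r clique potentials.
    prob = 1.0
    for mu, sigma2 in cliques:
        prob *= clique_potential(x, mu, sigma2)
    return prob
```

Each factor lies in $(0,1]$ and equals 1 only when $x$ coincides with the clique mean, so $p_o$ decays rapidly for inputs far from the means.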

From the TD-MRF structure proved in Theorem 4.4, cliques gather those random variables with identical prior and posterior distributions (see Remark 4.6). This theoretical description fits the major specialist retailers in the olive oil context. In this practical case, the level of efficiency is computed through $P_{C_j}[X^t=x_i^t]$ of the cliques $C_j$, $j=1,\dots,7$, supported by the information given in Table 8 below (source: https://www.olimerca.com/precios/tipoInforme/3).

Table 1

In1: Processing in the clique C1.

Table 2

In1: Processing in the clique C2.


As discussed before, there are multiple factors (physical and socio-economic) that influence the price. Such factors are the features $x_i^{k,t}$ ($k=1,\dots,d$) of each input $x_i^t=(x_i^{1,t},x_i^{2,t},\dots,x_i^{d,t})$. For simplicity, we focus on just one of them in order to choose the two input sets $\mathrm{In}_1$, $\mathrm{In}_2$: the hydrographic index, which reflects the average rainfall over a certain period of time. Along these lines, $\mathrm{In}_1$ is an input set of prices (€/1 kg) under severe drought conditions (and therefore with unusual fluctuations in price, as product shortages raise prices3) while $\mathrm{In}_2$ is an input set of prices corresponding to a usual period of rainfall:

$$\mathrm{In}_1=\{3.16,\,5.92,\,6.26,\,6.52,\,7.10\},\quad p=0.23,\quad\text{week }28/2023,$$
$$\mathrm{In}_2=\{2.71,\,3.78,\,3.80,\,3.86,\,3.96\},\quad p=0.57,\quad\text{week }39/2022.$$
With this choice of $\mathrm{In}_1$ and $\mathrm{In}_2$, it is to be expected that the inputs in $\mathrm{In}_1$ will produce worse estimators (higher associated loss), since such inputs reflect prices with unusual fluctuations (and therefore with higher deviation from the mean). Hence, it is to be expected that $\mathrm{Eff}[\mathrm{In}_2]>\mathrm{Eff}[\mathrm{In}_1]$.
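The whole computation can be reproduced numerically from the clique parameters of Table 8, with the values $p=0.23$ and $p=0.57$ above and the convention $Z=1$. The sketch below is our own code, not part of the original method description; it recovers, for instance, the reported $p_o[X^t=3.16]\approx 2.09102\mathrm{E}{-}12$ and confirms $\mathrm{Eff}[\mathrm{In}_2]>\mathrm{Eff}[\mathrm{In}_1]$.

```python
import math

# (mu_j, sigma_j^2) for the cliques C1..C7 (Table 8).
CLIQUES = [(3.988, 0.172), (3.884, 0.156), (3.851, 0.143),
           (3.858, 0.108), (4.343, 0.472), (3.878, 0.134), (3.916, 0.117)]

def p_o(x):
    # Unnormalised probability with Z = 1: product of Gaussian clique potentials.
    return math.prod(math.exp(-(x - mu) ** 2 / s2) for mu, s2 in CLIQUES)

def efficiency(inputs, p):
    # Eff[In] = p * sum_i p_o[X^t = x_i^t]  (Theorem 6.3).
    return p * sum(p_o(x) for x in inputs)

eff1 = efficiency([3.16, 5.92, 6.26, 6.52, 7.10], p=0.23)  # In1, week 28/2023 (drought)
eff2 = efficiency([2.71, 3.78, 3.80, 3.86, 3.96], p=0.57)  # In2, week 39/2022 (usual rainfall)
```

With these parameters, eff1 is of order 1E−13 while eff2 is close to 0.96, so the drought-period set $\mathrm{In}_1$ is discarded in favour of $\mathrm{In}_2$.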

The computation of the level of efficiency of $\mathrm{In}_1$ is supported by the information provided in Tables 1–7.

Table 3

In1: Processing in the clique C3.

Table 4

In1: Processing in the clique C4.

Table 5

In1: Processing in the clique C5.

Table 6

In1: Processing in the clique C6.

Table 7

In1: Processing in the clique C7.

Table 8

Mean and variance of cliques C1C7.

Mean, Variance | C1 Ahorramas | C2 Alcampo | C3 Carrefour | C4 Dia | C5 Hipercor | C6 Lidl | C7 Mercadona
$\mu_j$ | 3.988 | 3.884 | 3.851 | 3.858 | 4.343 | 3.878 | 3.916
$\sigma_j^2$ | 0.172 | 0.156 | 0.143 | 0.108 | 0.472 | 0.134 | 0.117

From the information provided by the above tables, the required probability is finally computed (see Table 9).

$$\mathrm{Eff}[\mathrm{In}_1]=p\cdot\big(p_o[X^t=3.16]+p_o[X^t=5.92]+p_o[X^t=6.26]+p_o[X^t=6.52]+p_o[X^t=7.10]\big)=0.23\cdot(2.09102\mathrm{E}{-}12+2.55039\mathrm{E}{-}74+3.7242\mathrm{E}{-}101+1.7143\mathrm{E}{-}124+2.4066\mathrm{E}{-}185)\approx 0.23\cdot 2.09102\mathrm{E}{-}12\approx 4.8093\mathrm{E}{-}13.$$

Table 9

Aggregated probability for In1.


Similarly, the computation of $\mathrm{Eff}[\mathrm{In}_2]$ is supported by the information provided in Tables 10–16.

Table 10

In2: Processing in the clique C1.

Table 11

In2: Processing in the clique C2.

Table 12

In2: Processing in the clique C3.

Table 13

In2: Processing in the clique C4.

Table 14

In2: Processing in the clique C5.

Table 15

In2: Processing in the clique C6.


Finally, the required probability is computed (see Table 17).

Table 16

In2: Processing in the clique C7.


$$\mathrm{Eff}[\mathrm{In}_2]=p\cdot\big(p_o[X^t=2.71]+p_o[X^t=3.78]+p_o[X^t=3.80]+p_o[X^t=3.86]+p_o[X^t=3.96]\big)=0.57\cdot(3.24825\mathrm{E}{-}30+0.268808808+0.337851876+0.53631732+0.549630567)=0.57\cdot 1.692608571\approx 0.9648.$$

Table 17

Aggregated probability for In2.


Hence, as expected (since the inputs of $\mathrm{In}_1$ reflect prices with unusual fluctuations and therefore with higher deviation from the mean), $\mathrm{Eff}[\mathrm{In}_2]>\mathrm{Eff}[\mathrm{In}_1]$.

8Conclusions

This paper deals with the selection of optimal training sets (those with a higher capacity as estimators) for recurrent neural networks in prediction tasks (or pattern recognition with time series as inputs), although the method may also apply to other data-driven models governed by dynamic systems. Our objective is to fill the existing gap in clear guidelines for selecting optimal training sets in a general context.

We design here a novel methodology to select optimal training data sets that can be used in any context. The key idea, which underpins the design of the mathematical structure that supports the selection, is a binary relation that gives preference to inputs with higher estimator abilities. A second novelty of our approach is the use of dynamic tools that have not been used previously for this purpose: dynamic Markov networks, which are widely regarded as generative models, successfully compute the prior probabilities involved in the formula for calculating the degree of efficiency of the training set (Theorem 6.3), derived from the application of the Von Neumann-Morgenstern Theorem 2.1. It is precisely the VNM theorem that confers discriminative capacity on the Markov networks: in this work we show that the preference relation that we define between inputs of a training set (inputs with higher learning capacities are preferred, in the sense that the error function takes lower values) fulfils the hypotheses necessary to derive the existence of a simple formula for the calculation of the utility (efficiency) of a training set.

The simplicity of this calculation allows it to be carried out in parallel with the learning process without adding computational cost. The optimal sets are thus selected as the learning process evolves, so the data noise gradually disappears, which decreases the likelihood of overfitting.

Declarations of interest: none.

Notes

1 Overfitting in data-driven learning models (which extract a predictive model from a data set) is the flaw of failing to generalise the features/patterns present in the training dataset. It occurs in models which extract features from datasets having too much noise.

2 We are using the term “site” instead of node for its additional connotation of location, which allows us to simulate that each of the nodes is a different geographical place that could generate its own estimate of the variable as explained below, in Proposition 4.7.

3 In dry seasons, water shortages lead to a drop in the olive production and, therefore, in the olive oil production. Hence, under severe drought conditions, olive oil prices skyrocket.

References

1 

Chen, A.N. (2006). Robust optimization for performance tuning of modern database systems. European Journal of Operational Research, 171, 412–429. https://doi.org/10.1016/j.ejor.2011.03.043.

2 

Chou, P., Chuang, H.H., Chou, Y., Liang, T. (2022). Predictive analytics for customer repurchase: Interdisciplinary integration of buy till you die modeling and machine learning. European Journal of Operational Research, 296, 635–651. https://doi.org/10.1016/j.ejor.2021.04.021.

3 

Delbaen, F., Drapeau, S., Kupper, M. (2011). A von Neumann-Morgenstern representation result without weak continuity assumption. Journal of Mathematical Economics, 47, 401–408. https://doi.org/10.1016/j.jmateco.2011.04.002.

4 

Dynkin, E. (1984). Gaussian and non-Gaussian random fields associated with Markov processes. Journal of Functional Analysis, 55, 344–376. https://doi.org/10.1016/0022-1236(84)90004-1.

5 

Fernandez Anitzine, I., Romo Argota, J.A., Fontan, F.P. (2012). Influence of training set selection in artificial neural network based propagation path loss predictions. International Journal of Antennas and Propagation, 2012. https://doi.org/10.1155/2012/351487.

6 

García Cabello, J. (2021). A novel intelligent system for securing cash levels using Markov random fields. International Journal of Intelligent Systems, 36, 4468–4490. https://doi.org/10.1002/int.22467.

7 

García Cabello, J. (2023). Improved deep neural network performance under dynamic programming mode. Preprint. https://doi.org/10.2139/ssrn.4410415.

8 

Gordon, J., Hernandez-Lobato, J.M. (2020). Combining deep generative and discriminative models for Bayesian semi-supervised learning. Pattern Recognition, 100, 107156. https://doi.org/10.1016/j.patcog.2019.107156.

9 

Higham, C.F., Higham, D.J. (2019). Deep learning: an introduction for applied mathematicians. SIAM Review, 61, 860–891. https://doi.org/10.1137/18M1165748.

10 

Jiang, L., Liao, H. (2022). Bounded rational reciprocal preference relation for decision making. Informatica, 33, 731–748. https://doi.org/10.15388/23-INFOR511.

11 

Kim, K. (2006). Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications, 30, 519–526. https://doi.org/10.1016/j.eswa.2005.10.007.

12 

Machina, M.J. (1982). Expected utility analysis without the independence axiom. Econometrica: Journal of the Econometric Society, 50(2), 277–323. https://doi.org/10.2307/1912631.

13 

Mirjalili, S., Hashim, S.Z.M., Sardroudi, H.M. (2012). Training feedforward neural networks using hybrid particle swarm optimization and gravitational search algorithm. Applied Mathematics and Computation, 218, 11125–11137. https://doi.org/10.1016/j.amc.2012.04.069.

14 

Nalepa, J., Myller, M., Piechaczek, S., Hrynczenko, K., Kawulok, M. (2018). Genetic selection of training sets for (not only) artificial neural networks. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (Eds.), Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety, BDAS 2018, Communications in Computer and Information Science, Vol. 928. Springer, Cham. https://doi.org/10.1007/978-3-319-99987-6_15.

15 

Pollak, R.A. (1967). Additive von Neumann-Morgenstern utility functions. Econometrica: Journal of the Econometric Society, 35(3–4), 485–494. https://doi.org/10.2307/1905650.

16 

Reeves, C.R., Bush, D.R. (2001). Using genetic algorithms for training data selection in RBF networks. In: Liu, H., Motoda, H. (Eds.), Instance Selection and Construction for Data Mining, The Springer International Series in Engineering and Computer Science, Vol. 608. Springer, Boston, MA, pp. 339–356. https://doi.org/10.1007/978-1-4757-3359-4_19.

17 

Reeves, C.R., Taylor, S.J. (1998). Selection of training data for neural networks by a genetic algorithm. In: Parallel Problem Solving from Nature – PPSN V: 5th International Conference, Amsterdam, The Netherlands, September 27–30, 1998, Proceedings. Springer, pp. 633–642. https://doi.org/10.1007/BFb0056905.

18 

Smale, S., Rosasco, L., Bouvrie, J., Caponnetto, A., Poggio, T. (2010). Mathematics of the neural response. Foundations of Computational Mathematics, 10, 67–91. https://doi.org/10.1007/s10208-009-9049-1.

19 

Van Den Brink, R., Rusinowska, A. (2022). The degree measure as utility function over positions in graphs and digraphs. European Journal of Operational Research, 299, 1033–1044. https://doi.org/10.1016/j.ejor.2021.10.017.

20 

Wang, L., Zhou, Y., Li, R., Ding, L. (2022). A fusion of a deep neural network and a hidden Markov model to recognize the multiclass abnormal behavior of elderly people. Knowledge-Based Systems, 252, 109351. https://doi.org/10.1016/j.knosys.2022.109351.

21 

Wong, D.F., Lu, Y., Chao, L.S. (2016). Bilingual recursive neural network based data selection for statistical machine translation. Knowledge-Based Systems, 108, 15–24. https://doi.org/10.1016/j.knosys.2016.05.003.

22 

Yang, J., Qiu, W. (2005). A measure of risk and a decision-making model based on expected utility and entropy. European Journal of Operational Research, 164, 792–799. https://doi.org/10.1016/j.ejor.2004.01.031.

23 

Zapf, F., Wallek, T. (2021). Comparison of data selection methods for modeling chemical processes with artificial neural networks. Applied Soft Computing, 113, 107938. https://doi.org/10.1016/j.asoc.2021.107938.

24 

Zhang, H., Wang, Z., Liu, D. (2014). A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Transactions on Neural Networks and Learning Systems, 25, 1229–1262. https://doi.org/10.1109/TNNLS.2014.2317880.

25 

Zhang, L., Suganthan, P.N. (2016). A survey of randomized algorithms for training neural networks. Information Sciences, 364, 146–155. https://doi.org/10.1016/j.ins.2016.01.039.