Neural network-based formula for shear capacity prediction of one-way slabs under concentrated loads

According to current codes and guidelines, the shear assessment of existing reinforced concrete slab bridges sometimes leads to the conclusion that the bridge under consideration has insufficient shear capacity. The calculated shear capacity, however, does not account for the transverse load redistribution capacity of slabs, thus leading to overconservative values. This paper proposes an artificial neural network (ANN)-based formula to estimate the shear capacity of one-way reinforced concrete slabs under a concentrated load, based on 287 test results gathered from the literature. The proposed model yields maximum and mean relative errors of 0.0% for the 287 data points. Moreover, it was shown to clearly outperform (mean Vtest / VANN = 1.00) the Eurocode 2 provisions (mean VE,EC / VR,c = 1.59) for that dataset. A step-by-step assessment scheme for reinforced concrete slab bridges by means of the ANN-based model is also proposed, improving on the current assessment procedures.


Introduction
The study of the dynamic behavior of beams on foundations subjected to moving loads, with possible applications in high-speed railway track design, has been a topic of interest in the literature. In particular, the existence of a critical velocity of the load, for which the beam's displacements reach their highest values, has received special attention. In this study the foundation includes a continuous distribution of friction dampers, whose reaction per unit length reads $r(x,t) = f_u\,\mathrm{Sign}(\dot{w}(x,t))$, where (i) $f_u$ is the maximum friction force per unit length, (ii) $\dot{w}(x,t)$ is the transverse velocity of the cross section, and (iii) $\mathrm{Sign}(\dot{w}(x,t)) = \dot{w}(x,t)/|\dot{w}(x,t)|$ if $\dot{w}(x,t) \neq 0$ and $\mathrm{Sign}(\dot{w}(x,t)) = [-1, +1]$ if $\dot{w}(x,t) = 0$. The expression of the reaction force is an algebraic inclusion (Glocker 2001, Studer 2009), meaning that at the instants of vanishing velocity the reaction may belong to an interval, and at the instants of velocity sign change the reaction per unit length is discontinuous. This reaction is very different from the one provided by a continuous distribution of traditional linear viscous dampers, $r(x,t) = c\,\dot{w}(x,t)$, where c is the viscous damping coefficient per unit length. In both cases the reaction opposes the velocity but, while viscous damping provides a reaction that is proportional to the local velocity itself, the frictional reaction is limited to the interval $[-f_u, +f_u]$ and is independent of the magnitude of the velocity $\dot{w}(x,t)$ (see Fig. 2 in Toscano Corrêa et al. 2018). In that study, a time-stepping algorithm specially designed to deal with non-smooth dynamical systems was applied for the first time to beams on distributed friction foundations, and new conclusions on critical velocities, maximal displacements and dynamic amplification factors were drawn. Since the FE analyses in Toscano Corrêa et al. (2018) are (i) very time consuming (thus unfeasible for fast engineering estimations) and (ii) require advanced know-how and software that go beyond the available resources of typical civil engineering firms, this paper aims to demonstrate the potential of Artificial Neural Networks (ANN) to effectively predict the maximum displacements in railway tracks on frictional foundations, as a function of the frictional parameter (fu) and load velocity (v). This is an important step towards the future development of much more versatile ANN-based analytical models for the same type of problem, differing by the inclusion of more independent (input) variables, such as the foundation stiffness modulus k, the applied load magnitude F, and the geometrical/mechanical properties of the railway beam (see Fig. 1).
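To make the contrast concrete, the following MATLAB sketch (with assumed values for fu and c) evaluates both reaction laws over a range of transverse velocities; away from $\dot{w} = 0$, the set-valued Sign reduces to the ordinary sign function.

```matlab
% Friction vs. viscous reaction per unit length, as functions of the
% transverse velocity wdot (fu and c values are assumptions for illustration).
fu   = 5e3;                        % max friction force per unit length [N/m]
c    = 1e3;                        % viscous damping coefficient per unit length
wdot = linspace(-0.5, 0.5, 1001);  % transverse velocity range [m/s]

r_viscous  = c * wdot;             % proportional to the local velocity
r_friction = fu * sign(wdot);      % bounded by [-fu, +fu], magnitude-independent

plot(wdot, r_viscous, wdot, r_friction)
xlabel('wdot [m/s]'), ylabel('reaction per unit length [N/m]')
legend('viscous: c\cdotwdot', 'friction: fu\cdotSign(wdot)')
```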

FE-based model and data gathering
The authors considered a horizontal simply supported linear elastic Euler-Bernoulli beam (see Fig. 1), whose geometrical and mechanical properties are given in Table 1 and correspond to those of a UIC60 rail. Previous studies (Dimitrovová and Rodrigues 2012, Castro Jorge et al. 2015a, b) showed that a 200 m simply supported beam is a good finite-length model to approximate the behavior of an infinite beam on an elastic foundation with a single moving load. The beam is assumed to be connected to a fixed foundation bed by a system of linear elastic springs, with stiffness per unit length denoted by k, associated in parallel with a continuous distribution of friction dampers with a maximum force per unit length fu. A downward concentrated force F = 83.4 kN, corresponding to half of the load per axle of a Thalys high-speed train locomotive, acts on the beam moving from left to right at a constant velocity v (numerical results considered v ranging between 50 m/s and 300 m/s at intervals of 5 m/s). The motion of the beam is governed by a partial differential inclusion (eq. (2) in Toscano Corrêa et al. (2018)).

Introduction
Machine learning, one of the six disciplines of Artificial Intelligence (AI) without which the task of having machines acting humanly could not be accomplished, allows us to 'teach' computers how to perform tasks by providing examples of how they should be done (Hertzmann and Fleet 2012). When there is abundant data (also called examples or patterns) explaining a certain phenomenon, but the theory behind it is poor or absent, machine learning can be a useful tool. The world is quietly being reshaped by machine learning, the Artificial Neural Network (also referred to in this manuscript as ANN or neural net) being its (i) oldest (McCulloch and Pitts 1943) and (ii) most powerful (Hern 2016) technique. ANNs also lead the number of practical applications, virtually covering any field of knowledge (Wilamowski and Irwin 2011, Prieto et al. 2016). In its most general form, an ANN is a mathematical model designed to perform a particular task, inspired by the way the brain processes information, i.e. based on its processing units (the neurons). ANNs have been employed to perform several types of real-world basic tasks. Concerning functional approximation, ANN-based solutions are frequently more accurate than those provided by traditional approaches, such as multi-variate nonlinear regression, besides not requiring good knowledge of the shape of the function being modeled (Flood 2008). The general ANN structure consists of several nodes disposed in L vertical layers (input layer, hidden layers, and output layer) and connected between them, as depicted in Fig. 2. Associated with each node in layers 2 to L, also called a neuron, is a linear or nonlinear transfer (also called activation) function, which receives the so-called net input and transmits an output (as depicted later in Fig. 5). All ANNs implemented in this work are called feedforward, since data presented in the input layer flows in the forward direction only, i.e. every node only connects to nodes belonging to layers located to the right-hand side of its own layer, as shown in Fig. 2. ANNs' computing potential makes them suitable to efficiently solve small- to large-scale complex problems, which can be attributed to their (i) massively parallel distributed structure and (ii) ability to learn and generalize, i.e., to produce reasonably accurate outputs for inputs not used during the learning (also called training) phase.

Learning
Each connection between two nodes is associated with a synaptic weight (a real value), which, together with each neuron's bias (also a real value), constitutes the most common type of neural net unknown parameter to be determined through learning. Learning is nothing else than determining the network's unknown parameters through some algorithm in order to minimize the network's performance measure, typically a function of the difference between predicted and target (desired) outputs. When ANN learning has an iterative nature, it consists of three phases: (i) training, (ii) validation, and (iii) testing. From previous knowledge, examples or data points are selected to train the neural net, grouped in the so-called training dataset. Those examples are said to be 'labeled' or 'unlabeled' depending on whether they consist of inputs paired with their targets or just of the inputs themselves; learning is called supervised (e.g., functional approximation, classification) or unsupervised (e.g., clustering) depending on whether the data used is labeled or unlabeled, respectively. During iterative learning, while the training dataset is used to tune the network unknowns, a process of cross-validation takes place by using a set of data completely distinct from the training counterpart (the validation dataset), so that the generalization performance of the network can be assessed. Once 'optimum' network parameters are determined, typically associated with a minimum of the validation performance curve (called early stop; see Fig. 3), many authors still perform a final assessment of the model's accuracy by presenting to it a third, fully distinct dataset called 'testing'. Heuristics suggest that early stopping avoids overfitting, i.e. the loss of the ANN's generalization ability. One of the causes of overfitting might be learning too many input-target examples suffering from data noise, since the network might learn some of their features that do not belong to the underlying function being modeled (Haykin 2009).
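A minimal sketch of the early-stop logic just described is given below, assuming hypothetical helper routines trainOneEpoch and evalError (their internals are not specified at this point in the text):

```matlab
% Early stopping: keep the parameters at the validation-error minimum and
% stop once the validation error has not improved for 'patience' epochs.
bestValErr = Inf;  patience = 50;  stall = 0;
for epoch = 1:maxEpochs
    net    = trainOneEpoch(net, Xtrain, Ttrain);  % hypothetical training step
    valErr = evalError(net, Xval, Tval);          % hypothetical error measure
    if valErr < bestValErr
        bestValErr = valErr;  bestNet = net;  stall = 0;  % new 'optimum'
    else
        stall = stall + 1;
        if stall >= patience, break, end          % early stop
    end
end
```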

Implemented ANN features
The 'behavior' of any ANN depends on many 'features'. Fifteen of them were implemented in this work (including data pre/post-processing ones). For those features, it is important to bear in mind that no ANN guarantees good approximations via extrapolation (either in functional approximation or classification problems), i.e. the implemented ANNs should not be applied outside the input variable ranges used for network training. Since there are no objective rules dictating which method per feature guarantees the best network performance for a specific problem, an extensive parametric analysis (composed of nine parametric sub-analyses) was carried out to find 'the optimum' net design. A description of all implemented methods, selected from the state-of-the-art literature on ANNs (including both traditional and promising modern techniques), is presented next; Tables 2-4 present all features and methods per feature. The whole work was coded in MATLAB (The Mathworks, Inc. 2017), making use of its neural network toolbox when dealing with popular learning algorithms (1-3 in Table 4). Each parametric sub-analysis (SA) consists of running all feasible combinations (also called 'combos') of pre-selected methods for each ANN feature, in order to get performance results for each designed net, thus allowing the selection of the best ANN according to a certain criterion. The best network in each parametric SA is the one exhibiting the smallest average relative error (called performance) for all learning data. The most widely used form of dimensional analysis is Buckingham's π-theorem, which was implemented in this work as described in Bhaskar and Nigam (1990). When designing any ANN, it is crucial for its accuracy that the input variables are independent and relevant to the problem (Gholizadeh et al. 2011, Kasun et al. 2016). There are two types of dimensionality reduction, namely (i) feature selection (a subset of the original set of input variables is used) and (ii) feature extraction (transformation of the initial variables into a smaller set). In this work, dimensionality reduction is never performed when the number of input variables is less than six. The implemented methods are described next.

Linear Correlation
In this feature selection method, all possible pairs of input variables are assessed with respect to their linear dependence, by means of the Pearson correlation coefficient $R_{XY}$, where X and Y denote any two distinct input variables. For a set of n data points $(x_i, y_i)$, the Pearson correlation is defined by

$R_{XY} = \dfrac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$,  (1)

where Var(X) and Cov(X, Y) are the variance of X and the covariance of X and Y, respectively. Concerning the learning algorithm used for all AEs (autoencoders), no L2 weight regularization was employed, which was the only default specification not adopted in 'trainAutoencoder(…)'.
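As an illustration of this feature selection step, the sketch below (with dummy data and an assumed collinearity threshold) flags input variable pairs whose |RXY| is close to 1:

```matlab
% Pairwise Pearson correlations among input variables (rows of Y1).
Y1  = rand(5, 100);                   % 5 input variables x 100 patterns (dummy)
R   = corrcoef(Y1');                  % corrcoef works column-wise, hence Y1'
thr = 0.99;                           % assumed threshold for 'linear dependence'
[i, j] = find(triu(abs(R), 1) > thr); % strongly correlated variable pairs
disp([i j])                           % candidates for removal
```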

Orthogonal and Sparse Random Projections
This is another feature extraction technique, aiming to reduce the dimension of the input data Y1 (Q1 x P) while retaining the Euclidean distance between data points in the new feature space. This is attained by projecting all data along the (i) orthogonal or (ii) sparse random matrix A (Q1 x Q2, Q2 < Q1), as described by Kasun et al. (2016).
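A sketch of the idea follows; building the orthogonal A via the QR factorization of a Gaussian random matrix is an assumption here (the exact construction of Kasun et al. (2016) is not reproduced):

```matlab
% Random projection of Q1-dimensional input data onto Q2 < Q1 dimensions.
Q1 = 10;  Q2 = 4;  P = 200;
Y1 = rand(Q1, P);                % original input data (dummy)
[A, ~] = qr(randn(Q1, Q2), 0);   % Q1 x Q2 matrix with orthonormal columns
Y1red = A' * Y1;                 % projected data (Q2 x P)
```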

Training, Validation and Testing Datasets (feature 4)
Four distributions of data (methods) were implemented, namely pt-pv-ptt = {80-10-10, 70-15-15, 60-20-20, 50-25-25}, where pt-pv-ptt represent the amounts of training, validation and testing examples as percentages of all learning data (P), respectively. Aiming to divide the learning data into training, validation and testing subsets according to a predefined distribution pt-pv-ptt, the following algorithm was implemented (all variables are involved in these steps, including qualitative ones after conversion to numeric; see 3.3.1); a code sketch is given after this list: 1) For each variable q (row) in the complete input dataset, compute its minimum and maximum values.
2) Select all patterns (if some) from the learning dataset where each variable takes either its minimum or maximum value. Those patterns must be included in the training dataset, regardless what pt is. However, if the number of patterns 'does not reach' pt, one should add the missing amount, provided those patterns are the ones having more variables taking extreme (minimum or maximum) values.
3) In order to select the validation patterns, randomly select pv / (pv + ptt) of those patterns not belonging to the previously defined training dataset. The remainder defines the testing dataset.
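A simplified sketch of steps 1-3 follows (dummy data; topping up the training set with the first available patterns, rather than ranking them by the number of extreme values they hold, is a simplification):

```matlab
% Split learning data (columns of Y1) into training/validation/testing.
pt = 0.8;  pv = 0.1;  ptt = 0.1;           % assumed 80-10-10 distribution
Y1 = rand(4, 100);  P = size(Y1, 2);
isExt    = any(Y1 == min(Y1,[],2) | Y1 == max(Y1,[],2), 1);  % steps 1-2
idxTrain = find(isExt);                    % extreme patterns go to training
rest     = find(~isExt);
need     = max(0, round(pt*P) - numel(idxTrain));
idxTrain = [idxTrain, rest(1:need)];       % top up training (simplified)
rest     = rest(need+1:end);
rest     = rest(randperm(numel(rest)));    % step 3: random selection
nVal     = round(numel(rest) * pv/(pv+ptt));
idxVal   = rest(1:nVal);  idxTest = rest(nVal+1:end);
```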
It might happen that the actual distribution pt-pv-ptt is not equal to the one imposed a priori (before step 1), due to the minimum number of training patterns required in step 2.

Input Normalization (feature 5)
The progress of training can be impaired if the training data defines a region that is relatively narrow in some dimensions and elongated in others, which can be alleviated by normalizing each input variable across all data patterns. The implemented techniques are the following: Lachtermacher and Fuller (1995) proposed a simple normalization technique, which is one of those implemented.

Nonlinear
Proposed by Pu and Mesbahi (2006), although in the context of output normalization, this is the only nonlinear normalization method implemented for input data.

Output Transfer Functions (feature 6)
The implemented Bilinear function is defined as in eq. (9), whereas the Identity activation, often employed in output neurons, reads $\varphi(s) = s$. (10)

Output Normalization (feature 7)
Normalization can also be applied to the output variables so that, for instance, the amplitude of the solution surface at each variable is the same. Otherwise, training may tend to focus (at least in the earlier stages) on the solution surface with the largest amplitude (Flood and Kartam 1994a).
Normalization ranges not including the zero value might be a useful alternative, since convergence issues may arise due to the presence of many small (close to zero) target values (Mukherjee et al. 1996). Four normalization methods were implemented. The first three follow previously defined expressions, respectively, whereas the fourth normalization method implemented is the one described by eq. (6).

Multi-Layer Perceptron Network (MLPN)
This is a feedforward ANN exhibiting at least one hidden layer. Fig. 2 depicts a 3-2-1 MLPN (3 input nodes, 2 hidden neurons and 1 output neuron), where units in each layer link only to some nodes located ahead. At this point, it is appropriate to define the concepts of partially- (PC) and fully-connected (FC) ANNs. In this work, a FC feedforward network is characterized by having each node connected to every node in any layer placed forward; any other type of network is said to be PC (e.g., the one in Fig. 2). According to Wilamowski (2009), PC MLPNs are less powerful than MLPNs where connections across layers are allowed, which usually lead to smaller networks (fewer neurons).
Each input-layer node simply passes forward the corresponding input value, i.e. $y_{m1p}$ is the value of the m-th network input concerning example p. The output of a generic neuron i in layer l can then be written as

$y_{ilp} = \varphi_l\Big(b_{il} + \sum_{j=1}^{Q_{l-1}} w_{j(l-1),\,il}\; y_{j(l-1)p}\Big)$,

where $\varphi_l$ is the transfer function used for all neurons in layer l (l = 2,…, L), $w_{j(l-1),\,il}$ is the synaptic weight connecting node j of layer l-1 to neuron i of layer l, and $b_{il}$ is the neuron's bias.
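A minimal sketch of this forward pass for a fully-connected 2-3-1 MLPN (the weights, biases and transfer functions below are arbitrary illustrations):

```matlab
% Forward propagation through an MLPN: net input -> transfer function, layer by layer.
phi = {@(s) 1./(1 + exp(-s)), @(s) s};   % logistic hidden layer, identity output
W   = {rand(3,2), rand(1,3)};            % weight matrices (layer l-1 -> layer l)
b   = {rand(3,1), rand(1,1)};            % bias vectors
y   = rand(2,1);                         % one input pattern (Q1 = 2)
for l = 1:numel(W)
    y = phi{l}(W{l}*y + b{l});           % the neuron equation, in matrix form
end
disp(y)                                  % network output
```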

Radial-Basis Function Network (RBFN)
Although having similar topologies, RBFNs and MLPNs behave very differently due to distinct hidden neuron models: unlike the MLPN, the RBFN has hidden neurons behaving differently from output neurons. According to Xie et al. (2011), RBFNs (i) are specially recommended in functional approximation problems when the function surface exhibits regular peaks and valleys, and (ii) perform more robustly than MLPNs when dealing with noisy input data. Although traditional RBFNs have 3 layers, a generic multi-hidden-layer (see Fig. 4) RBFN is allowed in this work, the generic hidden neuron model concerning node 'l1l2' (l1 = 1,…, Ql2; l2 = 2,…, L-1) being presented in Fig. 6. In this model, (i) $\xi_{l_1 l_2 p}$ and $\nu_{l_1 l_2}$ (called the RBF center) are vectors of the same size ($\nu_{l_1 l_2 z}$ denotes the z component of vector $\nu_{l_1 l_2}$, and it is a network unknown), the former being associated with the presentation of data pattern p; (ii) $\sigma_{l_1 l_2}$ is called the RBF width (a positive scalar) and also belongs, along with the synaptic weights and RBF centers, to the set of network unknowns to be determined through learning; (iii) $\varphi_{l_2}$ is the user-defined radial basis (transfer) function (RBF), described in eqs. (20)-(23); and (iv) $y_{l_1 l_2 p}$ is the neuron's output when pattern p is presented to the network. In ANNs not involving learning algorithms 1-3 in Table 4, the neuron's net input is the Euclidean distance between $\xi_{l_1 l_2 p}$ and the RBF center, scaled by the RBF width (two versions of $\xi_{l_1 l_2 p}$ were implemented, and the one yielding the best results was selected), whereas the RBFNs implemented through the MATLAB neural net toolbox (involving learning algorithms 1-3 in Table 4) follow the toolbox's own 'radbas' neuron model. Lastly, according to the implementation carried out for initialization purposes (described in 3.3.12), (i) the RBF center vectors per hidden layer (one per hidden neuron) are initialized as integrated in a matrix (termed the RBF center matrix) having the same size as a weight matrix linking the previous layer to that specific hidden layer, and (ii) the RBF widths (one per hidden neuron) are initialized as integrated in a vector (called the RBF width vector) with the same size as a hypothetical bias vector.
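For illustration, the sketch below evaluates one Gaussian-type hidden RBF neuron; dividing the center-to-pattern distance by the width is the scaling convention assumed here (the toolbox 'radbas' model multiplies the distance by a width-related parameter instead):

```matlab
% One RBF hidden neuron: radial basis function of the distance between the
% incoming vector and the RBF center, scaled by the RBF width.
xi    = rand(4,1);               % incoming vector for pattern p (assumed size)
nu    = rand(4,1);               % RBF center (a network unknown)
sigma = 0.7;                     % RBF width (positive scalar, also learned)
s     = norm(xi - nu) / sigma;   % distance-based net input (assumed scaling)
out   = exp(-s^2);               % Gaussian-type RBF output
```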

Hidden Nodes (feature 9)
Inspired by several heuristics found in the literature for the determination of a suitable number of hidden neurons in a single-hidden-layer net (Aymerich and Serra 1998, Rafiq et al. 2001, Xu and Chen 2008), each value in hntest, defined in eq. (15), is tried, where (i) Q1 and QL are the number of input and output nodes, respectively, (ii) P and Pt are the number of learning and training patterns, respectively, and (iii) F13 is the number of feature 13's method (see Table 4).

Connectivity (feature 10)
For this ANN feature, three methods were implemented, namely (i) adjacent layers (only connections between adjacent layers are allowed), (ii) adjacent layers + input-output (only connections between (ii1) adjacent layers and (ii2) the input and output layers are allowed), and (iii) fully-connected (all possible feedforward connections).
Hidden Transfer Functions (feature 11)
Besides the functions defined in 3.3.6 (eqs. (7)-(9)), the ones defined next were also implemented as hidden transfer functions.
During software validation it was observed that some hidden node outputs could be infinite or NaN (not-a-number in MATLAB; e.g., 0/0 = Inf/Inf = NaN), due to numerical issues concerning some hidden transfer functions and/or their calculated input. In those cases, it was decided to convert infinite values to unitary ones and NaNs to zero (the only exception was the bipolar sigmoid function, where NaNs were converted to -1). Another implemented replacement was to convert possible NaN inputs of the Gaussian function to zero. A sketch of this sanitization is given below.
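In MATLAB terms, the replacement amounts to the following (H standing for a matrix of hidden node outputs):

```matlab
H = [0.3 Inf; NaN -0.7];   % example matrix of hidden node outputs
H(isinf(H)) = 1;           % convert infinite values to unitary ones
H(isnan(H)) = 0;           % convert NaNs to zero (-1 for the bipolar sigmoid)
```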

Identity-Logistic
In Gunaratnam and Gero (1994), issues associated with flat spots at the extremes of a sigmoid function were eliminated by adding a linear function to the latter.

Bipolar
The so-called bipolar sigmoid activation function, mentioned in Lefik and Schrefler (2003), reads $\varphi(s) = \dfrac{1 - e^{-s}}{1 + e^{-s}}$.

Positive Saturating Linear
In the MATLAB neural net toolbox, the so-called Positive Saturating Linear transfer function ('satlin') reads $\varphi(s) = 0$ if $s \le 0$, $\varphi(s) = s$ if $0 < s < 1$, and $\varphi(s) = 1$ if $s \ge 1$. Concerning less popular transfer functions, reference is made in Bai et al. (2014) to the sinusoid, which in this work was implemented as $\varphi(s) = \sin(s)$.

Radial Basis Functions (RBF)
Although the Gaussian activation often exhibits desirable properties as an RBF, several authors have also employed alternatives: (i) the Gaussian function is given by eq. (20); (ii) a Gaussian-type function, eq. (21), is employed when learning algorithms 4-7 are used (see Table 4); (iii) the Multiquadratic function is given by eq. (22); and (iv) the Gaussian-type function (called 'radbas' in the MATLAB toolbox) used by RBFNs trained with learning algorithms 1-3 (see Table 4) is defined by $\varphi(s) = e^{-s^2}$ (23), where $\|\dots\|$ denotes the Euclidean distance in all functions.

Parameter Initialization (feature 12)
The initialization of (i) weight matrices (Qa x Qb, where Qa and Qb are the numbers of nodes in the layers a and b being connected, respectively), (ii) bias vectors (Qb x 1), (iii) RBF center matrices (Qc-1 x Qc, where c is the hidden layer the matrix refers to), and (iv) RBF width vectors (Qc x 1), is carried out independently and in most cases with randomly generated values. For each ANN design carried out in the context of each parametric analysis combo, and whenever the parameter initialization method is not 'Mini-Batch SVD', ten distinct simulations varying the (random) initialization values are carried out, in order to find the best solution. The implemented initialization methods are described next.

Rand [-Δ, Δ]
This function is based on the proposal in Waszczyszyn (1999) and generates random numbers with uniform distribution in [-Δ, Δ], Δ being layer-dependent; in its definition, a and b refer to the initial and final layers connected by the matrix being initialized, and L is the total number of layers in the network. In the case of a bias or RBF width vector, Δ is always taken as 0.5.

SVD
Although Deng et al. (2016) proposed this method for a 3-layer network, it was implemented in this work regardless of the number of hidden layers.

Mini-Batch SVD
Based on Deng et al. (2016), this scheme is an alternative version of the former SVD. Now, the training data is split into min{Qb, Pt} chunks (or subsets) of equal size Pti = max{floor(Pt / Qb), 1}, where 'floor' rounds the argument down to the nearest integer, each chunk being used to derive Qbi = 1 hidden node.

Learning Algorithm (feature 13)
The most popular learning algorithm is called error back-propagation (BP), a first-order gradient method. Second-order gradient methods are known to have higher training speed and accuracy (Wilamowski 2011); the most employed is called Levenberg-Marquardt (LM). All these traditional schemes were implemented using the MATLAB toolbox (The Mathworks, Inc. 2017).
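A minimal sketch of toolbox-based LM training follows (the layer sizes and data split ratios below are assumptions for illustration, not the settings of the parametric analyses):

```matlab
% Train a feedforward net with Levenberg-Marquardt via the neural net toolbox.
Y1 = rand(2, 300);  T = sin(4*Y1(1,:)) .* Y1(2,:);  % dummy inputs and targets
net = feedforwardnet([3 3], 'trainlm');             % two hidden layers, LM
net.divideParam.trainRatio = 0.70;                  % assumed 70-15-15 split
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, Y1, T);                      % iterative LM training
Ypred = net(Y1);                                    % simulate the trained net
```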

Performance Improvement (feature 14)
A simple and recursive approach aiming to improve ANN accuracy is called the Neural Network Composite (NNC), as described in Beyer et al. (2006). In this work, a maximum of 10 extra ANNs were added to the original one, stopping as soon as the maximum error is no longer improved between successive NNC solutions. Later in this manuscript, a solution given by a single neural net is denoted as ANN, whereas the other possible solution is called NNC.
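The sketch below shows one possible composite loop consistent with that description; training each extra net on the current residual and summing all outputs is an assumption here, not necessarily the exact scheme of Beyer et al. (2006):

```matlab
% Neural Network Composite: add extra nets (up to 10) while the maximum
% error over the learning data keeps improving.
Y1 = rand(2, 200);  T = sin(4*Y1(1,:)) .* Y1(2,:);          % dummy data
pred = @(nets, X) sum(cell2mat(cellfun(@(n) n(X), nets(:), ...
                      'UniformOutput', false)), 1);         % sum of net outputs
nets    = {train(feedforwardnet(3), Y1, T)};                % original ANN
bestErr = max(abs(T - pred(nets, Y1)));                     % max (absolute) error
for k = 1:10
    res  = T - pred(nets, Y1);                              % current residual
    cand = [nets, {train(feedforwardnet(3), Y1, res)}];     % fit the residual
    err  = max(abs(T - pred(cand, Y1)));
    if err >= bestErr, break, end                           % no improvement
    nets = cand;  bestErr = err;
end
```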

Training Mode (feature 15)
Depending on the relative amount of training patterns, with respect to the whole training dataset, that is presented to the network in each iteration of the learning process, several types of training modes can be used, namely (i) batch or (ii) mini-batch. Whereas in batch mode all training patterns are presented to the network in each iteration (called an epoch), in the mini-batch counterpart the training dataset is split into several data chunks (or subsets) and in each iteration a single and new chunk is presented to the network, until (eventually) all chunks have been presented. Learning involving iterative schemes (e.g., BP- or LM-based) might require many epochs until an 'optimum' design is found. The particular case of a mini-batch mode where all chunks are composed of a single (distinct) training pattern (number of data chunks = Pt, chunk size = 1) is called online or sequential mode. Wilson and Martinez (2003) suggested that, if one wants to use mini-batch training with the same stability as online training, a rough estimate of the suitable learning rate to be used in learning algorithms such as BP is $\eta_{online}/\sqrt{cs}$, where cs is the chunk size and $\eta_{online}$ is the online learning rate; their proposal was adopted in this work. Based on the proposal of Liang et al. (2006), the constant chunk size (cs) adopted for all chunks in mini-batch mode reads cs = min{mean(hn) + 50, Pt}, hn being a vector storing the number of hidden nodes in each hidden layer at the beginning of training, and mean(hn) the average of all values in hn.
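Both rules translate directly into code; the hn, Pt and online learning rate values below are assumptions:

```matlab
% Mini-batch chunk size (Liang et al. 2006) and stability-preserving
% learning rate (Wilson and Martinez 2003).
hn        = [3 3 3];                 % hidden nodes per hidden layer (assumed)
Pt        = 400;                     % number of training patterns (assumed)
cs        = min(mean(hn) + 50, Pt);  % constant chunk size: min{mean(hn)+50, Pt}
etaOnline = 0.01;                    % online learning rate (assumed)
eta       = etaOnline / sqrt(cs);    % mini-batch learning rate estimate
```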

Network Performance Assessment
Several types of results were computed to assess network outputs, namely (i) maximum error, (ii) percentage of errors greater than 3%, and (iii) performance, which are defined next.
For each output q and data pattern p, the relative error is defined by

$e_{qp} = 100\,\dfrac{|d_{qp} - y_{qLp}|}{|d_{qp}|}$,  (25)

where (i) $d_{qp}$ is the q-th desired (or target) output when pattern p within iteration i (p = 1,…, Pi) is presented to the network, and (ii) $y_{qLp}$ is the net's q-th output for the same data pattern. Moreover, the denominator in eq. (25) is replaced by 1 whenever $|d_{qp}| < 0.05$; $d_{qp}$ in the numerator keeps its real value. This exception to eq. (25) aims to reduce the apparent negative effect of large relative errors associated with target values close to zero. Even so, this replacement may still lead to (relatively) large relative errors even when results look accurate in regression plots (target vs. predicted outputs).
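The sketch below evaluates eq. (25) with the stated exception, together with the three performance variables defined in the next sub-sections (dummy target/output vectors):

```matlab
% Relative errors per eq. (25), with the |d| < 0.05 denominator exception.
d = [0.80  0.02 -1.30  0.45];            % target outputs (dummy)
y = [0.78  0.05 -1.25  0.47];            % network outputs (dummy)
den = abs(d);  den(den < 0.05) = 1;      % denominator exception
relErr      = 100 * abs(d - y) ./ den;   % relative errors [%]
maxError    = max(relErr);               % maximum error
pctAbove3   = 100 * mean(relErr > 3);    % percentage of errors > 3%
performance = mean(relErr);              % average relative error
```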

Maximum Error
This variable measures the maximum relative error, as defined by eq. (25), among all output variables and learning patterns.

Percentage of Errors larger than 3%
This variable measures the percentage of relative errors (see eq. (25)) that are larger than 3%, among all output variables and learning patterns.

Performance
In functional approximation problems, network performance is defined as the average relative error, as defined in eq. (25), among all evaluated output variables and data patterns (e.g., training data, all data).

Software Validation
Several benchmark datasets/functions were used to validate the developed software, involving low- to high-dimensional problems and small to large volumes of data. Due to paper length limits, validation results are not presented herein, but they were made publicly available online (Researcher 2018). In spite of the successful validation, several improvements have been implemented since the initial use of the software in the first author's research projects.

Results and Proposed ANN-based Models
Aiming to reduce the computing time by cutting down the number of combos to be run (note that all features combined lead to hundreds of millions of combos), the whole parametric simulation was divided into nine parametric sub-analyses (SAs), where in each one feature 7 only takes a single value. This measure aims to make the performance ranking of all combos within each SA more 'reliable', since the results used for comparison are based on target and output datasets as used in ANN training and as yielded by the designed network, respectively (they are free of any postprocessing that eliminates output normalization effects on relative error values). It is important to note that, in this manuscript, whenever a vector is added to a matrix, it means the former is added to all columns of the latter (valid in MATLAB), as illustrated below.
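For clarity, a two-line illustration of that convention (implicit expansion; bsxfun in MATLAB versions prior to R2016b):

```matlab
A = rand(3, 5);  v = rand(3, 1);
B = A + v;                   % v is added to all 5 columns of A
% B = bsxfun(@plus, A, v);   % equivalent syntax in older MATLAB versions
```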

Negative wmax (v = [50, 175] ∪ [250, 300] m/s)
The ANN feature methods used in the best combo from each of the abovementioned nine parametric SAs are specified in Table 5 (see Tables 2-4). Table 6 shows the corresponding relevant results for those combos and the 481-point final testing dataset (which includes the ANN learning/development dataset), namely (i) maximum error, (ii) percentage of errors larger than 3%, (iii) performance (all described in sub-section 3.4 and evaluated for all learning data), (iv) total number of hidden nodes in the model, and (v) average computing time per example (including data pre- and post-processing). All results shown in Table 6 are based on target and output datasets computed in their original format, i.e. free of any transformations due to output normalization and/or dimensional analysis.
Summing up the ANN feature combinations for all parametric SAs, a total of 219 combos were run for this problem. The proposed model is the one, among the best ones from all parametric SAs, exhibiting the lowest maximum error.

Input Data Preprocessing
For future use of the proposed NNC to simulate new data Y1,sim (a 2 x Psim matrix) concerning Psim patterns, the same data preprocessing (if any) performed before training must be applied to the input dataset. That preprocessing is defined by the methods used for ANN features 2, 3 and 5 (respectively methods 2, 6 and 5; see Table 2). Next, the necessary preprocessing to be applied to Y1,sim, concerning features 2, 3 and 5, is fully described.

Dimensional Analysis and Dimensionality Reduction
Since no dimensional analysis (d.a.) nor dimensionality reduction (d.r.) were carried out, one has $\{Y_{1,sim}\}^{after}_{d.a.,\,d.r.} = Y_{1,sim}$.

ANN-Based Analytical Model
Once the preprocessed input dataset {Y1,sim}n after (a 2 x Psim matrix) is determined, the next step is to present it to the proposed NNC to obtain the predicted output dataset {Y3,sim}n after (a 1 x Psim vector), which will be given in the same preprocessed format as the target dataset used in learning. In order to convert the predicted outputs to their 'original format' (i.e., without any transformation due to normalization or dimensional analysis), some postprocessing might be needed, as discussed in 4.1.3. Next, the mathematical representation of the proposed NNC is given, so that any user can implement it to determine {Y3,sim}n after, thus contributing to diminish the generalized opinion that ANNs are 'black boxes'. Since no output normalization nor dimensional analysis were adopted in the proposed model, one simply has Y3,sim = {Y3,sim}n after.

Performance Results
Finally, the results yielded by the proposed NNC for the 481-point final testing dataset (which includes the ANN learning/development counterpart), in terms of the performance variables defined in sub-section 3.4, are presented in this sub-section in the form of two graphs: (i) a regression plot (Fig. 8), where network target and output data are plotted, for each data point, as x- and y-coordinates, respectively (a measure of quality is given by the Pearson correlation coefficient R, as defined in eq. (1)); and (ii) a plot (Fig. 9) indicating (for all data) the (ii1) maximum error, (ii2) percentage of errors larger than 3%, and (ii3) average error (called performance).

Negative wmax (v = ]175, 250[ m/s)
The ANN feature methods used in the best combo from each of the abovementioned nine parametric SAs are specified in Table 7 (numbers represent the method number as in Tables 2-4). Table 8 shows the corresponding relevant results for those combos and the 208-point final testing dataset (which includes the ANN learning/development dataset), namely (i) maximum error, (ii) percentage of errors larger than 3%, (iii) performance (all described in sub-section 3.4 and evaluated for all learning data), (iv) total number of hidden nodes in the model, and (v) average computing time per example (including data pre- and post-processing). All results shown in Table 8 are based on target and output datasets computed in their original format, i.e. free of any transformations due to output normalization and/or dimensional analysis.
Summing up the ANN feature combinations for all parametric SAs, a total of 204 combos were run for this problem.
The proposed model is the one, among the best ones from all parametric SAs, exhibiting the lowest maximum error (SA 9, a Neural Network Composite (NNC)). Aiming to allow the implementation of this model by any user, all variables/equations required for (i) data preprocessing, (ii) ANN simulation, and (iii) data postprocessing are presented in the following sub-sections. The proposed model is an NNC made of 3 ANNs with RBFN architecture and a distribution of nodes/layer given by 2-3-3-3-1 for every network. Concerning connectivity, all networks are partially-connected (see Fig. 10), and the hidden and output transfer functions are all Gaussian RBF (eq. (23)) and Hyperbolic Tangent (eq. (8)), respectively. All networks were trained using the LM algorithm. After design, the average NNC computing time concerning the presentation of a single example (including data pre/postprocessing) is 7.87E-05 s. Fig. 10. Proposed NNC made of 3 partially-connected RBFNs (simplified scheme).

Input Data Preprocessing
For future use of the proposed NNC to simulate new data Y1,sim (a 2 x Psim matrix) concerning Psim patterns, the same data preprocessing (if any) performed before training must be applied to the input dataset. That preprocessing is defined by the methods used for ANN features 2, 3 and 5 (respectively methods 2, 6 and 5; see Table 2). Next, the necessary preprocessing to be applied to Y1,sim is fully described.
where one recalls that the operator './' divides row i of the numerator by INP(i, 2).

ANN-Based Analytical Model
Once the preprocessed input dataset {Y1,sim}n after (a 2 x Psim matrix) is determined, the next step is to present it to the proposed NNC to obtain the predicted output dataset {Y5,sim}n after (a 1 x Psim vector), which will be given in the same preprocessed format as the target dataset used in learning. To convert the predicted outputs to their 'original format' (i.e., without any transformation due to normalization or dimensional analysis), some postprocessing might be needed, as described in 4.2.3. Next, the mathematical representation of the proposed NNC is given, so that any user can implement it to determine {Y5,sim}n after:

Performance Results
Finally, the results yielded by the proposed NNC for the 208-point final testing dataset (which includes the ANN learning/development counterpart), in terms of the performance variables defined in sub-section 3.4, are presented in this sub-section in the form of two graphs: (i) a regression plot (Fig. 11), where network target and output data are plotted, for each data point, as x- and y-coordinates, respectively; and (ii) a plot (Fig. 12) indicating (for all data) the (ii1) maximum error, (ii2) percentage of errors larger than 3%, and (ii3) average error (called performance).

Positive wmax (v = [50, 175] ∪ [250, 300] m/s)
The ANN feature methods used in the best combo from each of the abovementioned nine parametric SAs are specified in Table 9 (numbers represent the method number as in Tables 2-4). Table 10 shows the corresponding relevant results for those combos and the 481-point final testing dataset (which includes the ANN learning/development dataset), namely (i) maximum error, (ii) percentage of errors larger than 3%, (iii) performance (all described in sub-section 3.4 and evaluated for all learning data), (iv) total number of hidden nodes in the model, and (v) average computing time per example (including data pre- and post-processing). All results shown in Table 10 are based on target and output datasets computed in their original format, i.e. free of any transformations due to output normalization and/or dimensional analysis. Summing up the ANN feature combinations for all parametric SAs, a total of 219 combos were run for this problem. The proposed model is the one, among the best ones from all parametric SAs, exhibiting the lowest maximum error (SA 9). Aiming to allow the implementation of this model by any user, all variables/equations required for (i) data preprocessing, (ii) ANN simulation, and (iii) data postprocessing are presented in the following sub-sections. The proposed model is a single MLPN with 5 layers and a distribution of nodes/layer given by 2-3-3-3-1. Concerning connectivity, the network is fully-connected, and the hidden and output transfer functions are all Logistic (eq. (7)) and Identity (eq. (10)), respectively. The network was trained using the LM algorithm.

Input Data Preprocessing
For future use of the proposed ANN to simulate new data Y1,sim (a 2 x Psim matrix) concerning Psim patterns, the same data preprocessing (if any) performed before training must be applied to the input dataset. That preprocessing is defined by the methods used for ANN features 2, 3 and 5 (respectively methods 2, 6 and 3; see Table 2). In what follows, the necessary preprocessing to be applied to Y1,sim is fully described.

Dimensional Analysis and Dimensionality Reduction
Since no dimensional analysis (d.a.) nor dimensionality reduction (d.r.) were carried out, one has $\{Y_{1,sim}\}^{after}_{d.a.,\,d.r.} = Y_{1,sim}$.

ANN-Based Analytical Model
Once the preprocessed input dataset {Y1,sim}n after (a 2 x Psim matrix) is determined, the next step is to present it to the proposed ANN to obtain the predicted output dataset {Y5,sim}n after (a 1 x Psim vector), which will be given in the same preprocessed format as the target dataset used in learning. In order to convert the predicted outputs to their 'original format' (i.e., without any transformation due to normalization or dimensional analysis), some postprocessing might be needed, as described in 4.3.3. Next, the mathematical representation of the proposed ANN is given, so that any user can implement it to determine {Y5,sim}n after. Arrays Wj-s and bs can be found online in Developer (2018c).

Output Data Postprocessing
In order to transform the output dataset obtained by the proposed ANN, {Y5,sim}n after (a 1 x Psim vector), to its original format (Y5,sim), one has Y5,sim = {Y5,sim}n after, since no output normalization nor dimensional analysis were adopted in the proposed model.

Performance Results
Finally, the results yielded by the proposed ANN for the 481-point final testing dataset (which includes the ANN learning/development counterpart), in terms of the performance variables defined in sub-section 3.4, are presented in this sub-section in the form of two graphs: (i) a regression plot (Fig. 14), where network target and output data are plotted, for each data point, as x- and y-coordinates, respectively; and (ii) a plot (Fig. 15) indicating (for all data) the (ii1) maximum error, (ii2) percentage of errors larger than 3%, and (ii3) average error (called performance).

Positive wmax (v = ]175, 250[ m/s)
The ANN feature methods used in the best combo from each of the abovementioned nine parametric SAs are specified in Table 11 (numbers represent the method number as in Tables 2-4). Table 12 shows the corresponding relevant results for those combos and the 208-point final testing dataset (which includes the ANN learning/development dataset), namely (i) maximum error, (ii) percentage of errors larger than 3%, (iii) performance (all described in sub-section 3.4 and evaluated for all learning data), (iv) total number of hidden nodes in the model, and (v) average computing time per example (including data pre- and post-processing). All results shown in Table 12 are based on target and output datasets computed in their original format, i.e. free of any transformations due to output normalization and/or dimensional analysis. Summing up the ANN feature combinations for all parametric SAs, a total of 219 combos were run for this problem. The proposed model is the one, among the best ones from all parametric SAs, exhibiting the lowest maximum error (SA 9, a Neural Network Composite (NNC)). Aiming to allow the implementation of this model by any user, all variables/equations required for (i) data preprocessing, (ii) ANN simulation, and (iii) data postprocessing are presented in the following sub-sections. The proposed model is an NNC made of 4 ANNs with MLPN architecture and a distribution of nodes/layer given by 2-3-3-3-1 for every network. Concerning connectivity, all networks are fully-connected, and the hidden and output transfer functions are all Logistic (eq. (7)) and Identity (eq. (10)), respectively. All networks were trained using the LM algorithm. After design, the average NNC computing time concerning the presentation of a single example (including data pre/postprocessing) is 4.08E-05 s (see Fig. 16).

ANN-Based Analytical Model
Once the preprocessed input dataset {Y1,sim}n after (a 2 x Psim matrix) is determined, the next step is to present it to the proposed NNC to obtain the predicted output dataset {Y5,sim}n after (a 1 x Psim vector), which will be given in the same preprocessed format as the target dataset used in learning. In order to convert the predicted outputs to their 'original format' (i.e., without any transformation due to normalization or dimensional analysis), some postprocessing might be needed, as described in 4.4.3. Next, the mathematical representation of the proposed NNC is given, so that any user can implement it to determine {Y5,sim}n after:

Output Data Postprocessing
In order to transform the output dataset obtained by the proposed NNC, {Y5,sim}n after (a 1 x Psim vector), to its original format (Y5,sim), i.e. without the effects of dimensional analysis and/or output normalization (possibly) applied in the target dataset preprocessing prior to training, one has

$Y_{5,sim} = \{Y_{5,sim}\}_{n\,after}$,  (48)

since no output normalization nor dimensional analysis were adopted in the proposed model.

Performance Results
Finally, the results yielded by the proposed NNC for the 208-point final testing dataset (which includes the ANN learning/development counterpart), in terms of the performance variables defined in sub-section 3.4, are presented in this sub-section in the form of two graphs: (i) a regression plot (Fig. 17), where network target and output data are plotted, for each data point, as x- and y-coordinates, respectively; and (ii) a plot (Fig. 18) indicating (for all data) the (ii1) maximum error, (ii2) percentage of errors larger than 3%, and (ii3) average error (called performance).

Critical velocities and maximum displacements predictions
Eleven pairs of curves were obtained as output of the ANN-based models described in sub-sections 4.1-4.4. Each pair presents the maximum negative (downward) and positive (upward) displacement predictions as a function of the load velocity (from 50 to 300 m/s in intervals of 5 m/s) for different values of the maximum distributed friction force fu, as depicted in Fig. 19 (two plots are presented for the sake of legibility). Note that the classic Winkler foundation case corresponds to the frictionless case (fu = 0). Comparing the homologous curves in Fig. 19, it is seen that the increase of the maximum frictional force per unit length (fu) leads, as expected, to the reduction of the displacement peaks. The existence of a critical velocity, that is, a velocity that induces the beam's highest displacements, is also clear in Fig. 19. It is observed that, for small values of fu, the value of the critical velocity is only slightly affected, whereas for larger frictional forces that value clearly rises.

Discussion
In future publications it will be guaranteed that the validation and testing data subsets are composed only of points where at least one variable (not necessarily the same one for all points) takes a value not taken by that same variable in the training subset. Based on very recent empirical conclusions by Abambres, the authors believe this will lead to more robust ANN-based analytical models concerning their generalization ability (i.e. prediction accuracy for any data point within the variable ranges of the design data).

Final Remarks
More versatile ANN-based analytical models for the same type of problem may follow from this study by including more independent variables, such as the foundation stiffness modulus, the applied load magnitude, and the geometrical/mechanical properties of the railway beam.
Regardless of the high quality of the predictions yielded by the proposed models, the reader should not blindly accept them as accurate for any other instances falling inside the input domain of the design dataset. Any analytical approximation model must undergo extensive validation before it can be taken as reliable (the more inputs, the larger the validation effort). Models proposed in the meantime are part of a learning process towards excellence.