An efficient machine learning approach for predicting concrete ...

13 Sep 2023
Introduction

The concrete composition considerably impacts concrete's mechanical and durability characteristics1,2. Beyond the mechanical properties, understanding the effect of concrete composition on durability remains challenging for researchers in this field3. The emergence of a new generation of concretes (including recycled, innovative, and sustainable materials) has increased this concern about the durability of concrete structures. The durability of reinforced concrete (RC) structures is considerably influenced by chloride penetration, especially in coastal and chloride-rich environments, which causes severe corrosion of steel reinforcement or steel fibers4. This corrosion phenomenon causes considerable economic loss and environmental impact in the construction industry. Chloride attack arises from various sources, such as de-icing salt, seawater, and groundwater. Different parameters substantially affect the chloride resistance of RC members, including concrete composition, temperature, relative humidity, and carbonation. Among these variables, concrete composition plays a major role in controlling the internal damage caused by chloride attack. Previous studies strongly emphasized that concrete composition directly affects the fresh, mechanical, and durability characteristics2,5. Concrete composition here means the water-to-binder ratio (W/B), water content, cement content, cement type, aggregate content, aggregate type, fillers, nanomaterials, mineral additives, and chemical admixtures. Each of these ingredients affects the concrete porosity. Achieving a reliable concrete mixture with the lowest permeability to chloride ions is one of the main practical ways to mitigate internal chloride attack damage. For instance, different supplementary cementitious materials (SCMs), fillers, and nanomaterials can be used to obtain an efficient concrete mixture resistant to chloride attack. However, as innovative new generations of concrete have been introduced in recent years, standards have recommended rapid experimental tests to determine the performance of these new cementitious composites against chloride attack, including salt ponding, bulk diffusion, and rapid chloride permeability tests. These tests are time-consuming (more than 28 days) and require an adequate budget for materials, researchers, and testing equipment. Hence, a practical solution, rather than an extensive experimental program, is needed to determine the efficiency of a proposed mixture for concrete members exposed to chloride attack.


Experimental tests consider different parameters to determine the chloride resistance of mixtures. As expressed in Eq. (1), Fick's second law and its analytical solution (Eq. (2)) are usually used to describe the chloride diffusion of concrete for the service life design of RC members6,7, where \(C(x,t)\) is the chloride ion concentration, \(t\) and \(x\) denote the time and the position from the exposed concrete surface, respectively, \({D}_{C}\) is the chloride diffusion coefficient, \({C}_{0}\) is the initial chloride concentration in the concrete, and \({C}_{S}\) is the apparent surface chloride concentration.

$$\frac{{\partial C\left( {x,t} \right)}}{\partial t} = D_{C} \frac{{\partial^{2} C\left( {x,t} \right)}}{{\partial x^{2} }}$$

(1)

$$C\left( {x,t} \right) = C_{0} + \left( {C_{S} - C_{0} } \right)\left[ {1 - {\text{erf}}\left( {\frac{x}{{2\sqrt {D_{C} t} }}} \right)} \right]$$

(2)
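Eq. (2) is straightforward to evaluate numerically. Below is a minimal Python sketch using SciPy's error function; the diffusion coefficient, concentrations, and exposure time in the example are illustrative assumptions, not values from the study.

```python
import numpy as np
from scipy.special import erf

def chloride_concentration(x, t, D_c, C_0, C_s):
    """Analytical solution of Fick's second law, Eq. (2).

    x   : depth from the exposed surface (m)
    t   : exposure time (s)
    D_c : chloride diffusion coefficient (m^2/s)
    C_0 : initial chloride concentration in the concrete
    C_s : apparent surface chloride concentration
    """
    return C_0 + (C_s - C_0) * (1.0 - erf(x / (2.0 * np.sqrt(D_c * t))))

# Illustrative profile after 10 years for an assumed D_c of 5e-12 m^2/s
t = 10 * 365.25 * 24 * 3600             # 10 years in seconds
depths = np.linspace(0.0, 0.05, 6)      # 0 to 50 mm
print(chloride_concentration(depths, t, D_c=5e-12, C_0=0.0, C_s=0.6))
```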

Among these empirical parameters, the chloride diffusion coefficient plays a critical role in characterizing the chloride resistance of concrete mixtures. Different parameters affect \({D}_{C}\) as a time-dependent material characteristic, including concrete composition, curing conditions, chloride exposure time, and exposure location8. Many experimental test methods have been used to determine the chloride diffusion coefficient. As recommended by ASTM C1556–119 and NT Build 44310, bulk diffusion tests are time-consuming tests in which concrete samples are exposed to a chloride solution for a long period. In this context, the Nordic standard NT Build 49211 proposed an accelerated test method, in which chloride ions penetrate the concrete at high rates driven by an applied electric field. The chloride diffusion coefficient measured by this method is denoted the “non-steady-state migration coefficient (\({D}_{nssm}\))”. Standard criteria for the classification of chloride penetration resistance of concrete based on the NT Build 49211 method, compared to ASTM C120212, are given in Table 1. ASTM C120212 defines five categories of chloride ion penetrability based on the passed electric charge (PEC), while \({D}_{nssm}\) is used in NT Build 49211. Although the literature has repeatedly confirmed the accuracy of this method, the test procedure requires a skilled operator and precise resources and facilities. Even though the experimental results of this method are precise, conducting the test for every concrete mixture in RC projects is neither feasible nor practical because of the long immersion period and the costs. Hence, an efficient and practical solution is needed to determine the chloride resistance of the concrete mixture for each project. Accordingly, it is crucial to develop a feasible and accurate predictive model that specifies the chloride diffusion coefficient using the concrete composition.

Table 1 Standard criteria for the classification of chloride penetration resistance of concrete.


In the last few years, the use of artificial intelligence (AI) to predict the mechanical and durability characteristics of cementitious composites has gained significant momentum. However, each study considered only specific types of concrete, whereas a comprehensive dataset is required for a unified AI model. Additionally, the size of the dataset is another critical issue in these investigations. For instance, Amin et al. (2023)14 presented an effective method for predicting the strength of sustainable concrete utilizing rice husk ash (RHA) as a supplementary cementitious material using machine learning (ML) algorithms. Hosseinzadeh et al. (2023)15 used ML techniques (both regression and classification algorithms) to forecast the mechanical properties of fly ash-based recycled aggregate concrete (FARAC). Jiao et al. (2023)16 employed ML to predict the compressive strength of concrete with carbon nanotubes as nanomaterials. In this context, Zheng et al. (2023)17 developed an ML model to predict the compressive strength of silica fume concrete. Ullah et al. (2022)18,19 used ML methods to predict the dry density and compressive strength of lightweight foamed concrete. In this field, Khan et al. (2021)20 used an ML method to formulate the depth of wear of fly-ash concrete. Song et al. (2021)21 presented an ML model to predict the compressive strength of concrete with fly ash admixture. Farooq et al. (2021)22 used an ML method to formulate the strength of self-compacting concrete (SCC). Ahmad et al. (2021)23 employed an ML method to forecast the compressive strength of concrete at high temperatures considering 207 experimental datasets. Ahmad et al. (2021)24 presented an ML model along with gene expression programming (GEP) and an artificial neural network (ANN) to predict the compressive strength of recycled coarse aggregate (RCA)-based concrete. Farooq et al. (2021)25 employed ML algorithms with individual learners and ensemble learners (bagging, boosting) to forecast the mechanical properties of high-performance concrete using waste materials. Ahmad et al. (2021)26 used ML to predict the compressive strength of fly ash-based concrete. In this field, Khan et al. (2021)27 used soft-computing techniques, including random forest regression and gene expression programming, to present a prediction model for the compressive strength of fly ash-based geopolymer concrete. Other studies in the literature have also used AI methods to predict the mechanical characteristics of cementitious composites28,29.

Accordingly, there has also been a growing interest in using AI techniques for predicting chloride diffusion coefficients, such as ANN, fuzzy inference systems, and ML methods. Table 2 comprehensively summarizes all available AI models introduced by researchers for the chloride diffusion coefficient. Thirty-four predictive models have been proposed in the literature, considering different input variables, output factors, dataset sizes, and concrete types. Water-to-binder ratio (W/B), water content, cement content, fly ash (FA) content, silica fume (SF) content, ground granulated blast-furnace slag (GGBFS) content, superplasticizer dosage (SP), fine aggregate content (sand), coarse aggregate content, submerged duration time, curing age, air-entraining admixture, viscosity modifying admixture dosage (VMA), testing age, concrete compressive strength (\({f}_{c}\)), specific surface of FA, environmental conditions, diffusion rate, etc., are the input variables considered in the predictive models reported in the literature. Different output variables were also considered in these studies, such as the non-steady-state migration coefficient and the surface chloride concentration (Table 2). Along with obtaining a predictive model for the chloride resistance of concrete, previous AI studies reported some critical and conflicting findings based on their soft computing models. Generally, two different types of AI models have been presented in the literature: (a) models for only a specific type of concrete mixture; and (b) integrated models with no limitation on concrete type. Although most of the models developed by previous studies concentrate on the first group, the literature has recently introduced some valuable models for the second category. Regarding the first model type, Parichatprecha & Nimityongskul (2009)30 used the ANN method to predict chloride ion permeability using 86 datasets and 8 input variables for high-performance concrete (HPC) mixtures with compressive strengths ranging from 30 to 120 MPa. They reported that the optimum cement content for chloride-resistant HPC is 450–500 kg/m3. Also, based on their model, the chloride resistance of HPC can be significantly improved using both SF and FA (around more than 20% cement replacement). Song & Kwon (2009)31 slightly enlarged the dataset (to 120 records) of HPC specimens and added GGBFS along with other SCMs to the input variables to predict the chloride ion diffusion coefficient using the ANN method. They also used the parameter “duration time in submerged condition” instead of “superplasticizer” among the input parameters. As the precision of soft computing methods in AI is extremely dependent on the size of the database, Hodhod & Ahmed (2013)32 used 300 datasets of HPC mixtures for the ANN method. They considered 5 input variables: W/B, cement, FA, GGBFS, and curing age. Regarding self-consolidating concrete (SCC) mixtures, extensive AI models were also presented in the literature33,34,35. In this field, Ghafoori et al. (2013)36 applied statistical and ANN models to their limited experimental database (only 24 SCC mixtures) to present a predictive model for the rapid chloride penetration test (RCPT) value, choosing 6 input features: cementitious materials, W/B, coarse aggregate, fine aggregate, air-entraining admixture, and high-range water reducer (HRWR).
Although a general trend cannot be extracted from their limited database, they reported that the three independent parameters of cementitious materials content, W/B ratio, and coarse or fine aggregate are essential to predict the RCPT value of SCC. Accordingly, Mohamed et al. (2018)37 used 86 datasets of SCC mixtures to present a predictive ANN model for the chloride penetration level using 11 comprehensive input variables: W/C ratio, cement, GGBFS, FA, SF, water, superplasticizer, coarse aggregate, fine aggregate, age, and charge. Najimi et al. (2019)38 collected 72 datasets of SCC mixtures and used a feed-forward ANN combined with an artificial bee colony algorithm to develop a chloride penetration model. They added air-entraining admixture and HRWR to their input variables. Kumar et al. (2020)39 gathered 360 datasets of SCC mixtures to predict chloride penetration using multivariate adaptive regression splines (MARS) and minimax probability machine regression (MPMR). They considered only three input variables: FA, SF, and temperature. In this context, the model presented by Yuan et al. (2022)34 revealed that temperature and SF considerably affect model performance. In the field of high-strength concrete (HSC), Inthata et al. (2013)40 used 270 datasets of mixtures containing ground pozzolans to predict chloride penetration and chloride penetration depth. They were the first to add the actual concrete compressive strength to the input parameters of ANN models. Their model revealed that 40% cement replacement by ground pozzolans shows the highest resistance to chloride penetration, especially for concrete mixtures containing rice husk ash. In line with this research, Boğa et al. (2013)41 used 162 datasets of concrete containing GGBFS and calcium nitrite-based corrosion inhibitors to develop a predictive model for chloride ion permeability using a four-layered ANN and an adaptive neuro-fuzzy inference system. They considered cure type, curing period, GGBFS ratio, and calcium nitrite-based corrosion inhibitor ratio as independent variables. Because of the dataset's limited range of concrete types, no general conclusion can be drawn from their study. Regarding concrete containing high-calcium FA, Marks et al. (2015)42 gathered 56 datasets and used an ML method to predict the concrete resistance to chloride penetration. Their input parameters were water, cement, high-calcium FA, and the specific surface of fly ash. Their model showed that, to obtain a reliably chloride-resistant FA concrete, the water content should be controlled in the range w ≤ 158 L/m3. However, they emphasized that, due to their limited dataset, the model should be checked against further experimental data. In the context of limited datasets, Slika & Saad (2016)43 used an Ensemble Kalman Filter (EnKF) on a limited dataset of normal concrete (NC) to predict the chloride concentration. Regarding SF concrete, 162 datasets were collected by Asghshahr et al. (2016)44, and ANN, along with classification and regression tree (CART) methods, were utilized to predict the chloride concentration. Only 4 input variables were considered in their models: environmental condition, penetration depth, W/B ratio, and SF. Their models showed that the penetration depth is the most vital factor impacting the chloride concentration. However, they could not identify a critical role of SF in their model. The literature has also used AI models to predict chloride ion diffusion. For instance, Hoang et al. (2017)45 considered an ML model with multi-gene genetic programming (MGGP) and multivariate adaptive regression splines (MARS) to predict the chloride ion diffusion of cement mortars using mortar age, depth of the measured point, diffusion dimension, and the presence of reinforcement as input variables. However, along with ML methods, valuable soft computing models have still been developed using the ANN method on different types of concrete. In the field of AI techniques, genetic programming (GP) was also used by Gao et al. (2019)46 on 25 experimental databases of NC to predict chloride-ion diffusion. They found that the GP method was more efficient than the ANN model; however, given their limited dataset, further studies should verify this finding. Some previous studies also concentrated on predictive AI models of chloride resistance for recycled aggregate concrete (RAC). Ahmad et al. (2021)47 employed ML models to predict the surface chloride concentration in concrete containing waste material, including gene expression programming (GEP), the decision tree (DT), and an ANN. In this field, Liu et al. (2022)48 used an ensemble ML method to predict the PEC, gathering 226 datasets. Their soft computing technique found that water content is the most vital parameter controlling the chloride ion permeability of RAC. In this context, Jin et al. (2022)49 developed an ANN model to predict the chloride diffusion coefficient using 112 datasets. They considered 9 input variables: cement, water, coarse aggregate, sand, water-reducing agent, recycled aggregate replacement rate, W/C ratio, water absorption rate of coarse aggregate, and apparent density of coarse aggregate. However, they ignored two important parameters, recycled fines and mineral admixtures, due to the lack of an appropriate dataset, and also neglected the effect of cement type. Their model demonstrated that the natural fine aggregate content, along with recycled coarse aggregate, has the most substantial influence on the chloride ion penetrability of RAC. Regarding metakaolin-containing concrete, Alabdullah et al. (2022)50 gathered 201 datasets to develop an ML model for the chloride migration coefficient. They used concrete age, binder content, W/B ratio, metakaolin, sand, coarse aggregate, and concrete compressive strength as input features. Their model revealed that concrete compressive strength is the most critical parameter in predicting the chloride migration coefficient, whereas aggregate content showed the minimum importance. It was also deduced from their ML model that metakaolin has a vital influence on chloride penetration resistance, with an optimum dosage of 15%. In this regard, Amin et al. (2022)51 used the same dataset and input features of metakaolin-based concrete to develop a GEP model for predicting the RCPT value. Their model showed that concrete age is the most noteworthy factor, along with aggregate content.

Table 2 Artificial intelligence (AI) techniques used in the literature for predicting concrete chloride resistance.


Regarding the second category (models with no limitations), several studies gathered data for many types of concrete in a single database47,52,53,54. For instance, Delgado et al. (2020)52 used 243 datasets without concrete type limitations to develop ANN models of chloride penetration depth and diffusion coefficient. They were the first to consider cement type as an additional input variable. Based on their model, curing time is the most important input feature for the chloride penetration ANN model. In this field, Cai et al. (2020)53 collected 642 datasets for different types of concrete to predict the surface chloride concentration using an ensemble ML method. Twelve input variables were considered in their model: cement, water, FA, GGBFS, SF, superplasticizer, fine aggregate, coarse aggregate, exposure time, annual mean temperature, chloride content in seawater, and exposure type. Their predictive model showed that the exposure condition (i.e., tidal, splash, and submerged zones) and the W/B ratio appear to be the most important factors affecting the surface chloride concentration. Ahmad et al. (2021)47 gathered a dataset of concrete containing waste material to predict surface chloride concentrations using gene expression programming (GEP). Following these studies, Liu et al. (2021)4 collected 653 datasets of different concrete types to develop an ANN model for predicting the chloride diffusion coefficient. They considered the concrete compressive strength and curing mechanism in addition to the common input variables. Tran et al. (2022)57 used an ensemble decision tree boosted (EDT Boosted) model to predict the surface chloride concentration considering 386 datasets. Their model indicated that the fine aggregate content is the key parameter influencing Cs. Tran (2022)58 concentrated on 127 data records of concrete containing SCMs to develop an ML model for the chloride diffusion coefficient. They did not use temperature and curing time as input parameters. It can be deduced from their model that water content and FA dosage have, respectively, the most and the least effect on the chloride diffusion coefficient prediction. They reported that GGBFS content also has a low impact on the output. Based on their model, water content and W/B ratio have a negative relationship with the chloride diffusion coefficient. Guo et al. (2022)59,60 developed ML and fuzzy logic system methods on two datasets (366 and 495 records) to predict the surface chloride concentration without considering a specific type of concrete. Their models indicated that Cs increases with the W/B ratio. Moreover, mineral admixtures and the W/B ratio affect the surface chloride concentration; based on their model, FA and GGBFS cause higher and lower surface chloride concentrations, respectively. Finally, the most recent study in this field was conducted by Taffese & Espinosa-Leal (2022)55, where the ML method was used to predict the migration coefficient of different types of concrete in a single dataset. Although they gathered an extensive experimental database, four separate models were used for different input variable groups due to some missing input variables.
Each group has fewer than 200 datasets: (1) the first group, with 134 datasets, contains the W/B ratio, cement, slag, FA, SF, fine aggregate, coarse aggregate, superplasticizer, migration test age, and cement type; (2) the second group uses 131 datasets with the same input variables as the first group plus fresh density; (3) the third group has 176 datasets with the same input variables as the first group plus the additional parameters of compressive strength test age and compressive strength; and (4) the fourth group, with 91 datasets, has all input variables, showing that only a limited number of studies reported both concrete compressive strength and fresh density. Hence, they could not use the ML method on all 834 datasets due to the huge amount of missing input data. Similarly, Taffese & Espinosa-Leal (2022)61 used only the final 204 datasets for the ML method, considering the input variables of W/B ratio, cement, water, slag, FA, SF, fine aggregate, coarse aggregate, total aggregate, superplasticizer, air-entraining agent, and migration test age. Their proposed model showed that binder content and aggregate content considerably affect the predictive model. They also found that cement type has no impact; it is considered a weak predictor for classifying concrete chloride resistance and can be removed from the model. In this field, Tran et al. (2022)56 used the ML technique to predict chloride content based on 404 datasets. Based on their model, the exposure condition and the depth of measurement are the most important parameters for the prediction of chloride content. In this context, Golafshani et al. (2022)8 developed a marine creatures-based metaheuristic artificial intelligence model to predict the apparent chloride diffusion coefficient using 216 datasets. Their model indicated that the exposure time and curing conditions considerably control the performance of the predictive model.


One of the main criteria for evaluating the performance of an AI model is simplicity. Accordingly, the number of input variables and their accessibility should be considered in a predictive model. Hence, based on the experience of other proposed models and SHapley Additive exPlanations (SHAP) results, some input variables can be ignored when predicting the chloride diffusion coefficient. The predominant input variables reported by each study on AI methods are summarized in Table 3. It clearly shows that, depending on the number of datasets considered, different parameters were found by the AI methods to have the most influence on the predictive models of chloride resistance. Water, cement, W/B ratio, SCMs, chemical admixtures, aggregate content, and exposure conditions were repeatedly considered in the literature as input variables in AI methods. However, there are differences of opinion about whether or not to consider some parameters. For instance, regarding cement type, although Delgado et al. (2020)52 reported its critical role as a predictor of chloride penetration in concrete, Taffese et al. (2022)55,61 showed its lack of influence as a predictor.

Table 3 Most vital features reported by AI models for predicting concrete chloride resistance.


Moreover, most of the datasets did not precisely report the cement type used in the experimental program49. Accordingly, most developed AI models ignored this variable as an input feature. Also, only a few studies used concrete compressive strength as an input variable in AI methods4,40,50,51. Similarly, only Taffese et al. (2022)55 considered concrete density as one of the input variables to predict chloride resistance. Temperature was also ignored by most studies, although several models considered it an input feature34,35,39,46,47,53,57,60; this may be due to missing data in the experimental research. As shown in Table 3, there is no clear trend for the most important predictor in the models presented in the literature, and different input variables were reported. However, it can be deduced from Table 3 that cement content, SCM contents (FA and SF), curing time, and aggregate contents (sand and coarse) should be considered in predictive models of concrete chloride resistance. Although temperature, penetration depth, exposure time, and exposure condition were also found to be vital factors affecting the predictive models, they cannot be considered definite input parameters due to the lack of data in the experimental datasets. The conflicting trends obtained by previous predictive AI models (Table 3) can be attributed to several critical reasons, including (1) the number of datasets collected for each model; (2) the types of concrete considered; (3) the input variables selected for each model; (4) the AI methods used to obtain a predictive model; and (5) the existence of missing input data in some experimental programs. Hence, although the literature presents valuable predictive models, they may not be well-suited to serve as reliable chloride resistance models for all experimental databases covering different types of concrete.

Research significance

As reviewed in this section, introducing a unique model for each concrete type is impractical. Accordingly, a significant research gap should be filled by developing a novel AI method that uses all available datasets. Additionally, due to the lack of consistency and the existence of missing data in the mixture details reported in the literature, most previous predictive AI models ignored these valuable experimental datasets, causing a significant reduction in the accuracy and comprehensiveness of the existing models for all types of concrete. Accordingly, the present study uses the ANN method as a dataset-arranging technique to predict the missing details in the datasets used for the ML models. In other words, the ANN method helps to prepare an accurate and efficient dataset for ML input and prevents the deletion of valuable data. Furthermore, while prior studies have examined the forecasting of concrete's chloride migration coefficient, there has been a lack of research on the integration of regression and classification tasks, and few studies have developed ANN and ML techniques to construct a robust predictive model using a comprehensive dataset comprising over 1000 data points. Hence, the present study intends to address this issue by pursuing the following objectives:

1. Developing a unique ML method to predict the chloride diffusion coefficient for all types of concrete.

2. Introducing a developed dataset-cleaning technique (DCT) to cover the missing data of all experimental works using ANN.

3. Comparing classification algorithms with regression algorithms in the supervised learning method.

4. Finding the most and least influential parameters controlling the predictive model for the chloride diffusion coefficient.

To achieve these objectives, a comprehensive experimental database containing 1073 datasets was gathered in the present study to predict \({D}_{nssm}\) using 12 features: water content, cement, slag, FA, SF, fine aggregate, coarse aggregate, superplasticizer, fresh density, compressive strength test age, compressive strength, and migration test age (Fig. 1). Due to missing data for fresh density and compressive strength, previous studies could not practically use all datasets in a single model, so the ML method was applied to four separate groups. The present study solves this issue by using the ANN method to predict the missing data through a precise approach. Linear regression variants (elastic net, lasso, ridge), decision tree, random forest, boosting algorithms, support vector machine, and the k-nearest neighbors algorithm were considered in the present study for the regression method. Regarding the classification method, different algorithms were also selected for predicting \({D}_{nssm}\), including support vector machine, random forest, LightGBM, XGBoost, logistic regression, k-nearest neighbors (KNN), and decision tree.

Figure 1

Number of datasets used in the literature as compared to the present study.


Machine learning method (ML)

Generally, the ML method is the branch of AI that deals with the application of algorithms that allow computers to discover patterns in an experimental database. The main intention of the ML method is to automatically learn to identify complicated patterns (relations between variables) and then make informed judgments based on the datasets. The dataset is a set of logical (laboratory) records with unique characteristics called machine input data or features; each of these input data items is also called a predictor. The intelligent decision that the machine should make after learning from these data constitutes a prediction model for a particular output (or target). In other words, the machine reaches intellectual maturity by finding a logical connection between the input data and the results, providing a logical model of a real physical phenomenon. The main problem arises when the set of all possible behaviors given all potential inputs is too enormous to be covered by the set of observed examples (training data). Accordingly, the learner should generalize from the training data to produce useful target predictions for new conditions outside the datasets. Pattern recognition, commonly accompanied by classification, is the most popular use case for the ML method. Although the quantity and quality of the datasets play a significant role in the training process, selecting appropriate features or input variables (unique dataset characteristics) also considerably affects the ML model's performance.

The present study comprehensively assesses the potential of using regression and classification algorithms in the ML method to predict the non-steady-state migration coefficient (\({D}_{nssm}\)) without any limitation on the concrete type or missing input data. Using such a substantial experimental database (1073 datasets) in the ML method is critical to achieving a unified model. A practical DCT is used in the present study to compensate for the missing data in the literature. The figures and machine learning models utilized in this study were implemented in the Python programming language62, while the ANN model in Figure S1 was generated using MATLAB (2019b)63. The flowchart of the ML methodology proposed in the present study is shown in Fig. 2. The proposed ML method consists of three main steps: (1) the data-cleaning technique; (2) data visualization; and (3) the ML models. Each of these steps is explained in the following subsections.

Figure 2

Flowchart presenting the methodology of the machine learning (ML) technique proposed in the present study.


Dataset-cleaning technique (DCT)

Experimental datasets used in the present novel ML models after outlier removal are summarized in Table 4. In total, 24 research papers containing 1073 datasets were collected in the present study. As shown in Fig. 3a, the gathered experimental databases have three missing input features for some datasets: fresh density, compressive strength test age, and concrete compressive strength. To address this issue, Taffese et al. (2022)55 divided the datasets into four separate groups, each with an experimental database of fewer than 200 datasets. However, this method cannot be efficient, as the ML method is considerably affected by the quality and quantity of the dataset; moreover, achieving a reliable unified model requires using all datasets. Accordingly, a feed-forward back-propagation network with the Levenberg–Marquardt training function was used in the present study to predict the missing data, using different hidden layers for each dataset group depending on the missing parameter type (Fig. 3b). The accuracy of the ANN prediction is illustrated in the supplementary file (Figure S1). For this ANN method, the water-to-binder (W/B) ratio, cement content, aggregate content, mineral admixtures, chemical admixtures, and \({D}_{nssm}\) were considered as the input layer to predict the missing values of concrete compressive strength and fresh density. Given the adequate accuracy of the ANN method, the problem of missing data was entirely solved, and accordingly, the complete datasets were used in the ML method.
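For readers who want to reproduce the idea, the imputation step can be sketched as follows. The study trained a feed-forward network with the Levenberg–Marquardt algorithm in MATLAB; scikit-learn does not offer that optimizer, so the sketch below substitutes the quasi-Newton "lbfgs" solver, and all column names are hypothetical.

```python
import pandas as pd
from sklearn.neural_network import MLPRegressor

def impute_with_ann(df, predictors, target, hidden=(10,)):
    """Fit an ANN on rows where `target` is known, then fill the missing rows.
    Assumes the predictor columns themselves are complete."""
    known, missing = df[df[target].notna()], df[df[target].isna()]
    model = MLPRegressor(hidden_layer_sizes=hidden, solver="lbfgs",
                         max_iter=5000, random_state=0)
    model.fit(known[predictors], known[target])
    df.loc[missing.index, target] = model.predict(missing[predictors])
    return df

# Hypothetical usage: fill missing fresh density from mix features and D_nssm
# df = impute_with_ann(df, ["w_b", "cement", "aggregate", "mineral_admixture",
#                           "chemical_admixture", "D_nssm"], "fresh_density")
```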

Table 4 Experimental datasets used in the present novel ML models after outlier removal.


Figure 3

Dataset-cleaning technique (DCT) procedure: (a) different categories of literature regarding missing data; (b) Using the feed-forward back propagation network and the Levenberg–Marquardt training function for reproducing missing data.


Data preprocessing

Generally, outliers are the few parts of a dataset that are meaningfully different from the rest of the database. They are usually anomalous observations that deviate from the data distribution and are commonly caused by inconsistent data entry or inaccurate observations. A justified reason is necessary for removing each outlier. One of the main ways to decide on the removal of outlier datasets is to run the analysis with and without the skewed datasets and explain the differences. Descriptive statistics of the datasets considered in the present study before removing the outliers are summarized in Table 5. Regarding the output parameter (\({D}_{nssm}\)), a range of 0.22 to 133.6 (× 10−12) m2/s was gathered in the present study. However, only a few datasets are in the range of \({D}_{nssm}>50\) and were accordingly considered outliers. A further outlier-detection analysis is shown in Fig. 4 for each input variable. For superplasticizer (SP) dosage, SP values higher than 6 kg/m3 were removed from the datasets. A range of 200–1500 kg/m3 was kept as the main database for fine aggregates, while experimental data outside this range were considered outliers (Fig. 4). Concrete compressive strengths higher than 67 MPa were removed due to the high concentration of datasets at \({f}_{c}\le 67\) MPa. Statistical analysis confirmed that SF contents higher than 40 kg/m3 should be removed as outliers. Most of the datasets had migration test ages below 2000 days, so ages beyond this period were considered outliers (Fig. 4). Only one dataset in the entire experimental database has a slag content higher than 400 kg/m3, and it was accordingly removed from the ML analysis as outlier data. Similarly, only one dataset has a cement content higher than 510 kg/m3, which was removed from the ML method. A few datasets were flagged as outliers for FA contents higher than 400 kg/m3. Although most experimental databases tested concrete compressive strength at ages below 100 days, three measured the samples' compressive strength at more than 175 days and were removed as outliers. Experimental datasets with coarse aggregate outside the 200–1300 kg/m3 range were selected as outliers (Fig. 4). For water content, the inlier range is about 50–250 kg/m3. However, no dataset was removed as an outlier for the fresh density of the concrete samples. The analysis of the dataset for each input parameter before and after removing outliers is also illustrated in Fig. 4. Finally, a total of 965 datasets remained after removing the outliers. As mentioned in Table 1, after data preprocessing, the output parameter \({D}_{nssm}\) was divided into five categories with data encoding ranging from 0 to 4 based on the recommendation of NT Build 49213, such that a higher encoded value indicates higher chloride penetration resistance of the concrete mixture, with negligible chloride ion penetrability at the top class.
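A minimal pandas sketch of this threshold-based screening is given below; the thresholds are the ones stated above, while the column names are hypothetical.

```python
import pandas as pd

def remove_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the inlier ranges described in the text (units in comments)."""
    mask = (
        (df["D_nssm"] <= 50)                          # x 1e-12 m^2/s
        & (df["superplasticizer"] <= 6)               # kg/m^3
        & df["fine_aggregate"].between(200, 1500)     # kg/m^3
        & (df["fc"] <= 67)                            # MPa
        & (df["silica_fume"] <= 40)                   # kg/m^3
        & (df["migration_test_age"] <= 2000)          # days
        & (df["slag"] <= 400)                         # kg/m^3
        & (df["cement"] <= 510)                       # kg/m^3
        & (df["fly_ash"] <= 400)                      # kg/m^3
        & (df["fc_test_age"] <= 175)                  # days
        & df["coarse_aggregate"].between(200, 1300)   # kg/m^3
        & df["water"].between(50, 250)                # kg/m^3
    )
    return df[mask]
```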

Table 5 Descriptive statistics of the input and target variables used in the ML models before outlier removal consideration.


Figure 4

The procedure of outlier detection (the left and middle panels show the dataset with outliers; the right panel represents the dataset without outliers).


Data visualization

To visually represent the datasets, this section utilizes several plots, including the distplot, heatmap, and joint plot with kernel density estimation (KDE), employing Seaborn, a Python library developed specifically for data visualization. The distplot, or distribution plot, is used to characterize data in histogram form; it represents a univariate set of gathered data, describing the distribution of each variable. As shown in Fig. 5, the distribution plot of water content shows that most of the datasets have water contents in the 160–180 kg/m3 range. However, this domain is larger for cement content, ranging from 200 to 500 kg/m3. The common dosages of slag, FA, and SF used in the literature to study \({D}_{nssm}\) were 100–150 kg/m3, 50–100 kg/m3, and 20–30 kg/m3, respectively (Fig. 5). Regarding fine aggregate content, the literature in the field of \({D}_{nssm}\) measurement mostly used 600–1000 kg/m3, with the highest concentration around 800 kg/m3, while no specific range was followed for the coarse aggregate content. Most of the literature used SP dosages of 2–4 kg/m3 as the chemical admixture. The datasets used in the present study fall into two separate groups with different densities: (1) the majority of datasets, in which the concrete samples have a fresh density of 2200–2400 kg/m3; and (2) a smaller portion of samples with densities ranging from 1800 to 2000 kg/m3 (Fig. 5). Also, very few of the samples are lightweight concrete (fewer than 25 datasets) with fresh densities lower than 1600 kg/m3. Most of the datasets measured the 28-day compressive strength, with \({f}_{c}\) ranging from 35 to 60 MPa for the majority of samples. Finally, the distplot curves confirm that most of the concrete samples tested in the literature have \({D}_{nssm}<50\).
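The plots named above map directly onto Seaborn calls. The following sketch uses a small synthetic stand-in frame (the real study plots the cleaned experimental dataset); note that distplot has been deprecated in recent Seaborn releases in favor of histplot/displot.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in frame with a hypothetical encoded class column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "water": rng.normal(170, 15, 300),
    "cement": rng.normal(350, 60, 300),
    "D_nssm": rng.gamma(2.0, 5.0, 300),
})
df["D_nssm_class"] = pd.cut(df["D_nssm"],
                            bins=[0, 2, 4, 8, 16, float("inf")], labels=False)

# Distribution plot of one variable (cf. Fig. 5)
sns.histplot(df["water"], kde=True)
plt.show()

# Pairwise scatter/KDE matrix colored by the encoded class (cf. Fig. 6)
sns.pairplot(df, hue="D_nssm_class")
plt.show()

# Bivariate KDE jointplot of one feature against D_nssm (cf. Fig. 8)
sns.jointplot(data=df, x="cement", y="D_nssm", kind="kde", fill=True)
plt.show()
```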

Figure 5

Distribution plot of variables.


To illustrate the relations between variables within the dataset and reveal valuable details, the pairplot is shown in Fig. 6. This plot provides a clear overall picture of the data. Both scatter plots and kernel density estimation (KDE) functions are shown in the pairplot: the diagonal cells are the KDE distribution curves, representing the distribution of each parameter, while the scatter plots in the other cells show the relationships between variables. Different colors are used to distinguish the output parameter classifications (\({D}_{nssm}\)). For instance, the water content distribution plot for the four classifications (based on the Table 1 data encoding) of \({D}_{nssm}\) is depicted in the first row and column of the pairplot, showing that the water content distribution is near normal for \({D}_{nssm}\) classifications of 3 and 4. Regarding concrete compressive strength, all \({D}_{nssm}\) classifications show a normal distribution. Although finding an appropriate justification for each parameter in combination with other variables is complicated, valuable findings can be revealed by the pairplot (Fig. 6). For instance, it can be deduced from this plot that, for higher water contents, samples with higher fresh density can have high chloride resistance. The plot also shows favorable relations between different powders (SCMs): samples containing both FA and slag show data encodings of 3 and 4, corresponding to high chloride resistance, and a similar trend was found for concrete containing FA combined with slag (or SF). The data visualization from the pairplot also indicates that FA has a higher impact on the chloride resistance of concrete samples with \({f}_{c}>40\) MPa. Also, FA is more efficient in reducing chloride permeability for samples containing a high content of water and a low content of cement (Fig. 6). Moreover, the pairplot shows that a good relationship exists between the fine aggregate content and fresh density, so using a high content of sand along with a high density leads to better chloride-resistant concrete. The pairplot also shows that cement and SCM contents modulate the effect of coarse aggregate on the chloride resistance of concrete. Based on the data visualization results, fresh density also controls the influence of coarse aggregate on \({D}_{nssm}\). As depicted in Fig. 6, SP content has a meaningful impact on the chloride resistance of concrete with a high W/B ratio. Coarse aggregate content, fresh density, and compressive strength similarly amplify the effect of SP on \({D}_{nssm}\). Generally, the results indicate that the ternary relation of SP, fresh density, and compressive strength should be considered in concrete chloride resistance. The direct effect of fresh density on the concrete compressive strength is also clearly highlighted in the pairplot, which was notably ignored in the literature on AI methods.

Figure 6

Pairplot of variables.


To quantify the relationships between pairs of variables, the heatmap plot is shown in Fig. 7. The Pearson correlation coefficient (r) is a statistical measure used to quantify the linear relationship between two continuous variables, X and Y. It provides a numerical value that indicates the strength and direction of the linear association between the variables. The calculation of the Pearson correlation coefficient involves several steps.

Figure 7

Heatmaps plot of variables.


For each pair of values (Xi, Yi) in the dataset, the formula subtracts the mean of X from each Xi and the mean of Y from each Yi; this step represents the deviation of each data point from its respective mean. The formula then multiplies the deviations of X and Y for each data point and sums these products across the dataset, capturing the covariation between X and Y. The resulting sum of products is divided by the product of the standard deviations of X and Y, where the standard deviation of X is calculated by summing the squared deviations of X from its mean and taking the square root, and similarly for Y. The cell colors encode these correlation values, ranging from −1.0 (a perfect negative linear correlation between two variables) to 1.0 (a perfect positive linear correlation); a value of 0 designates no linear correlation between the two features. The heatmap plot thus also reveals the (in)dependence of the variables. For instance, the first row of this plot shows that water content is positively correlated with cement content, while it is negatively correlated with slag content. From the second row, it can be deduced that the cement content has a strong positive correlation with the water content (+0.57), along with negative correlations with the slag content (−0.68) and the fresh density (−0.43). The heatmap also shows that coarse aggregate content is negatively correlated with cement content (−0.41) and significantly positively correlated with the fresh density of the mixtures, which is a significant finding for the coarse aggregate content in predicting \({D}_{nssm}\). As shown in Fig. 7, SP dosage has a positive correlation (+0.51) with the concrete compressive strength. SF content, fine aggregate, and SP dosage were also found to have notable correlations with the concrete compressive strength variable.
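Written out, these steps correspond to the standard Pearson formula:

$$r_{XY} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - \overline{X}} \right)\left( {Y_{i} - \overline{Y}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - \overline{X}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {Y_{i} - \overline{Y}} \right)^{2} } }}$$

In code, the correlation matrix behind a heatmap such as Fig. 7 reduces to two calls; the following is a minimal sketch on a synthetic numeric frame:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)),
                  columns=["water", "cement", "slag", "fc"])

corr = df.corr(method="pearson")  # pairwise Pearson r for all feature pairs
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```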

The kernel density estimation (KDE) jointplot for all variables against \({D}_{nssm}\) is shown in Fig. 8. The KDE is a non-parametric approach showing the probability density of an independent variable. This plot combines two elements: (1) a bivariate figure indicating how the dependent parameter (\({D}_{nssm}\)) changes with variations of the independent variables; and (2) marginal distribution plots located along the edges of the bivariate graph to display the distributions of the independent factors. The KDE jointplot shows that the most chloride-resistant concrete mixtures (\({D}_{nssm}\) classifications of 3 and higher) have cement contents ranging from 300 to 400 kg/m3 and water contents of 150–180 kg/m3 (Fig. 8). Among the SCMs, slag content (maximum 50 kg/m3) shows a better correlation compared to FA and SF. Moreover, it can be deduced from the KDE jointplot that highly resistant concrete mixtures have fine and coarse aggregate contents of 700–850 kg/m3 and 900–1200 kg/m3, respectively. However, no clear trend was obtained from the KDE jointplot between SP content and the \({D}_{nssm}\) classifications. As illustrated in Fig. 8, concrete mixtures should have a fresh density of 2100–2500 kg/m3 to resist chloride attack. Finally, the KDE jointplot revealed that highly chloride-resistant concrete samples have concrete compressive strengths of 45–60 MPa.

Figure 8

Bivariate KDE jointplot of variables (darker colors represent higher densities).


The data visualization plots revealed that the number of datasets in each \({D}_{nssm}\) class plays a significant role in the efficiency of the analysis. As shown in Fig. 9, category number 2 has the highest number of \({D}_{nssm}\) datasets. Training the machine on this imbalanced dataset would produce a misleading prediction model. Accordingly, the synthetic minority oversampling technique (SMOTE) is used in the present study to increase the number of datasets for training the machine. This technique is one of the most frequently used oversampling approaches to achieve a balanced training dataset: it balances the class distribution by synthesizing new minority-class instances through interpolation between existing instances and their nearest neighbors. In the present study, using SMOTE generates 1750 training datasets, with an equal number of 350 datasets for each \({D}_{nssm}\) classification. Only 80 percent of all datasets were considered for training. It is worth mentioning that SMOTE was only used for training the machine, while real experimental datasets were used for testing.
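A brief sketch of this balancing step with the imbalanced-learn library is shown below, using synthetic stand-in data; in the study, the five encoded \({D}_{nssm}\) classes were equalized to 350 training samples each.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 features and imbalanced classes 0-4,
# mimicking the encoded D_nssm categories
rng = np.random.default_rng(42)
X = rng.random((500, 12))
y = rng.choice(5, size=500, p=[0.10, 0.15, 0.45, 0.20, 0.10])

# SMOTE is applied to the training split only; real data are kept for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_bal))  # every class now matches the majority count
```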

Figure 9

Synthetic Minority Oversampling Technique (SMOTE) on training data.


Prior to building the machine learning models, it is instructive to examine the feature importance plots for both the regression and classification methodologies, as depicted in Fig. 10a,b, respectively. The feature importance score represents the significance of each input feature for a \({D}_{nssm}\) model: a feature with a higher score has a more significant influence on the predictive model of \({D}_{nssm}\). For the regression approach, the feature importance results show that superplasticizer dosage, fresh density, and water content have the highest impacts on the regression model of \({D}_{nssm}\), while compressive strength test age and FA have the lowest effects (Fig. 10a). Fresh density, coarse aggregate content, and fine aggregate content are the most crucial variables affecting the predictive model in the classification approach. As in the regression approach, FA and compressive strength test age have no effective impact on the classification model. The importance of fresh density was confirmed in both the regression and classification approaches.
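The text does not spell out the estimator behind Fig. 10, so the sketch below uses impurity-based importances from a random forest, one common way to produce such plots; the data are synthetic stand-ins named after the study's 12 input features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["water", "cement", "slag", "fly_ash", "silica_fume",
            "fine_aggregate", "coarse_aggregate", "superplasticizer",
            "fresh_density", "fc_test_age", "fc", "migration_test_age"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((300, len(features))), columns=features)
# Synthetic target loosely tied to two features, for illustration only
y = 2 * X["superplasticizer"] + X["fresh_density"] \
    + 0.05 * rng.standard_normal(300)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```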

Figure 10

Feature importance techniques for: (a) Regression; (b) Classification.


Model learning

As shown in Fig. 11, the ML method contains three types of models: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. Each type has various algorithms. Supervised learning is used in the present study to introduce a unified predictive model of \({D}_{nssm}\). Supervised techniques adjust the model to reproduce a target variable known from a training dataset. This type of ML has two main model families: (a) regression and (b) classification. The regression technique intends to reproduce a continuous output or target value, while classification produces class assignments or data encodings, dividing the data into categories denoted as “classes”. Elastic net, lasso, linear regression, ridge, random forest, KNN, decision tree, SVR, and XGB are the regression models used in the present study. Decision tree, KNN, logistic regression, XGBoost, LightGBM, random forest, and support vector machine are the classification models considered. Schematic illustrations of some regression and classification models are shown in Fig. 12. Generally, linear regression comprises four variants: simple, ridge, lasso, and elastic net. Linear regression aims to predict a dependent output (or target) through different independent variables; accordingly, this regression technique determines a linear relationship between a response variable and the other predictor variables (Fig. 12a). Support vector machines (SVM), a robust classification and regression model, convert the data into a linear decision space to classify datasets, even when the data are not linearly separable. Support vectors are the data points nearest to the separating line (or hyperplane) and determine the location and orientation of the hyperplane. A separator between the categories is created, and the data are then transformed so that the separator can be drawn as a hyperplane (Fig. 12b). K-nearest neighbors (KNN) is a non-parametric, non-linear data classification method that uses proximity to perform classifications or estimations about the grouping of an individual data point (Fig. 12c). Logistic regression extends linear regression to classification problems by passing a linear predictor through a logistic (sigmoid) function and categorizing estimates into a given class based on a likelihood threshold (Fig. 12d). It is a statistical analysis method for calculating the probability of a binary outcome, such as yes or no, based on prior observations of a dataset, and it affords discrete output. Finally, a decision tree is a hierarchical representation of the outcome space in which each node signifies a choice or an action; this model classifies or estimates an outcome by following the answers to a sequence of queries (Fig. 12e).
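As a rough guide to how such a model zoo is assembled in practice, the sketch below instantiates the candidate models named above using scikit-learn, XGBoost, and LightGBM; all hyperparameters are left at their defaults, which is an assumption rather than the study's tuned settings.

```python
from sklearn.linear_model import (ElasticNet, Lasso, LinearRegression,
                                  LogisticRegression, Ridge)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier

regressors = {
    "Linear": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(),
    "ElasticNet": ElasticNet(), "RandomForest": RandomForestRegressor(),
    "KNN": KNeighborsRegressor(), "DecisionTree": DecisionTreeRegressor(),
    "SVR": SVR(), "XGB": XGBRegressor(),
}
classifiers = {
    "DecisionTree": DecisionTreeClassifier(), "KNN": KNeighborsClassifier(),
    "Logistic": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(), "LightGBM": LGBMClassifier(),
    "RandomForest": RandomForestClassifier(), "SVM": SVC(),
}
# Each model is then fit on the balanced training data and
# scored on the held-out real test set.
```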

Figure 11

Machine learning algorithms.


Figure 12

Schematic illustrations of regression and classification models: (a) Linear regression; (b) Support vector machines (SVM); (c) K-nearest neighbors (KNN); (d) Logistic regression; (e) Decision tree.


Model evaluation

Different statistical criteria are used in the present study to determine the efficiency of the regression and classification models. For the regression models, the following criteria are used:

$$RMSE = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {x_{i} - \hat{x}_{i} } \right)^{2} }}{N}}$$

(3)

$$MAE = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {y_{i} - x_{i} } \right|}}{n} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {e_{i} } \right|}}{n}$$

(4)

$$R^{2} = 1 - \frac{RSS}{{TSS}}$$

(5)

$$MSE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {Y_{i} - \hat{Y}_{i} } \right)^{2}$$

(6)

Root mean square error (RMSE) is the square root of the mean squared error and should be minimized by the regression models to fit the datasets appropriately. In Eq. (3), N is the number of non-missing data points, \({x}_{i}\) is the actual observation, and \({\widehat{x}}_{i}\) is the estimated value. As expressed in Eq. (4), the mean absolute error (MAE) is the mean of the absolute error values; because the absolute value is used instead of squaring the errors, it is more tolerant of large estimation errors. R-squared (\({R}^{2}\)) calculates how much of the variability in a dependent variable the regression model can describe. Although it is a respectable measure of fit, \({R}^{2}\) does not account for overfitting. In Eq. (5), RSS and TSS are the residual sum of squares and the total sum of squares, respectively. As stated in Eq. (6), MSE is the mean squared error, which determines how close the fitted model is to the data points; in this equation, \(n\) is the number of data points, \({Y}_{i}\) are the observed values, and \({\widehat{Y}}_{i}\) are the predicted values. Along with these criteria, the accuracy of the regression models is also improved by K-fold cross-validation (CV) and hyper-parameter tuning techniques (Fig. 13). Generally, the K-fold CV process is used to assess the efficiency of ML models when making estimates on datasets not used during training. CV is a resampling method used to assess ML models on limited data; it has a parameter, k, that denotes the number of groups into which a given data sample is divided. The procedure for this method is outlined in the flowchart of Fig. 2. Regarding the classification models, different performance metrics based on the confusion matrix (Fig. 14) are used to determine their efficiency, including precision, recall, F1-score, accuracy, and specificity.
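Eqs. (3)-(6) and the K-fold CV loop translate directly into scikit-learn calls. The following self-contained sketch uses synthetic data and an arbitrary model; the choice of k = 10 folds is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data; in the study, X would hold the 12 mix features
# and y the measured D_nssm values
rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = X @ rng.random(12) + 0.1 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)    # Eq. (6)
rmse = np.sqrt(mse)                         # Eq. (3)
mae = mean_absolute_error(y_test, y_pred)   # Eq. (4)
r2 = r2_score(y_test, y_pred)               # Eq. (5)

# 10-fold cross-validation on the training split
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2").mean()
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  CV R2={cv_r2:.3f}")
```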

Figure 13

K-fold cross-validation (CV).


Figure 14

Confusion matrix as a performance measurement for classification models.


Results and discussion

The performance of the regression models in the ML method is summarized in Fig. 15. Figure 16 displays both the error graph and the predicted-versus-actual graph. The relationship between the actual and predicted results, as well as the error graph, was evaluated for the three algorithms with the highest accuracy: decision tree, random forest, and XGBoost; the results are depicted in Fig. 16a,b,c, respectively. The results indicate that, among the regression models, XGBoost (\({R}^{2}=\) 0.94) and SVR (\({R}^{2}=\) 0.94) show the highest accuracy. The ML analysis also revealed that elastic net, lasso, linear regression, and ridge could not precisely predict \({D}_{nssm}\). Moreover, acceptable \({R}^{2}\) scores were found for random forest, KNN, and decision tree. It is worth mentioning that achieving a reliable model (XGBoost) with this high \({R}^{2}\) score
