Impact Factor 2023: 1.7
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing.
In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.
Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% of published papers being applications-oriented and the remaining 30% containing more theoretical research. Manuscripts should be submitted in *.pdf format only. Please prepare your manuscripts in single spacing, and include figures and tables in the body of the text where they are referred to. For all enquiries regarding the submission of your manuscript, please contact the IDA journal editor: editor@ida-ij.com
Authors: Modarres, Reza
Article Type: Research Article
Abstract: Distance or dissimilarity matrices are widely used in applications. We study the relationships between the eigenvalues of distance matrices and outliers and show that outliers affect the pairwise distances and inflate the eigenvalues. We obtain the eigenvalues of a distance matrix that is affected by k outliers and compare them to the eigenvalues of a distance matrix with a constant structure. We show a discrepancy in the sizes of the eigenvalues of a distance matrix that is contaminated with outliers, present an algorithm, and offer a new outlier detection method based on the eigenvalues of the distance matrix. We compare the new distance-based outlier technique with several existing methods under five distributions. The methods are applied to a study of public utility companies and gene expression data.
Keywords: Distance matrix, decomposition, eigenvalue, outlier, detection
DOI: 10.3233/IDA-230048
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-19, 2023
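The eigenvalue-inflation effect described in the abstract above can be sketched numerically: planting a single distant point in a sample visibly enlarges the leading eigenvalue of the pairwise distance matrix (a toy illustration, not the paper's detection algorithm).

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_eigenvalues(X):
    # Pairwise Euclidean distance matrix and its eigenvalues, sorted descending.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(np.linalg.eigvalsh(D))[::-1]

X = rng.normal(size=(50, 3))   # clean sample
X_out = X.copy()
X_out[0] += 20.0               # plant a single distant outlier

ev_clean = distance_eigenvalues(X)
ev_cont = distance_eigenvalues(X_out)
# The contaminated matrix shows a clear discrepancy in eigenvalue sizes:
inflated = ev_cont[0] > ev_clean[0]
```

A detector along the paper's lines would flag samples whose removal sharply deflates the leading eigenvalues.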
Authors: Tian, Qing | Cheng, Yao
Article Type: Research Article
Abstract: The aim of unsupervised domain adaptation (UDA) in person re-identification (re-ID) is to develop a model that can identify the same individual across different cameras in the target domain, using labeled data from the source domain and unlabeled data from the target domain. However, existing UDA person re-ID methods typically assume a single source domain and a single target domain, and seldom consider the scenario of multiple source domains and a single target domain. In the latter scenario, differences in sample size between domains can lead to biased training of the model. To address this, we propose an unsupervised multi-source domain adaptation person re-ID method via sample weighting. Our approach utilizes multiple source domains to leverage valuable label information and balances the inter-domain sample imbalance through sample weighting. We also employ an adversarial learning method to align the domains. The experimental results, conducted on four datasets, demonstrate the effectiveness of our proposed method.
Keywords: Person re-identification, unsupervised domain adaptation, sample weighting, unsupervised multi-source domain adaptation
DOI: 10.3233/IDA-230178
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2023
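The inter-domain sample weighting idea above can be illustrated with a minimal sketch: each source domain's samples are down-weighted by the domain's size so that no domain dominates training (the function name and normalization are our illustrative choices, not the paper's).

```python
def domain_sample_weights(domain_sizes):
    """Return per-sample weights, one list per domain, normalized so every
    domain carries equal total weight in the loss."""
    n_domains = len(domain_sizes)
    return [[1.0 / (n_domains * size)] * size for size in domain_sizes]

# Three imbalanced source domains: 1000, 200, and 50 samples.
weights = domain_sample_weights([1000, 200, 50])
# Each domain's weights sum to 1/3, regardless of its sample count.
totals = [sum(w) for w in weights]
```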
Authors: Malik, Muhammad Shahid Iqbal | Nawaz, Aftab | Jamjoom, Mona Mamdouh | Ignatov, Dmitry I.
Article Type: Research Article
Abstract: Online product reviews (OPR) are a commonly used medium for consumers to communicate their experiences with products during online shopping. Previous studies have investigated the helpfulness of OPRs using frequency-based, linguistic, meta-data, readability, and reviewer attributes. In this study, we explored the impact of robust contextual word embeddings, topic models, and language models in predicting the helpfulness of OPRs. In addition, a wrapper-based feature selection technique is employed to select effective subsets from each type of features. Five feature generation techniques, including word2vec, FastText, Global Vectors for Word Representation (GloVe), Latent Dirichlet Allocation (LDA), and Embeddings from Language Models (ELMo), were employed. The proposed framework is evaluated on two Amazon datasets (Video games and Health & personal care). The results showed that the ELMo model outperformed the six standard baselines, including the fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model. In addition, ELMo achieved Mean Square Error (MSE) of 0.0887 and 0.0786, respectively, on the two datasets, and MSE of 0.0791 and 0.0708 with the wrapper method. This amounts to reductions of 1.43% and 1.63% in MSE compared to the fine-tuned BERT model on the respective datasets. However, the LDA model has comparable performance to the fine-tuned BERT model and still outperforms the other five baselines. The proposed framework demonstrated good generalization abilities by uncovering important factors of product reviews and can be evaluated on other voting platforms.
Keywords: Word2vec, ELMo, LDA, helpfulness prediction, semantic model, Amazon
DOI: 10.3233/IDA-230349
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2023
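Wrapper-based feature selection, as used above, treats the predictor as a black box and greedily grows a feature subset. The following is a generic forward-selection sketch under that assumption; the `score` function and toy utilities are hypothetical, not the paper's setup.

```python
def forward_select(features, score, min_gain=1e-6):
    """Greedy wrapper-style forward selection: repeatedly add the feature
    whose inclusion most improves the black-box score; stop when no
    candidate improves it by more than min_gain."""
    selected, best = [], float("-inf")
    while True:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        s, f = max(gains)
        if s <= best + min_gain:
            break
        selected, best = selected + [f], s
    return selected, best

# Toy objective: features 'a' and 'c' are useful, 'b' is noise.
utility = {"a": 0.4, "b": -0.1, "c": 0.3}
score = lambda subset: sum(utility[f] for f in subset)
subset, s = forward_select(["a", "b", "c"], score)
```

In practice `score` would be cross-validated model performance, which is what makes wrapper methods expensive but effective.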
Authors: Wang, Sijie | Li, Yifei | Chen, Diansheng | Li, Jiting | Zhang, Xiaochuan
Article Type: Research Article
Abstract: Due to the multiple types of objects and the uncertainty of their geometric structures and scales in indoor scenes, the position and pose estimation of point clouds of indoor objects by mobile robots suffers from domain gap, high learning cost, and high computing cost. In this paper, a lightweight 6D pose estimation method is proposed, which decomposes pose estimation into a viewpoint and the in-plane rotation around the optical axis of that viewpoint. An improved PointNet++ network structure and two lightweight modules are used to construct a codebook, and 6D pose estimation of the point cloud of indoor objects is completed by building and querying the codebook. The model is trained on the ShapeNetV2 dataset and validated with the ADD-S metric on the YCB-Video and LineMOD datasets, reaching 97.0% and 94.6%, respectively. The experiments show that the model can be trained to estimate the 6D position and pose of unknown object point clouds with lower computation and storage cost, and that the model, with fewer parameters and better real-time performance, is superior to other high-precision methods.
Keywords: Domain adaptation, 6D pose estimation, lightweight neural network, indoor scene, mobile robot
DOI: 10.3233/IDA-230278
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-12, 2023
Authors: Meng, Shiting | Hao, Qingbo | Xiao, Yingyuan | Zheng, Wenguang
Article Type: Research Article
Abstract: Convolutional neural networks (CNNs) have been successfully applied to music genre classification tasks. With the development of diverse music, genre fusion has become common. Fused music exhibits multiple similar musical features such as rhythm, timbre, and structure, which typically arise from the temporal information in the spectrum. However, traditional CNNs cannot effectively capture temporal information, leading to difficulties in distinguishing fused music. To address this issue, this study proposes a CNN model called MusicNeXt for music genre classification. Its goal is to enhance the feature extraction method to increase focus on musical features and increase the distinctiveness between different genres, thereby reducing classification result bias. Specifically, we construct a feature extraction module that can fully utilize temporal information, thereby enhancing its focus on musical features and its understanding of the complexity of fused music. Additionally, we introduce a genre-sensitive adjustment layer that strengthens the learning of differences between genres through within-class angle constraints. This leads to increased distinctiveness between genres and provides interpretability for the classification results. Experimental results demonstrate that our proposed MusicNeXt model outperforms baseline networks and other state-of-the-art methods in music genre classification tasks, without generating category bias in the classification results.
Keywords: Music genre classification, spectrogram, deep learning, L-softmax loss
DOI: 10.3233/IDA-230428
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-15, 2023
Authors: Yu, Mingxin | Wang, Jun | You, Rui | Ji, Xinglong | Lu, Wenshuai
Article Type: Research Article
Abstract: Person re-identification (ReID) is widely used in intelligent security, monitoring, criminal investigation, and other fields. Aiming at the problems of local occlusion, scale misalignment, and pose change of pedestrian images in real scenes, we propose a Multi-local Feature and Attention fused network (MFA) for the person re-identification task. Firstly, a Channel Point Affinity Attention (CPAA) module is embedded in the backbone network to enhance the network's ability to extract local details. The feature map output from the backbone network is horizontally segmented into four local feature maps, and four branch networks are concatenated to the feature map of the backbone network. The four local feature maps are used to guide the four branch networks to pay more attention to different areas of pedestrians through a Global Local Aligned (GLA) loss function. Finally, a pedestrian feature vector containing multi-local features is obtained. The mAP of the network on the Market-1501, DukeMTMC-reID, CUHK03, and MSMT17 datasets is 88.6%, 81.4%, 79.5%, and 64.7%, and the Rank-1 is 95.8%, 90.1%, 81.2%, and 84.1%, respectively. In addition, the model also obtains Rank-1 of 73.2% and 68.1% on the partial datasets Partial-REID and Partial-iLIDS, respectively. The MFA model has 28.3M parameters and an inference efficiency of approximately 32 fps for an image with a resolution of 256 × 128. Compared with other ReID methods, our proposed method achieves competitive performance on the ReID task. The code is available at git@github.com:ISCLab-Bistu/MFA.git.
Keywords: Person re-identification, attention mechanism, local feature, multi branches network, deep learning
DOI: 10.3233/IDA-230392
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-17, 2024
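The horizontal segmentation step in the abstract above can be sketched with plain numpy: a backbone feature map is cut along its height into four local maps, one per branch (the shapes and mean-pooling below are our illustrative choices, not the paper's exact dimensions).

```python
import numpy as np

# A (C, H, W) backbone feature map: C=8 channels, height 16, width 4.
feature_map = np.arange(8 * 16 * 4, dtype=float).reshape(8, 16, 4)

# Cut along the height axis into four horizontal stripes of equal size.
stripes = np.split(feature_map, 4, axis=1)  # four (8, 4, 4) local maps

# Each stripe can then be pooled into a local descriptor for its branch.
descriptors = [s.mean(axis=(1, 2)) for s in stripes]  # four length-8 vectors
```

In MFA-style models each descriptor would feed one branch, with the alignment loss keeping branches focused on their own body region.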
Authors: Zhang, Fu | Zhang, Wei | Wang, Gang
Article Type: Research Article
Abstract: The Resource Description Framework (RDF) is a framework for expressing information about resources in the form of triples (subject, predicate, object). The information represented by standard RDF is static, i.e., it does not change over time. To better deal with a large amount of time-related information, temporal RDF has been proposed. Consequently, how to exploit index technology to efficiently query temporal information has become an important research issue, but research on indexes for temporal RDF is still limited, especially indexes for bitemporal RDF. Bitemporal RDF can represent more complicated situations (e.g., RDF triples with both valid time and transaction time). Indexes for bitemporal RDF can further expand the application scenarios and functions of temporal RDF. In this paper, we propose an efficient index for bitemporal RDF queries. The index innovatively introduces a re-designed skip list structure for the bitemporal RDF query. We also investigate how to cover almost all query patterns with as few indexes as possible. In addition, although the proposed index is conceived for temporal RDF, it also takes into account the performance of standard RDF queries when the time element is unknown. Finally, we run experiments with synthetic data sets of different sizes using the Lehigh University Benchmark (LUBM), and the results prove that the proposed index is scalable and effective.
Keywords: Resource Description Framework (RDF), temporal RDF, bitemporal RDF, index
DOI: 10.3233/IDA-230609
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
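A bitemporal triple carries both a valid-time and a transaction-time interval. The linear-scan sketch below shows what a query over such triples must check; the paper's contribution is a skip-list index that avoids exactly this scan, and the store contents here are hypothetical.

```python
def matches(triple, pattern, valid_at, tx_at):
    """A triple matches when its (s, p, o) fits the pattern (None = wildcard)
    and both query time points fall inside its two intervals."""
    (s, p, o), (vs, ve), (ts, te) = triple
    return (all(q is None or q == v for q, v in zip(pattern, (s, p, o)))
            and vs <= valid_at <= ve
            and ts <= tx_at <= te)

# Toy bitemporal store: each entry is (triple, valid interval, transaction interval).
store = [
    (("alice", "worksFor", "acme"), (2010, 2015), (2010, 2023)),
    (("alice", "worksFor", "globex"), (2016, 2023), (2016, 2023)),
]
hits = [t for t in store
        if matches(t, ("alice", "worksFor", None), valid_at=2017, tx_at=2020)]
```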
Authors: Malhotra, Ruchika | Cherukuri, Madhukar
Article Type: Research Article
Abstract: BACKGROUND: Software quality prediction models play a crucial role in identifying vulnerable software components during early stages of development, thereby optimizing resource allocation and enhancing overall software quality. While various classification algorithms have been employed for developing these prediction models, most studies have relied on default hyperparameter settings, leading to significant variability in model performance. Tuning the hyperparameters of classification algorithms can enhance the predictive capability of quality models by identifying optimal settings for improved accuracy and effectiveness. METHOD: This systematic review examines studies that have utilized hyperparameter tuning techniques to develop prediction models in the software quality domain. The review focused on diverse areas such as defect prediction, maintenance estimation, change impact prediction, reliability prediction, and effort estimation, as these domains demonstrate the wide applicability of common learning algorithms. RESULTS: This review identified 31 primary studies on hyperparameter tuning for software quality prediction models. The results demonstrate that tuning the parameters of classification algorithms enhances the performance of prediction models. Additionally, the study found that certain classification algorithms exhibit high sensitivity to their parameter settings, achieving optimal performance when tuned appropriately. Conversely, other classification algorithms exhibit low sensitivity to their parameter settings, making tuning unnecessary in such instances. CONCLUSION: Based on the findings of this review, we conclude that the predictive capability of software quality prediction models can be significantly improved by tuning their hyperparameters. To facilitate effective hyperparameter tuning, we provide practical guidelines derived from the insights obtained through this study.
Keywords: Hyperparameter tuning, machine learning, defect prediction, effort estimation, maintenance prediction, reliability
DOI: 10.3233/IDA-230653
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-19, 2024
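The kind of hyperparameter tuning surveyed above can be sketched as a plain grid search driven by a validation score (the toy objective below is hypothetical, standing in for cross-validated model performance):

```python
from itertools import product

def grid_search(grid, evaluate):
    """Exhaustively try every combination in the grid, keeping the
    hyperparameter setting with the best validation score."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy validation objective peaking at depth=3, lr=0.1.
evaluate = lambda p: -abs(p["depth"] - 3) - abs(p["lr"] - 0.1)
params, score = grid_search({"depth": [1, 3, 5], "lr": [0.01, 0.1, 1.0]}, evaluate)
```

The review's point that some algorithms are tuning-insensitive corresponds to a flat `evaluate` surface, where the default setting already sits near the optimum.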
Authors: Zhang, Xu | Hu, Xiaoyu | Liu, Zejie | Xiang, Yanzheng | Zhou, Deyu
Article Type: Research Article
Abstract: Text-to-SQL, a computational linguistics task, seeks to facilitate the conversion of natural language queries into SQL queries. Recent methodologies have leveraged the concept of slot-filling in conjunction with predetermined SQL templates to effectively bridge the semantic gap between natural language questions and structured database queries, achieving commendable performance by harnessing the power of multi-task learning. However, employing identical features across diverse tasks is an ill-suited practice, fraught with inherent drawbacks. Firstly, based on our observation, there are clear boundaries in the natural language corresponding to SELECT and WHERE clauses. Secondly, the exclusive features integral to each subtask are inadequately emphasized and underutilized, thereby hampering the acquisition of discriminative features for each specific subtask. In an endeavor to rectify these issues, the present work introduces an innovative approach: the hierarchical feature decoupling model for SQL query generation from natural language. This novel approach involves the deliberate separation of features pertaining to subtasks within both SELECT and WHERE clauses, further dissociating these features at the subtask level to foster better model performance. Empirical results derived from experiments conducted on the WikiSQL benchmark dataset reveal the superiority of the proposed approach over several state-of-the-art baseline methods in the context of text-to-SQL query generation.
Keywords: Text-to-SQL, multi-task learning, discriminative features, feature decoupling
DOI: 10.3233/IDA-230390
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-15, 2024
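The slot-filling setting described above can be sketched as assembling subtask predictions into a fixed SQL skeleton (a simplified illustration of the template approach, not the proposed hierarchical decoupling model):

```python
def fill_template(select_col, agg, where):
    """Assemble slot predictions into a WikiSQL-style template:
    SELECT [agg](col) FROM table WHERE col op val AND ..."""
    col = f"{agg}({select_col})" if agg else select_col
    sql = f"SELECT {col} FROM table"
    if where:
        sql += " WHERE " + " AND ".join(f"{c} {op} {v!r}" for c, op, v in where)
    return sql

# Slot predictions from the SELECT and WHERE subtasks (toy values).
query = fill_template("score", "MAX", [("player", "=", "Lee")])
```

The paper's observation is that the SELECT slots and WHERE slots are best predicted from decoupled features rather than one shared representation.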
Authors: Al-Jumaili, Ahmed Hadi Ali | Muniyandi, Ravie Chandren | Hasan, Mohammad Kamrul | Singh, Mandeep Jit | Paw, Johnny Koh Siaw | Al-Jumaily, Abdulmajeed
Article Type: Research Article
Abstract: Parallel power load anomalies are processed by a fast density peak clustering technique that capitalizes on the hybrid strengths of the Canopy and K-means algorithms within Apache Mahout's distributed machine-learning environment. The study taps into Apache Hadoop's robust tools for data storage and processing, including HDFS and MapReduce, to effectively manage and analyze big data challenges. The preprocessing phase utilizes Canopy clustering to expedite the initial partitioning of data points, which is subsequently refined by K-means to enhance clustering performance. The hybrid algorithm was implemented to minimize the time needed to address the massive scale of the detected parallel power load abnormalities. Data vectors are generated based on the time needed, sequential and parallel candidate feature data are obtained, and the data rate is combined. After clustering the time set using Canopy with the K-means algorithm and a factor-weighted vector representation, the clustering impact is assessed using purity, precision, recall, and F value. Experimental results confirm that incorporating Canopy as an initial step markedly reduces both the computational effort required to process the vast quantity of parallel power load abnormalities and the running time of the K-means algorithm. Additionally, tests demonstrate that combining Canopy and the K-means algorithm performs consistently and dependably on the Hadoop platform, yielding a clustering result that offers a scalable and effective solution for power system monitoring.
Keywords: Power load data, abnormality detection and adjustment, hybrid (CKMA), K-means algorithm (KMA), canopy algorithm (CA), Apache Mahout
DOI: 10.3233/IDA-230573
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-26, 2024
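The Canopy-then-K-means pipeline above rests on Canopy's cheap pre-partitioning. A single-machine, one-dimensional sketch of the canopy step follows (thresholds and data are illustrative; the real system runs this distributed via Mahout/MapReduce): points within the loose threshold T1 of a centre join its canopy, and points within the tight threshold T2 are removed from further consideration.

```python
def canopy(points, t1, t2):
    """Cheap canopy pre-clustering with loose threshold t1 > tight threshold t2.
    Returns (centre, members) pairs; K-means would refine each canopy."""
    remaining = sorted(points, reverse=True)
    canopies = []
    while remaining:
        centre = remaining.pop(0)
        members = [p for p in [centre] + remaining if abs(p - centre) < t1]
        canopies.append((centre, members))
        # Points tightly bound to this centre never start a new canopy.
        remaining = [p for p in remaining if abs(p - centre) >= t2]
    return canopies

canopies = canopy([0.0, 0.2, 0.4, 5.0, 5.1, 9.9], t1=2.0, t2=1.0)
```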
Authors: Zhou, Rucheng | Zhang, Dongmei | Zhu, Jiabao | Min, Geyong
Article Type: Research Article
Abstract: Traffic forecasting has become a core component of Intelligent Transportation Systems. However, accurate traffic forecasting is very challenging due to complex traffic road networks. Most existing forecasting methods do not fully consider the topological structure information of road networks, making it difficult to extract accurate spatial features. In addition, spatial and temporal features have different impacts on traffic conditions, but existing studies ignore the distribution of spatial-temporal features in traffic regions. To address these limitations, we propose a novel graph neural network architecture named Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). The originality of AST-AIGN is to obtain a spatial feature that more accurately reflects the topological structure of road networks by embedding Graph Attention Network (GAT) into Jumping Knowledge Net (JK-Net). We propose a data-dependent function called the spatial-temporal adaptive integration gate to process the diversity of feature distributions and highlight features in road networks that significantly affect traffic conditions. We evaluate our model on two real-world traffic datasets from the Caltrans Performance Measurement System (PEMS04 and PEMS08), and extensive experimental results demonstrate that the proposed AST-AIGN architecture outperforms other baselines.
Keywords: Traffic forecasting, spatial-temporal dependences, jumping knowledge, gating mechanism, self-attention
DOI: 10.3233/IDA-230101
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Yu, Dongjin | Ni, Ke | Li, Zhongyang | Zhang, Shengyi | Sun, Xiaoxiao | Hou, Wenjie | Ying, Yuke
Article Type: Research Article
Abstract: Process discovery techniques analyze process logs to extract models that characterize the behavior of business processes. In real-life logs, however, noise exists and adversely affects the extraction, thus decreasing the understandability of discovered models. In this paper, we propose a novel double granularity filtering method, executed on both the event and trace levels, to detect noise by analyzing the directly-following and parallel relations between events. Based on the probability of an event occurring in a sequence, the infrequent behaviors and redundant events in the logs can be filtered out. In addition, the missing events in parallel blocks are detected to further improve the performance of filtering. Experiments on synthetic logs and five real-life datasets demonstrate that our method significantly outperforms other state-of-the-art methods.
Keywords: Process discovery, process mining, event logs, noise filtering, event dependency, parallel relation
DOI: 10.3233/IDA-230118
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
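A frequency-based stand-in for the event-level filtering described above: directly-following pairs that are rare relative to the most frequent successor of the same event are flagged as noise (the threshold and traces are illustrative, not the paper's exact probabilistic criterion).

```python
from collections import Counter

def noisy_pairs(traces, threshold=0.2):
    """Flag directly-following pairs (a, b) whose count is small relative to
    the most frequent successor of the same event a."""
    follows = Counter((a, b) for t in traces for a, b in zip(t, t[1:]))
    best = {}
    for (a, _), n in follows.items():
        best[a] = max(best.get(a, 0), n)
    return {pair for pair, n in follows.items() if n < threshold * best[pair[0]]}

# Ten clean traces a->b->c->d, plus one trace where 'b->x' happens once.
traces = [list("abcd")] * 10 + [list("abxd")]
noise = noisy_pairs(traces)
```

A real process-discovery filter would also use parallel relations and trace-level statistics before removing events.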
Authors: Belbekri, Adel | Benchikha, Fouzia | Slimani, Yahya | Marir, Naila
Article Type: Research Article
Abstract: Named Entity Recognition (NER) is an essential task in Natural Language Processing (NLP), and deep learning-based models have shown outstanding performance. However, the effectiveness of deep learning models in NER relies heavily on the quality and quantity of labeled training datasets available. A novel and comprehensive training dataset called SocialNER2.0 is proposed to address this challenge. Based on selected datasets dedicated to different tasks related to NER, the SocialNER2.0 construction process involves data selection, extraction, enrichment, conversion, and balancing steps. The pre-trained BERT (Bidirectional Encoder Representations from Transformers) model is fine-tuned using the proposed dataset. Experimental results highlight the superior performance of the fine-tuned BERT in accurately identifying named entities, demonstrating the SocialNER2.0 dataset's capacity to provide valuable training data for performing NER in human-produced texts.
Keywords: Big data, deep learning, user-generated texts, text analysis, named entity recognition
DOI: 10.3233/IDA-230588
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Hu, Haiping | Huo, Wei | Yan, Yingying | Zhu, Qiuyu
Article Type: Research Article
Abstract: For pattern recognition, most classification models are solved iteratively, with exceptions such as linear LDA, KLDA, and ELM. In this paper, a nonlinear classification network model based on predefined evenly-distributed class centroids (PEDCC) is proposed. Its analytical solution can be obtained and has good interpretability. Using the PEDCC characteristic of maximizing the inter-class distance and a derivative weighted minimum mean square error loss function to minimize the intra-class distance, we not only realize effective nonlinearity in the network but also obtain an analytical solution for the network weights. The samples are then classified based on GDA. In order to further improve classification performance, PCA is used to reduce the dimensionality of the original samples, while the CReLU activation function is adopted to enhance the expressive ability of the features. The network transforms the samples into a higher-dimensional feature space through the weighted minimum mean square error, so as to find a better separating hyperplane. In experiments, the feasibility of the network structure is verified with purely linear W, W + Tanh, and PCA + W + Tanh variants on many small and large data sets, and compared with SVM and ELM in terms of training speed and recognition rate. The results show that, in general, this model has advantages on small data sets in both recognition accuracy and training speed, while on large data sets it has an advantage in training speed. Finally, by introducing a multi-stage network structure based on the latent feature norm, the classifier network further improves classification performance significantly: the recognition rate on small data sets is effectively improved and is much higher than that of existing methods, while the recognition rate on large data sets is similar to that of SVM.
Keywords: Pattern recognition, image classification, machine learning, GDA
DOI: 10.3233/IDA-230044
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Cao, Jinhui | Di, Xiaoqiang | Liu, Xu | Xu, Rui | Li, Jinqing | Ren, Weiwu | Qi, Hui | Hu, Pengfei | Zhang, Kehan | Li, Bo
Article Type: Research Article
Abstract: Logs play an important role in anomaly detection, fault diagnosis, and trace checking of software and network systems. Log parsing, which converts each raw log line to a constant template and a variable parameter list, is a prerequisite for system security analysis. Traditional parsing methods utilizing specific rules can only parse logs of specific formats, and most parsing methods based on deep learning require labels. However, the existing parsing methods are not applicable to logs of inconsistent formats and insufficient labels. To address these issues, we propose a robust Log parsing method based on Self-supervised Learning (LogSL), which can extract templates from logs of different formats. The essential idea of LogSL is modeling log parsing as a multi-token prediction task, which makes the multi-token prediction model learn the distribution of tokens belonging to the template in raw log lines in a self-supervised mode. Furthermore, to accurately predict the tokens of the template without labeled data, we construct a Multi-token Prediction Model (MPM) combining a pre-trained XLNet module, an n-layer stacked Long Short-Term Memory Net module, and a Self-attention module. We validate LogSL on 12 benchmark log datasets, with the average parsing accuracy of our parser being 3.9% higher than that of the best baseline method. Experimental results show that LogSL has superiority in terms of robustness and accuracy. In addition, a case study of anomaly detection is conducted to demonstrate the support of the proposed MPM for log-based system security tasks.
Keywords: System security, data analysis, log parsing, deep learning, self-supervised learning
DOI: 10.3233/IDA-230133
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
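What a log parser must output can be shown with a toy frequency-based template extractor: token positions that vary across aligned log lines become the wildcard `<*>` (a baseline-style illustration of the template/parameter split, not the self-supervised MPM model):

```python
def extract_template(lines):
    """Align whitespace-tokenized log lines of equal length; positions where
    all lines agree are template tokens, the rest become the parameter
    wildcard <*>."""
    tokens = [line.split() for line in lines]
    cols = zip(*tokens)
    return " ".join(c[0] if len(set(c)) == 1 else "<*>" for c in cols)

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.7 closed",
]
template = extract_template(logs)
```

Real parsers such as LogSL must also cope with lines of different lengths and with constant-looking tokens that are actually parameters.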
Authors: Nayancy, | Dutta, Sandip | Chakraborty, Soubhik
Article Type: Research Article
Abstract: Blockchain has attracted tremendous attention in recent years due to its significant features including anonymity, security, immutability, and auditability. Blockchain technology has been used in several nonmonetary applications, including the Internet-of-Things (IoT). However, IoT devices have limited resources, and blockchain scalability is computationally expensive, resulting in delays and large bandwidth overheads that are unsuitable for many IoT devices. In this paper, we present a lightweight blockchain approach that is suited to IoT needs and provides end-to-end security. Decentralization is achieved in our lightweight blockchain implementation by building a network in which many high-resource devices collaborate to maintain the blockchain. The nodes in the network are arranged in sorted order with respect to execution time and count to reduce the mining overhead and are accountable for handling the public blockchain. We propose a distributed execution-time-based consensus algorithm that decreases the delay and overhead of the mining process. We also propose a randomized node-selection algorithm for selecting the nodes that verify the mined blocks, to eliminate double-spend and 51% attacks. The results are encouraging, significantly reducing the mining overhead and keeping a check on the double-spending problem and the 51% attack.
Keywords: Blockchain, IoT, lightweight consensus, double-spend attack, 51% attack
DOI: 10.3233/IDA-230153
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
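The randomized node-selection idea can be sketched as uniform sampling of a verification committee from the node pool; an attacker cannot know in advance which nodes will check a block. The committee size and the shared seed below are illustrative assumptions, not the paper's parameters.

```python
import random

def pick_validators(nodes, k, seed):
    """Sample k distinct verifier nodes uniformly; the seed stands in for
    shared randomness (e.g. derived from the previous block), so all honest
    nodes compute the same committee."""
    rng = random.Random(seed)
    return rng.sample(nodes, k)

nodes = [f"node-{i}" for i in range(20)]
committee = pick_validators(nodes, k=5, seed=42)
```

With uniform sampling, a coalition holding a fraction p of nodes controls a whole committee of size k only with probability roughly p**k, which is the lever against 51%-style attacks.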
Authors: Boullé, Marc
Article Type: Research Article
Abstract: Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Although many approaches have been proposed in the literature to infer these parameters, most existing histogram methods are difficult to exploit for exploratory analysis in the case of real-world data sets, with scalability issues, truncated data, outliers or heavy-tailed distributions. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter. We then propose to extend this method by exploiting a new modeling space based on floating-point representation, with the objective of building histograms resistant to outliers or heavy-tailed distributions. We also suggest several heuristics and a methodology suitable for the exploratory analysis of large scale real-world data sets, whose underlying patterns are difficult to recover for digitization reasons. Extensive experiments show the benefits of the approach, evaluated with a dual objective: the accuracy of density estimation in the case of outliers or heavy-tailed distributions, and the effectiveness of the approach for exploratory data analysis.
Keywords: Density estimation, histograms, model selection, minimum description length, exploratory analysis
DOI: 10.3233/IDA-230638
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-48, 2024
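A floating-point modeling space can be illustrated by binning each value on the sign and binary exponent of its float representation, so bin width grows with magnitude and heavy tails occupy boundedly many bins (this is our reading of the idea, not the exact G-Enum extension):

```python
import math

def fp_bin(x):
    """Map x to a (sign, binary exponent) bin. math.frexp gives
    abs(x) = m * 2**exp with 0.5 <= m < 1, so each bin covers one octave."""
    if x == 0:
        return (0, 0)
    _, exp = math.frexp(abs(x))
    return (1 if x > 0 else -1, exp)

values = [0.001, 0.5, 0.7, 3.0, 1e12, -1e12]
bins = [fp_bin(v) for v in values]
```

A histogram over such bins spans twelve orders of magnitude with a few dozen bins, whereas equal-width binning would need astronomically many.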
Authors: Shi, Xuefeng | Hu, Min | Ren, Fuji | Shi, Piao
Article Type: Research Article
Abstract: Active Learning (AL) is a technique widely employed to minimize the time and labor costs of annotating data. By querying and extracting the specific instances to train the model, the relevant task's performance is improved maximally within limited iterations. However, little work has been conducted to fully fuse features from different hierarchies to enhance the effectiveness of active learning. Inspired by the idea of information compensation in many famous deep learning models (such as ResNet), this work proposes a novel TextCNN-based Two-way Active Learning model (TCTWAL) to extract task-relevant texts. TextCNN takes advantage of little hyper-parameter tuning and static vectors and achieves excellent results on various natural language processing (NLP) tasks, which are also beneficial to human-computer interaction (HCI) and AL-relevant tasks. In the proposed AL model, candidate texts are measured from both global and local features by the proposed AL framework TCTWAL, which depends on the modified TextCNN. Besides, the query strategy is strongly enhanced by maximum normalized log-probability (MNLP), which is sensitive to longer sentences. Additionally, the selected instances are characterized by general global information and abundant local features simultaneously. To validate the effectiveness of the proposed model, extensive experiments are conducted on three widely used text corpora, and the results are compared with eight manually designed instance query strategies. The results show that our method outperforms the baselines in terms of accuracy, macro precision, macro recall, and macro F1 score. In particular, for the classification results on the AG's News corpus, the improvements of the four indicators after 39 iterations are 40.50%, 45.25%, 48.91%, and 45.25%, respectively.
Keywords: Active learning, TextCNN, maximum normalized log-probability, global information, local feature
DOI: 10.3233/IDA-230332
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
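The maximum normalized log-probability (MNLP) query strategy mentioned in the abstract above normalizes a sentence's total log-probability by its length, so long sentences are not queried merely for accumulating more uncertainty over more tokens. A minimal sketch (the function names and selection loop are illustrative, not from the paper):

```python
def mnlp_score(token_log_probs):
    """Maximum Normalized Log-Probability: the average per-token
    log-probability of a model's prediction for one sentence."""
    return sum(token_log_probs) / len(token_log_probs)

def select_for_annotation(candidates, k):
    """candidates: list of (sentence_id, token_log_probs).
    Returns the k least-confident sentences (lowest MNLP first),
    which an active learner would send for annotation."""
    scored = [(sid, mnlp_score(lp)) for sid, lp in candidates]
    scored.sort(key=lambda t: t[1])  # ascending: least confident first
    return [sid for sid, _ in scored[:k]]
```

Without the division by length, total log-probability alone would systematically flag long sentences as uncertain; MNLP removes that bias.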
Authors: Zhong, Qing | Shao, Xinhui
Article Type: Research Article
Abstract: For the aspect-based sentiment analysis task, traditional works consider only the text modality. However, in social media scenarios, texts often contain abbreviations, clerical errors, or grammatical errors, which invalidate traditional methods. In this study, a cross-modal hierarchical interactive fusion network incorporating an end-to-end approach is proposed to address this challenge. In the network, a feature attention module and a feature fusion module are proposed to obtain the multimodal interaction feature between the image modality and the text modality. Through an attention mechanism and a gated fusion mechanism, these two modules let the image play an auxiliary role in the text-based aspect-based sentiment analysis task. Meanwhile, a boundary auxiliary module is used to explore the dependencies between the two core subtasks of aspect-based sentiment analysis. Experimental results on two publicly available multimodal aspect-based sentiment datasets validate the effectiveness of the proposed approach.
Keywords: Multimodal aspect-based sentiment analysis, hierarchical interactive fusion, multi-head interaction attention mechanism, gated mechanism
DOI: 10.3233/IDA-230305
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Chen, Hongwei | Shi, Dewei | Zhou, Xun | Zhang, Man | Liu, Luanxuan
Article Type: Research Article
Abstract: Credit fraud is a common financial crime that causes significant economic losses to financial institutions. To address this issue, researchers have proposed various fraud detection methods. Recently, research on deep forests has opened up a new path for exploring deep models beyond neural networks. The deep forest combines features of neural networks and ensemble learning and has achieved good results in various fields. This paper studies the application of deep forests to fraud detection and proposes a distributed dense rotation deep forest algorithm (DRDF-spark) based on an improved RotBoost. The model has three main characteristics. First, by introducing RotBoost it addresses the difficulty that multi-granularity scanning faces when the data lack spatial correlation. Second, Spark is used for parallel construction to improve data processing speed and efficiency. Third, a pre-aggregation mechanism is added to the distributed algorithm to locally aggregate the statistical results of sub-forests on the same node in advance, improving communication efficiency. Experiments show that DRDF-spark performs better than deep forests and some mainstream ensemble learning algorithms on the fraud dataset used in this paper, with training up to 3.53 times faster. Furthermore, if the number of nodes is increased further, the speedup ratio continues to grow.
Keywords: Deep forest, credit fraud detection, ensemble learning, RotBoost, spark
DOI: 10.3233/IDA-230193
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Jiménez-Gaona, Yuliana | Rodríguez-Alvarez, María José | Escudero, Líder | Sandoval, Carlos | Lakshminarayanan, Vasudevan
Article Type: Research Article
Abstract: INTRODUCTION: Ultrasound, in conjunction with mammography imaging, plays a vital role in the early detection and diagnosis of breast cancer. However, speckle noise affects medical ultrasound images and degrades visual radiological interpretation. Speckle carries information about the interactions of the ultrasound pulse with the tissue microstructure, which generally causes several difficulties in identifying malignant and benign regions. The application of deep learning to image denoising has gained more attention in recent years. OBJECTIVES: The main objective of this work is to reduce speckle noise while preserving features and details in breast ultrasound images using GAN models. METHODS: We propose two GAN models (a Conditional GAN and a Wasserstein GAN) for speckle denoising on public breast ultrasound databases: BUSI (Dataset A) and UDIAT (Dataset B). The Conditional GAN model was trained using the Unet architecture, and the WGAN model was trained using the Resnet architecture. Image quality for both algorithms was measured against the Peak Signal-to-Noise Ratio (PSNR, 35–40 dB) and Structural Similarity Index (SSIM, 0.90–0.95) standard values. RESULTS: The experimental analysis clearly shows that the Conditional GAN model achieves better breast ultrasound despeckling performance across the datasets, with PSNR = 38.18 dB and SSIM = 0.96, compared with the WGAN model (PSNR = 33.0068 dB and SSIM = 0.91) on the small ultrasound training datasets. CONCLUSIONS: The observed performance differences between CGAN and WGAN will help to better implement new tasks in a computer-aided detection/diagnosis (CAD) system. In future work, these data can be used as CAD training input for image classification, reducing overfitting and improving the performance and accuracy of deep convolutional algorithms.
Keywords: Breast cancer, ultrasound image denoising, generative adversarial network
DOI: 10.3233/IDA-230631
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
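PSNR, one of the two quality metrics reported in the abstract above, follows directly from the mean squared error between a reference image and its denoised counterpart. A minimal sketch for flat 8-bit pixel lists (illustrative only, not the authors' evaluation code):

```python
import math

def psnr(reference, denoised, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized
    images given as flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, denoised)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher is better: the 35–40 dB range quoted above corresponds to an MSE of roughly 6.5 to 21 for 8-bit images.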
Authors: Göcs, László | Johanyák, Zsolt Csaba
Article Type: Research Article
Abstract: Intrusion detection systems (IDSs) are essential elements of IT systems. Their key component is a classification module that continuously evaluates features of the network traffic and identifies possible threats. Its efficiency is greatly affected by the right selection of the features to be monitored. Therefore, identifying a minimal set of features that are necessary to safely distinguish malicious traffic from benign traffic is indispensable in the course of developing an IDS. This paper presents the preprocessing and feature selection workflow, as well as its results, for the CSE-CIC-IDS2018 on AWS dataset, focusing on five attack types. To identify the relevant features, six feature selection methods were applied, and the final ranking of the features was elaborated based on their average score. Next, several subsets of the features were formed using different ranking threshold values, and each subset was evaluated with five classification algorithms to determine the optimal feature set for each attack type. During the evaluation, four widely used metrics were taken into consideration.
Keywords: Dataset preprocessing, dimension reduction, feature selection, classification, Python, CSE-CIC-IDS2018
DOI: 10.3233/IDA-230264
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-27, 2024
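The ranking-aggregation step described above (six selectors, final ordering by average score) can be approximated by averaging per-method ranks; the dict-based interface and tie-breaking below are illustrative assumptions, not the paper's exact procedure:

```python
def average_rank(score_tables):
    """score_tables: list of dicts {feature: score}, one per feature
    selection method (higher score = more relevant).
    Returns features ordered best-first by their mean rank."""
    features = list(score_tables[0])
    ranks = {f: [] for f in features}
    for table in score_tables:
        ordered = sorted(table, key=table.get, reverse=True)
        for pos, f in enumerate(ordered, start=1):
            ranks[f].append(pos)
    mean_rank = {f: sum(r) / len(r) for f, r in ranks.items()}
    return sorted(features, key=mean_rank.get)  # lowest mean rank first
```

Threshold-based subsets then follow by taking prefixes of the returned ordering.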
Authors: Zhang, Shuo | Hu, Xingbang | Zhang, Wenbo | Chen, Jinyi | Huang, Hejiao
Article Type: Research Article
Abstract: For a modern Intelligent Transportation System (ITS), data missing during traffic raster acquisition is inevitable because of loop detector malfunction or signal interference. Nevertheless, missing data imputation is meaningful due to the periodic spatio-temporal characteristics and individual randomness of traffic raster data. In this paper, the traffic raster data collected from all spatial regions at each time interval are considered a multi-channel image. Accordingly, the traffic raster data over a period of time can be regarded as a video, on which an unsupervised generative neural network called MSST-VAE (Multiple Streams Spatial Temporal-VAE) is proposed for traffic raster data imputation; the model performs robustly even at varied missing rates, where many other approaches fail. Two major innovations can be summarized in MSST-VAE. First, it uses multiple periodic streams of Variational Auto-Encoders (VAEs) with Sylvester Normalizing Flows (SNFs), which show strong generalization ability. Second, after the traffic raster data are transformed into videos, an ECB (Extraction-and-Calibration Block) consisting of dilated P3D gated convolution and a multi-horizon attention mechanism is employed to learn global-local-granularity spatial features and long-short-term temporal features. Extensive experiments on three real traffic flow datasets validate that MSST-VAE outperforms other classical traffic imputation models with the least imputation error.
Keywords: Intelligent transportation system, traffic raster data, data imputation
DOI: 10.3233/IDA-230091
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
Authors: Chen, Mingcai | Du, Yuntao | Tang, Wei | Zhang, Baoming | Wang, Chongjun
Article Type: Research Article
Abstract: Real-world machine learning applications seldom provide perfectly labeled data, posing a challenge in developing models robust to noisy labels. Recent methods prioritize noise filtering based on the discrepancies between model predictions and the provided noisy labels, assuming samples with minimal classification losses to be clean. In this work, we capitalize on the consistency between the learned model and the complete noisy dataset, employing the data’s rich representational and topological information. We introduce LaplaceConfidence, a method that obtains label confidence (i.e., clean probabilities) utilizing the Laplacian energy. Specifically, it first constructs graphs based on the feature representations of all noisy samples and minimizes the Laplacian energy to produce a low-energy graph. Clean labels should fit well into the low-energy graph while noisy ones should not, allowing our method to determine the data’s clean probabilities. Furthermore, LaplaceConfidence is embedded into a holistic method for robust training, where a co-training technique generates unbiased label confidence and a label refurbishment technique better utilizes it. We also explore a dimensionality reduction technique to accommodate our method on large-scale noisy datasets. Our experiments demonstrate that LaplaceConfidence outperforms state-of-the-art methods on benchmark datasets under both synthetic and real-world noise. Code is available at https://github.com/chenmc1996/LaplaceConfidence.
Keywords: Learning with noisy labels, graph energy, label refurbishment
DOI: 10.3233/IDA-230818
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-17, 2024
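The Laplacian energy at the heart of LaplaceConfidence measures how much a labeling disagrees with a similarity graph: the sum over edges of w_ij · (y_i − y_j)². A toy sketch follows; the per-node confidence mapping is a hypothetical simplification for illustration, not the paper's exact formulation:

```python
def laplacian_energy(weights, labels):
    """Graph smoothness of a labeling: sum of w_ij * (y_i - y_j)^2 over
    edges. weights: dict {(i, j): similarity}; labels: dict {node: value}.
    Noisy labels sit on high-weight edges with disagreeing endpoints and
    so contribute disproportionately to this energy."""
    return sum(w * (labels[i] - labels[j]) ** 2 for (i, j), w in weights.items())

def label_confidence(weights, labels, node):
    """Per-node view: energy contributed by edges incident to `node`,
    squashed into (0, 1] so that smooth neighborhoods score near 1."""
    e = sum(w * (labels[i] - labels[j]) ** 2
            for (i, j), w in weights.items() if node in (i, j))
    return 1.0 / (1.0 + e)
```

A node whose label agrees with all its neighbors contributes zero energy and gets confidence 1.0; disagreement lowers the score.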
Authors: Fabra-Boluda, Raül | Ferri, Cèsar | Hernández-Orallo, José | Ramírez-Quintana, M. José | Martínez-Plumed, Fernando
Article Type: Research Article
Abstract: The quest for transparency in black-box models has gained significant momentum in recent years. In particular, discovering the underlying machine learning technique type (or model family) from the performance of a black-box model is an important problem, both for better understanding its behaviour and for developing strategies to attack it by exploiting the weaknesses intrinsic to the learning technique. In this paper, we tackle the challenging task of identifying which kind of machine learning model is behind the predictions when we interact with a black-box model. Our method involves systematically querying a black-box model (the oracle) to label an artificially generated dataset, which is then used to train surrogate models using machine learning techniques from different families (each trying to partially approximate the oracle’s behaviour). We present two approaches based on similarity measures, one selecting the most similar family and the other using a conveniently constructed meta-model. In both cases, we use crisp and soft classifiers and their corresponding similarity metrics. By experimentally comparing all these methods, we gain valuable insights into the explanatory and predictive capabilities of our model family concept. This provides a deeper understanding of black-box models, increases their transparency and interpretability, and paves the way for more effective decision making.
Keywords: Machine learning, family identification, adversarial, black-box, surrogate models
DOI: 10.3233/IDA-230707
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-21, 2024
Authors: Liu, Zhao | Wang, Aimin | Bao, Haiming | Zhang, Kunpeng | Wu, Jing | Sun, Geng | Li, Jiahui
Article Type: Research Article
Abstract: The goal of feature selection in machine learning is to maintain high classification accuracy while reducing a large number of attributes. In this paper, we first design a fitness function that achieves both objectives jointly. We then propose a chaos-based binary dragonfly algorithm (CBDA) that incorporates several improvements over the conventional dragonfly algorithm (DA), yielding a wrapper-based feature selection method that optimizes this fitness function. Specifically, the CBDA introduces three improved factors, namely a chaotic map, an evolutionary population dynamics (EPD) mechanism, and a binarization strategy, on the basis of the conventional DA to balance the exploitation and exploration capabilities of the algorithm and make it more suitable for the formulated problem. We conduct experiments on 24 well-known datasets from the UCI repository with three ablated versions of CBDA targeting different components of the algorithm, in order to explain their contributions in CBDA, and also with five established comparative algorithms, in terms of fitness value, classification accuracy, CPU running time, and number of selected features. The results show that the proposed CBDA has remarkable advantages on most of the tested datasets.
Keywords: Feature selection, dragonfly algorithm, chaos, evolutionary population dynamics, classification accuracy
DOI: 10.3233/IDA-230540
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-36, 2024
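Wrapper-based feature selection methods like CBDA typically collapse the two objectives, classification accuracy and subset size, into a single fitness value. A common weighted form is sketched below; the paper's exact weights are not stated here, and alpha = 0.99 is a conventional assumption, not the authors' setting:

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Weighted wrapper-selection objective (lower is better): trade
    classification error against the fraction of retained features.
    alpha close to 1 prioritizes accuracy over compactness."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)
```

The metaheuristic search (here, the binary dragonfly) then looks for the bit-vector of selected features minimizing this value.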
Authors: Feng, Zhuo | Du, Yajun | Huang, Jiaming | Li, Xianyong | Chen, Xiaoliang | Xie, Chunzhi
Article Type: Research Article
Abstract: Large-scale studies indicate that the distinct approach to opinion fusion employed by extreme agents exerts a more potent influence on overall opinion evolution than that of regular agents. The presence of extreme agents in a network tends to undermine the development of opinion neutrality, which is harmful to the guidance of online public opinion. Notably, prior research often overlooks the existence of extreme agents in social networks; moreover, existing research seldom considers the time sunk cost in the evolution of opinions. Building upon this foundation, we introduce a temporal dimension to opinion evolution, integrating the time sunk cost with the opinion evolution process. Furthermore, we devise an agent partitioning method that categorizes agents into four states based on their opinion values: watch state, subjective state, firm state, and extreme state, with extreme-state agents generally expressing radical opinions. We construct an agent network based on the phenomenon of time sunk costs and propose a model for the evolution of extreme opinions in this network. Our study found that information sharing among extreme agents significantly influences the extremization of opinions in various networks. After restricting the exchange of opinions by extreme agents, the number of extreme agents in the network decreased by 40% to 50% compared to the initial situation. Additionally, we discovered that imposing restrictions on extreme agents in the early stages can help increase the possibility of network opinions moving towards neutral positions. When restriction of extreme agents (REA) was performed at the beginning of the experiment rather than midway through, the final number of extreme-state agents decreased by 15.57%. The results show that extreme agents have a great influence on the spread and evolution of extreme opinions on platforms.
Keywords: Time sunk costs, extremists, opinion dynamics, bounded confidence model, social networks, opinion evolution
DOI: 10.3233/IDA-230677
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-20, 2024
Authors: Zhang, Fei | Chan, Patrick P.K. | He, Zhi-Min | Yeung, Daniel S.
Article Type: Research Article
Abstract: A recommender system is susceptible to manipulation through the injection of carefully crafted profiles. Some recent profile identification methods perform well only in specific attack scenarios, while a general attack detection method is usually complicated or requires labeled samples; such methods are prone to overtraining, and the annotation process incurs high expenses. This study proposes an unsupervised divide-and-conquer method to identify attack profiles, utilizing a specifically designed model for each kind of shilling attack. Initially, our method categorizes the profile set into two attack types, namely Standard and Obfuscated Behavior Attacks. Subsequently, profiles are separated into clusters within the extracted feature space based on the identified attack type. The selection of attack profiles is then determined through target item analysis within the suspected cluster. Notably, our method offers the advantage of requiring no prior knowledge or annotation. Furthermore, precision is heightened because the identification method is designed for a specific attack type, employing a less complicated model. The outstanding performance of our model, validated through experimental results on MovieLens-100K and Netflix under various attack settings, demonstrates superior accuracy and reduced running time compared with current detection methods in identifying Standard and Obfuscated Behavior Attacks.
Keywords: PCA, item popularity, shilling attack detection, divide-and-conquer method
DOI: 10.3233/IDA-230575
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-16, 2024
Authors: Tran, Le-Anh | Kwon, Daehyun | Deberneh, Henock Mamo | Park, Dong-Chul
Article Type: Research Article
Abstract: This paper proposes a data clustering algorithm inspired by the prominent convergence property of the Projection onto Convex Sets (POCS) method, termed the POCS-based clustering algorithm. For disjoint convex sets, the simultaneous-projection form of the POCS method can converge to a minimum mean square error solution. Relying on this important property, the proposed POCS-based clustering algorithm treats each data point as a convex set and simultaneously projects the cluster prototypes onto their respective member data points; the projections are convexly combined via adaptive weight values in order to minimize a predefined objective function for data clustering purposes. The performance of the proposed POCS-based clustering algorithm has been verified through large-scale experiments on various datasets. The experimental results show that the proposed POCS-based algorithm is competitive in terms of both effectiveness and efficiency with prevailing clustering approaches such as the K-Means/K-Means++ and Fuzzy C-Means (FCM) algorithms. Based on extensive comparisons and analyses, we can confirm the validity of the proposed POCS-based clustering algorithm for practical purposes.
Keywords: POCS, convex sets, clustering algorithm, unsupervised learning, machine learning
DOI: 10.3233/IDA-230655
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
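The simultaneous-projection update described above can be sketched concretely: projecting a prototype onto the singleton set {x} simply yields x, so one step moves each prototype by a convex combination of the offsets toward its member points. This is a toy reading of the method with illustrative names, not the authors' implementation:

```python
def pocs_update(prototype, members, weights):
    """One simultaneous-projection step of a POCS-style clustering
    update: the prototype moves by a convex combination of its
    projections onto the member points (projection onto {x} is x)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must form a convex combination"
    return [p + sum(w * (x[d] - p) for w, x in zip(weights, members))
            for d, p in enumerate(prototype)]
```

With equal weights this single step coincides with the K-Means centroid update; the adaptive weights are what differentiate the POCS-based method.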
Authors: Huang, Jiaming | Li, Xianyong | Li, Qizhi | Du, Yajun | Fan, Yongquan | Chen, Xiaoliang | Huang, Dong | Wang, Shumin
Article Type: Research Article
Abstract: Emojis in texts provide additional information for sentiment analysis. Previous implicit sentiment analysis models have primarily treated emojis as unique tokens or deleted them directly, thus ignoring the explicit sentiment information inside emojis. Considering the different relationships between emoji descriptions and texts, we propose pre-training Bidirectional Encoder Representations from Transformers (BERT) with emojis (BEMOJI) for Chinese and English sentiment analysis. At the pre-training stage, we pre-train BEMOJI by predicting the emoji descriptions from the corresponding texts via prompt learning. At the fine-tuning stage, we propose a fusion layer that fuses text representations and emoji descriptions into fused representations, which are used to predict text sentiment orientations. Experimental results show that BEMOJI achieves the highest accuracy (91.41% and 93.36%), macro-precision (91.30% and 92.85%), macro-recall (90.66% and 93.65%) and macro-F1-measure (90.95% and 93.15%) on the Chinese and English datasets. The performance of BEMOJI is 29.92% and 24.60% higher than emoji-based methods on average on the Chinese and English datasets, respectively. Meanwhile, the performance of BEMOJI is 3.76% and 5.81% higher than transformer-based methods on average on the Chinese and English datasets, respectively. An ablation study verifies that the emoji descriptions and the fusion layer play a crucial role in BEMOJI. Besides, a robustness study illustrates that BEMOJI achieves results comparable with BERT on four sentiment analysis tasks without emojis, which means BEMOJI is a very robust model. Finally, a case study shows that BEMOJI can output more reasonable emojis than BERT.
Keywords: Pre-trained language model, emoji sentiment analysis, implicit sentiment analysis, prompt learning, multi-feature fusion
DOI: 10.3233/IDA-230864
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
Authors: Noronha, Marta D.M. | Zárate, Luis E.
Article Type: Research Article
Abstract: Characterizing longevity profiles from longitudinal studies is a task with many challenges. Longitudinal databases usually have high dimensionality, and the similarities between long-lived and non-long-lived records make profile characterization a burdensome task. Addressing these issues, in this work we use data from the English Longitudinal Study of Ageing (ELSA-UK) to characterize longevity profiles through data mining. We propose a feature engineering method that reduces data dimensionality through merging techniques, factor analysis, and biclustering, and we apply biclustering to select relevant features that discriminate the two profiles. Two classification models, one based on a decision tree and the other on a random forest, are built from the preprocessed dataset. Experiments show that our methodology can successfully discriminate longevity profiles. We identify insights into the features contributing to individuals being long-lived or non-long-lived. According to the results of both models, the main factor that impacts longevity relates to the correlations between the economic situation and the mobility of the elderly. We suggest that this methodology can be applied to identify longevity profiles from other longitudinal studies, since that factor is deemed relevant for profile classification.
Keywords: Longitudinal data mining, human ageing, biclustering, factor analysis, classification
DOI: 10.3233/IDA-230314
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-24, 2024
Authors: Fan, Zeping | Zhang, Xuejun | Huang, Min | Bu, Zhaohui
Article Type: Research Article
Abstract: The recently introduced Convolution-augmented Transformer (Conformer) model has attained state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of methodical investigations uncovers that the Conformer’s design decisions may not represent the most efficient choices when operating within the constraints of a limited computational budget. After a thorough re-evaluation of the Conformer architecture’s design choices, we propose Sampleformer, which reduces the Conformer architecture’s complexity and delivers more robust performance. We introduce downsampling to the Conformer encoder and, to exploit the information in the speech features, incorporate an additional downsampling module to enhance the efficiency and accuracy of our model. Additionally, we propose a novel and adaptable attention mechanism called multi-group attention, effectively reducing the attention complexity from O(n²d) to O(n²d·f/g). We performed experiments on the AISHELL-1 corpus; our 13.3 million-parameter CTC model demonstrates a 3.0%/2.6% relative reduction in character error rate (CER) on the dev/test sets, all without the use of a language model (LM). Additionally, the model exhibits a 30% improvement in inference speed compared with our CTC Conformer baseline and trains 27% faster.
Keywords: Speech recognition, conformer, attention mechanism, complexity reduction
DOI: 10.3233/IDA-230612
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-13, 2024
Authors: Liu, Xiaoyang | Wu, Yudie | Fiumara, Giacomo | De Meo, Pasquale
Article Type: Research Article
Abstract: Traditional community detection models either ignore the feature space information and require a large amount of domain knowledge to define meta-paths manually, or fail to distinguish the importance of different meta-paths. To overcome these limitations, we propose a novel heterogeneous graph community detection method called KGNN_HCD (heterogeneous graph Community Detection based on a K-nearest neighbor Graph Neural Network). First, a similarity matrix is generated to construct the topological structure of the K-nearest neighbor graph; second, the meta-path information matrix is generated using a meta-path transformation layer (Mp-Trans Layer) with added weighted convolution; finally, a graph convolutional network (GCN) is used to learn high-quality node representations, and the k-means algorithm is applied to the node embeddings to detect the community structure. We perform extensive experiments on three heterogeneous datasets (ACM, DBLP, and IMDB) against 11 community detection methods such as CP-GNN and GTN. The experimental results show that the proposed KGNN_HCD method improves NMI and ARI by 2.54% and 2.56% on the ACM dataset, by 2.59% and 1.47% on the DBLP dataset, and by 1.22% and 1.67% on the IMDB dataset. These findings suggest that the proposed KGNN_HCD method is reasonable and effective, and that KGNN_HCD can be applied to complex network classification and clustering tasks.
Keywords: Heterogeneous graph, meta-path, K-nearest neighbor graph, graph neural network, community detection
DOI: 10.3233/IDA-230356
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-22, 2024
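The first stage of KGNN_HCD, building a K-nearest-neighbor graph from a similarity matrix, can be sketched in plain Python (an illustration only; the paper's construction details may differ):

```python
def knn_graph(similarity, k):
    """Adjacency list of a K-nearest-neighbor graph built from a dense
    similarity matrix (list of lists): each node keeps directed edges to
    its k most similar other nodes."""
    n = len(similarity)
    graph = {}
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: similarity[i][j], reverse=True)
        graph[i] = neighbors[:k]
    return graph
```

The resulting sparse topology is what the GCN then operates on, in place of a fully connected similarity graph.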
Authors: Yuan, Wei | Zhao, Shiyu | Wang, Li | Cai, Lijia | Zhang, Yong
Article Type: Research Article
Abstract: In the post-epidemic era, online learning has gained increasing attention due to advancements in information and big data technology, leading to large-scale online course data covering various student behaviors. Online data mining has become a popular and important way of extracting valuable insights from large amounts of data. However, previous online course analysis methods often focused on individual aspects of the data and neglected the correlations among large-scale learning behavior data, which can lead to an incomplete understanding of the overall learning behavior and patterns within an online course. To solve these problems, this paper proposes an online course evaluation model based on a graph auto-encoder. In our method, the features of the collected online course data are used to construct K-Nearest Neighbor (KNN) graphs that represent the associations among the courses. Then a variational graph auto-encoder (VGAE) is introduced to learn useful implicit features. Finally, we feed the learned implicit features into unsupervised and semi-supervised downstream tasks for online course evaluation. We conduct experiments on two datasets. In the clustering task, our method showed a more than tenfold increase in the Calinski-Harabasz index compared with unoptimized features, demonstrating significant structural distinction and group coherence. In the classification task, compared with traditional methods, our model exhibited an overall performance improvement of about 10%, indicating its effectiveness in handling complex network data.
Keywords: Educational data mining, online course evaluation, deep learning, graph auto-encoder
DOI: 10.3233/IDA-230557
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-23, 2024
Authors: K, Subha | N, Bharathi
Article Type: Research Article
Abstract: In today’s digital era, the generation and sharing of information are rapidly expanding, and the resulting volume of complex data constitutes big data. The proliferation of the internet and smart devices has led to a significant increase in content creators on social media platforms, with YouTube being a prominent example and a primary source of big data. YouTubers face challenges in refining their content strategies because of the growing number of comments on shared videos. Reading through such a large amount of data and extracting viewers’ opinions manually is time-consuming and makes it hard to understand people’s sentiments. To address this, Spark-based machine learning algorithms have emerged as a transformative tool for content creators to understand their audience. The Improved Novel Ensemble Method (INEM) algorithm is designed to predict viewers’ sentiments and emotional responses to content based on the comments they leave. The results provide valuable insights for content creators, helping them refine their strategies to optimize a channel’s revenue and performance. The Fit Tuber channel is analyzed to evaluate the sentiment of user comments.
Keywords: Big data, sentiment analysis, machine learning, social-media, spark
DOI: 10.3233/IDA-240198
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-11, 2024
Authors: Gupta, Ayushi | Chug, Anuradha | Singh, Amit Prakash
Article Type: Research Article
Abstract: PURPOSE: Crop diseases can cause significant reductions in yield, subsequently impacting a country’s economy. The current research is concentrated on detecting diseases in three specific crops – tomatoes, soybeans, and mushrooms, using a real-time dataset collected for tomatoes and two publicly accessible datasets for the other crops. The primary emphasis is on employing datasets with exclusively categorical attributes, which poses a notable challenge to the research community. METHODS: After applying label encoding to the attributes, the datasets undergo four distinct preprocessing techniques to address missing values. Following this, the SMOTE-N technique is employed to tackle class …imbalance. Subsequently, the pre-processed datasets are subjected to classification using three ensemble methods: bagging, boosting, and voting. To further refine the classification process, the metaheuristic Ant Lion Optimizer (ALO) is utilized for hyper-parameter tuning. RESULTS: This comprehensive approach results in the evaluation of twelve distinct models. The top two performers are then subjected to further validation using ten standard categorical datasets. The findings demonstrate that the hybrid model II-SN-OXGB, surpasses all other models as well as the current state-of-the-art in terms of classification accuracy across all thirteen categorical datasets. II utilizes the Random Forest classifier to iteratively impute missing feature values, employing a nearest features strategy. Meanwhile, SMOTE-N (SN) serves as an oversampling technique particularly for categorical attributes, again utilizing nearest neighbors. Optimized (using ALO) Xtreme Gradient Boosting OXGB, sequentially trains multiple decision trees, with each tree correcting errors from its predecessor. CONCLUSION: Consequently, the model II-SN-OXGB emerges as the optimal choice for addressing classification challenges in categorical datasets. 
Applying the II-SN-OXGB model to crop datasets can significantly enhance disease detection, which in turn enables farmers to take timely and appropriate measures to prevent yield losses and mitigate the economic impact of crop diseases.
Keywords: Categorical data, ensemble methods, missing values imputation, metaheuristic optimization, plant disease
DOI: 10.3233/IDA-230651
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-25, 2024
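The abstract above describes a SMOTE-N step for balancing classes in purely categorical data. As an illustration only (not the paper's implementation — imbalanced-learn ships a production version as `SMOTEN`), a minimal sketch of such an oversampler: synthetic minority samples are built as the per-feature mode over a minority sample and its Hamming-nearest minority neighbours. The function name `smoten_like`, the default `k`, and the mode-based tie handling are assumptions of this sketch.

```python
import random
from collections import Counter

def hamming(a, b):
    # Distance between two categorical rows: number of differing features.
    return sum(x != y for x, y in zip(a, b))

def smoten_like(X, y, minority_label, k=3, seed=0):
    """SMOTE-N-style oversampling for purely categorical data.

    Each synthetic sample takes, feature by feature, the mode of the
    values observed in a randomly chosen minority sample and its k
    Hamming-nearest minority neighbours. Returns (X, y) with the
    minority class oversampled to the majority count.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority_label]
    minority = [x for x, lbl in zip(X, y) if lbl == minority_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: hamming(base, m),
        )[:k]
        group = [base] + neighbours
        # Per-feature mode across the neighbourhood.
        synthetic = tuple(
            Counter(col).most_common(1)[0][0] for col in zip(*group)
        )
        X_new.append(synthetic)
        y_new.append(minority_label)
    return X_new, y_new
```

Because every generated value already occurs among the neighbours, the synthetic rows stay inside the categorical domain — the property that distinguishes SMOTE-N from interpolation-based SMOTE for numeric data.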
Authors: Abidi, Mustufa Haider | Khare, Neelu | D., Preethi | Alkhalefah, Hisham | Umer, Usama
Article Type: Research Article
Abstract: The emergence of the novel COVID-19 virus has had a profound impact on global healthcare systems and economies, underscoring the imperative need for the development of precise and expeditious diagnostic tools. Machine learning techniques have emerged as a promising avenue for augmenting the capabilities of medical professionals in disease diagnosis and classification. In this research, the EFS-XGBoost classifier model, a robust approach for the classification of patients afflicted with COVID-19, is proposed. The key innovation in the proposed model lies in the Ensemble-based Feature Selection (EFS) strategy, which enables the judicious selection of relevant features from the expansive COVID-19 dataset. Subsequently, the power of the eXtreme Gradient Boosting (XGBoost) classifier to make precise distinctions among COVID-19-infected patients is harnessed. The EFS methodology amalgamates five distinctive feature selection techniques, encompassing correlation-based, chi-squared, information gain, symmetric uncertainty-based, and gain ratio approaches. To evaluate the effectiveness of the model, comprehensive experiments were conducted using a COVID-19 dataset procured from Kaggle, and the implementation was executed using Python programming. The performance of the proposed EFS-XGBoost model was gauged by employing well-established metrics that measure classification accuracy, including accuracy, precision, recall, and the F1-Score. Furthermore, an in-depth comparative analysis was conducted by considering the performance of the XGBoost classifier under various scenarios: employing all features within the dataset without any feature selection technique, and utilizing each feature selection technique in isolation. The meticulous evaluation reveals that the proposed EFS-XGBoost model excels in performance, achieving an astounding accuracy rate of 99.8%, surpassing the efficacy of other prevailing feature selection techniques.
This research not only advances the field of COVID-19 patient classification but also underscores the potency of ensemble-based feature selection in conjunction with the XGBoost classifier as a formidable tool in the realm of medical diagnosis and classification.
Keywords: COVID-19, machine learning, classification, ensemble-based feature selection, XGBoost
DOI: 10.3233/IDA-230854
Citation: Intelligent Data Analysis, vol. Pre-press, no. Pre-press, pp. 1-18, 2024
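The EFS strategy above amalgamates five filter-based selectors, but the abstract does not state how their scores are combined. As a plausible, minimal sketch only — assuming a Borda-style mean-rank aggregation, which is this illustration's assumption, not the paper's stated method:

```python
def ensemble_select(rankings, top_k):
    """Ensemble-based feature selection by rank aggregation (Borda-style).

    `rankings` maps selector name -> {feature: score} (higher is better).
    Each selector's scores are converted to ranks, the ranks are averaged
    across selectors, and the `top_k` features with the best (lowest)
    mean rank are returned.
    """
    features = next(iter(rankings.values())).keys()
    mean_rank = {}
    for f in features:
        ranks = []
        for scores in rankings.values():
            # Rank of feature f under this selector (0 = best score).
            ordered = sorted(scores, key=scores.get, reverse=True)
            ranks.append(ordered.index(f))
        mean_rank[f] = sum(ranks) / len(ranks)
    return sorted(mean_rank, key=mean_rank.get)[:top_k]
```

In the paper's setting, `rankings` would hold the scores produced by the five filters named in the abstract (correlation-based, chi-squared, information gain, symmetric uncertainty, gain ratio), and the selected subset would then be fed to the XGBoost classifier.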
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
sales@iospress.com
For editorial issues, like the status of your submitted paper or proposals, write to editorial@iospress.nl
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
info@iospress.nl
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office info@iospress.nl
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
china@iospress.cn