Data Science is an interdisciplinary journal that aims to publish novel and effective methods for using scientific data in a principled, well-defined, and reproducible fashion; concrete tools based on these methods; and applications thereof. The ultimate goal is to unleash the power of scientific data to deepen our understanding of physical, biological, and digital systems, to gain insight into human social and economic behavior, and to design new solutions for the future. The rising importance of scientific data, both big and small, brings with it a wealth of challenges in combining structured but often siloed data with messy, incomplete, and unstructured data such as text, audio, visual content, sensor readings, and weblogs. New methods to extract, transport, pool, refine, store, analyze, and visualize data are needed to unleash their power while simultaneously making tools and workflows easier for the public at large to use. The journal invites contributions spanning theoretical and foundational research, platforms, methods, applications, and tools in all of these areas. We welcome papers that add a social, geographical, or temporal dimension to data science research, as well as application-oriented papers that prepare and use data in discovery research.
This journal focuses on methods, infrastructure, and applications around the following core topics:
- scientific data mining, machine learning, and Big Data analytics
- data management, network analysis, and scientific knowledge discovery
- scholarly communication and (semantic) publishing
- research data publication, indexing, quality, and discovery
- data wrangling, integration, provenance
- trend analysis, prediction, and visualization
- crowdsourcing and collaboration
- corroboration, validation, trust, and reproducibility
- scalable computing, analysis, and learning
- smart and semantic web services, executable workflows
- analytics, intelligence, and real-time decision making
- socio-technical systems
- social impacts of data science
Semantic publishing has been defined as anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers. Towards the goal of genuine semantic publishing, where a work may be published with its content and metadata represented in a machine-interpretable semantic notation, this journal will work with a global set of partners to develop standardized methods to ensure that our publications can be seen as a machine-accessible store of knowledge.
An important goal of the journal is to foster an environment in which annotated data are produced and shared with the wider research community. The development and use of data and metadata standards are critical for achieving this goal. Authors should ensure that any data used or produced in their studies are represented using community-based data formats and metadata standards.
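To make the idea of machine-interpretable publication metadata concrete, here is a minimal sketch that expresses a journal-article record as RDF. It assumes the Python rdflib package and the SPAR FaBiO vocabulary; the article URI and field values are hypothetical placeholders rather than a real journal record.

```python
# Minimal sketch: a journal-article record as machine-readable RDF.
# Assumes the rdflib package; URIs and literals below are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

FABIO = Namespace("http://purl.org/spar/fabio/")  # SPAR ontology for publications

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("fabio", FABIO)

article = URIRef("https://example.org/articles/ds-0001")  # hypothetical identifier
g.add((article, RDF.type, FABIO.JournalArticle))
g.add((article, DCTERMS.title, Literal("A Sample Data Science Article")))
g.add((article, DCTERMS.creator, Literal("A. Author")))
g.add((article, DCTERMS.isPartOf, URIRef("https://example.org/journals/data-science")))

# Turtle output can be queried, linked to related articles, and indexed by machines.
print(g.serialize(format="turtle"))
```

A record like this is what lets an article function as part of a machine-accessible store of knowledge: other services can discover it, link it to semantically related work, and integrate its metadata.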
Rapid, Open, Transparent, and Attributed Reviews
The Data Science journal relies on an open and transparent review process. Submitted manuscripts are posted on the journal's website and are publicly available. In addition to solicited reviews from reviewers selected by members of the editorial board, public reviews and comments from any researcher are welcome and can be uploaded via the journal website. All reviews and author responses are posted on the journal homepage, and all involved reviewers and editors are acknowledged in the final published version. While we strongly encourage reviewers to participate in the open and transparent review process, it is still possible to submit anonymous reviews. The names of editors and non-anonymous reviewers will be included in all published articles. The journal aims to complete reviews within 2-4 weeks of submission.
The journal will provide editor and reviewer profiles and metrics (links to ORCID, Google Scholar, etc.).
Abstract: Capturing data in the form of networks is an increasingly popular approach for modeling, analyzing, and visualising complex phenomena in order to understand the important properties of the underlying processes. Access to many large-scale network datasets is restricted due to privacy and security concerns. Moreover, for several applications (such as functional connectivity networks), generating large-scale real data is expensive. For these reasons, there is a growing need for advanced mathematical and statistical models (also called generative models) that can account for the structure of these large-scale networks without having to materialize them in the real world. The objective is to provide a comprehensible description of the network properties and to be able to infer previously unobserved properties. Researchers have developed various models that generate synthetic networks adhering to the structural properties of real networks. However, the selection of the appropriate generative model for a given real-world network remains an important challenge. In this paper, we investigate this problem and provide a novel technique (named TripletFit) for model selection (or network classification) and for estimating the structural similarity of complex networks. The goal of network model selection is to select a generative model that is able to generate a structurally similar synthetic network for a given real-world (target) network. We consider six prominent generative models as the candidate models. Existing model selection methods mostly suffer from sensitivity to network perturbations, dependency on network size, and low accuracy. To overcome these limitations, we considered a broad array of network features, with the aim of representing different structural aspects of the network, and employed deep learning techniques, namely a deep triplet network architecture and a simple feed-forward network, for model selection and estimation of the structural similarity of complex networks. Our proposed method outperforms existing methods with respect to accuracy, noise tolerance, and size independence on a number of gold-standard datasets used in previous studies.
Keywords: Complex networks, deep learning, generative models, model selection
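To illustrate the triplet-embedding idea sketched in the abstract above, the following PyTorch fragment trains a small feed-forward embedder with a triplet margin loss. This is a hedged sketch, not the authors' TripletFit implementation: the feature dimension, layer sizes, and random stand-in data are illustrative assumptions.

```python
# Sketch of a triplet network over graph-feature vectors (not TripletFit itself).
import torch
import torch.nn as nn

class FeatureEmbedder(nn.Module):
    """Feed-forward net mapping a vector of network features to an embedding."""
    def __init__(self, n_features: int = 16, emb_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

embedder = FeatureEmbedder()
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-3)

# Toy batch: anchor and positive come from the same generative model,
# the negative from a different one (random tensors stand in for real features).
anchor, positive, negative = (torch.randn(64, 16) for _ in range(3))

optimizer.zero_grad()
loss = loss_fn(embedder(anchor), embedder(positive), embedder(negative))
loss.backward()
optimizer.step()
```

At selection time, the target network's feature vector would be embedded and compared against embeddings of synthetic networks from each candidate model, choosing the model whose networks lie nearest in the learned space.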
Abstract: Experimenting with different models, documenting results and findings, and repeating these tasks are day-to-day activities for machine learning engineers and data scientists, so they need to keep control of the machine learning pipeline and its metadata. Doing so allows users to iterate quickly through experiments and to retrieve key findings and observations from historical activity. This is the need that Arangopipe serves. Arangopipe is an open-source tool that provides a data model capturing the essential components of any machine learning life cycle, together with an application programming interface that permits machine learning engineers to record the details of the salient steps in building their machine learning models. The components of the data model and an overview of the application programming interface are provided, along with illustrative examples of basic and advanced machine learning workflows. Arangopipe is useful not only for users developing machine learning models but also for those deploying and maintaining them.
Keywords: Machine learning pipelines, reproducibility, data lineage, machine learning meta-data
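As a rough illustration of the kind of pipeline-metadata capture the abstract describes, the sketch below records the dataset, parameters, and metrics of each training run in an append-only log. It deliberately does not reproduce Arangopipe's actual API or its graph-backed data model; every class and method name here is a hypothetical stand-in.

```python
# Generic ML-run metadata logging (hypothetical API, not Arangopipe's).
from dataclasses import dataclass, field
import json
import time
import uuid

@dataclass
class RunRecord:
    """One training run: the dataset, the parameters, and resulting metrics."""
    dataset: str
    params: dict
    metrics: dict = field(default_factory=dict)
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

class ExperimentLog:
    """Append-only store so historical runs can be retrieved and compared."""
    def __init__(self, path: str = "runs.jsonl"):
        self.path = path

    def log(self, record: RunRecord) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(record.__dict__) + "\n")

log = ExperimentLog()
run = RunRecord(dataset="housing-v1", params={"max_depth": 6, "lr": 0.1})
run.metrics["rmse"] = 3.2  # hypothetical evaluation result
log.log(run)
```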
Abstract: The General Data Protection Regulation (GDPR) grants all natural persons the right to access their personal data if it is being processed by data controllers. Data controllers are obliged to share the data in an electronic format and often provide it in a so-called Data Download Package (DDP). These DDPs contain all the data collected by public and private entities during the course of a citizen's digital life and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed a de-identification algorithm that is able to handle the typical characteristics of DDPs, including regularly changing file structures, visual and textual content, differing file formats, and private information such as usernames. We investigate the performance of the algorithm and illustrate how it can be tailored towards specific DDP structures.
Keywords: Data Download Package, Instagram, de-identification, anonymization, pseudonymization
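One common ingredient of such a de-identification algorithm is rule-based replacement of identifying strings in the textual files of a DDP. The sketch below shows that step only; the patterns, placeholder scheme, and directory walk are assumptions for illustration and omit the visual-content handling the paper describes.

```python
# Hedged sketch: regex-based de-identification of text files in a DDP.
import re
from pathlib import Path

# Illustrative patterns; a real DDP pipeline would need far more coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "username": re.compile(r"@\w{2,30}"),        # e.g. Instagram-style handles
    "phone": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def deidentify(text: str) -> str:
    """Replace each match with a category placeholder; pseudonymization would
    instead map each unique value to a stable pseudonym."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text

def process_package(root: str) -> None:
    # Walk the (possibly irregular) file structure and rewrite text files in place.
    for path in Path(root).rglob("*.txt"):
        path.write_text(deidentify(path.read_text(encoding="utf-8")),
                        encoding="utf-8")
```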
Abstract: Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, a number of issues pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining (PPDDM) techniques aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques drawn from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems the techniques address, and identify the outstanding challenges in the field. The review identifies the consequences of the lack of standard criteria for evaluating new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors. We discuss the ambiguous definitions of privacy and the confusion between privacy and security in the field, and provide suggestions on how to formulate a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and of the importance of involving legal, ethical, and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.
Keywords: Survey, data mining, privacy preserving, distributed learning
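To make one PPDDM building block concrete, the sketch below implements additive secret sharing, which lets multiple parties compute a joint sum (for example, of counts or gradient components) without any party revealing its private value. The modulus and party count are illustrative assumptions, and real protocols add authenticated channels and dropout handling.

```python
# Additive secret sharing: compute a global sum without revealing local values.
import random

MOD = 2**61 - 1  # large prime modulus for the share arithmetic

def share(value: int, n_parties: int) -> list[int]:
    """Split a value into n additive shares; any n-1 shares reveal nothing."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Each party holds one private value and distributes shares to the others.
private_values = [12, 7, 30]
all_shares = [share(v, len(private_values)) for v in private_values]

# Each party sums the shares it received (one column), publishing only
# its partial sum; combining the partials reveals just the total.
partials = [sum(col) % MOD for col in zip(*all_shares)]
total = sum(partials) % MOD
assert total == sum(private_values) % MOD
print(total)  # -> 49
```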