Education & Training

  • Ph.D. Ongoing

    Ph.D. in Computing Science

    Newcastle University School of Computing Science, UK

  • MSc 2012

    Master's Degree in Computer Engineering

    Università degli Studi di Padova, Italy

  • BSc 2009

    Bachelor's Degree in Computer Engineering

    Università degli Studi di Padova, Italy

Heterogeneous Ensembles for the Missing Feature Problem

L. Nanni, S. Brahnam, C. Fantozzi and N. Lazzarini
Conference Papers 2013 Annual Meeting of the Northeast Decision Sciences Institute, Pages 523-535

Abstract

Missing values are ubiquitous in real-world datasets. In this work, we show how to handle them with heterogeneous ensembles of classifiers that outperform state-of-the-art solutions. Several approaches are compared using several different datasets. Some state-of-the-art classifiers (e.g. SVM and RotBoost) are first tested, coupled with the Expectation-Maximization (EM) imputation method. Then, the classifiers are combined to build ensembles. Using the Wilcoxon signed-rank test (rejecting the null hypothesis at a significance level of 0.05), we show that our best heterogeneous ensemble, obtained by combining a forest of decision trees (a method that does not require any dataset-specific tuning) with a cluster-based imputation method, outperforms two dataset-tuned solutions based on LibSVM, the most used SVM toolbox in the world: a stand-alone SVM classifier and a random subspace of SVMs. The same ensemble also exhibits better performance than a recent cluster-based imputation method for handling missing values, which has been shown to outperform several state-of-the-art imputation approaches, when both the training set and the test set contain 10% or 25% of missing values. MATLAB code of several tested descriptors and the tested datasets are available at http://www.dei.unipd.it/wdyn/?IDsezione=3314&IDgruppo_pass=124&preview=

Heterogeneous Machine Learning System for Diagnosing Primary Aldosteronism

N. Lazzarini, L. Nanni, C. Fantozzi, A. Pietracaprina, G. Pucci, G.P. Rossi and T. Seccia
Journal Paper Journal of Hypertension, Volume 31, June 2013, Page e409

Abstract

The identification of Primary Aldosteronism (PA) is challenging in that this common cause of curable hypertension often mimics primary (essential) hypertension. Hence, we aimed to develop an improved machine learning method for the classification of PA. Since missing values and collinearity are common problems in real-world data, the new method was required to deal with such issues.

Coupling different methods for overcoming the class imbalance problem

L. Nanni, C. Fantozzi and N. Lazzarini
Journal Paper Neurocomputing, Volume 158, June 2015, Pages 48-66

Abstract

Many classification problems must deal with imbalanced datasets where one class (the majority class) outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.

Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example

J. Bacardit, P. Widera, N. Lazzarini and N. Krasnogor.
Journal Paper Big Data, Volume 2, Issue 3, September 2014, Pages 164-176

Abstract

Data mining and knowledge discovery techniques have greatly progressed in the last decade. They are now able to handle larger and larger datasets, process heterogeneous information, integrate complex metadata, and extract and visualize new knowledge. Often these advances were driven by new challenges arising from real-world domains, with biology and biotechnology a prime source of diverse and hard (e.g., high volume, high throughput, high variety, and high noise) data analytics problems. The aim of this article is to show the broad spectrum of data mining tasks and challenges present in biological data, and how these challenges have driven us over the years to design new data mining and knowledge discovery procedures for biodata. This is illustrated with the help of two kinds of case studies. The first kind is focused on the field of protein structure prediction, where we have contributed in several areas: by designing, through regression, functions that can distinguish between good and bad models of a protein's predicted structure; by creating new measures to characterize aspects of a protein's structure associated with individual positions in a protein's sequence, measures containing information that might be useful for protein structure prediction; and by creating accurate estimators of these structural aspects. The second kind of case study is focused on omics data analytics, a class of biological data characterized by extremely high dimensionality. Our methods were able not only to generate very accurate classification models, but also to discover new biological knowledge that was later ratified by experimentalists.
Finally, we describe several strategies to tightly integrate knowledge extraction and data mining in order to create a new class of biodata mining algorithms that can natively embrace the complexity of biological data, efficiently generate accurate information in the form of classification/regression models, and extract valuable new knowledge. Thus, a complete data-to-information-to-knowledge pipeline is presented.

Heterogeneous machine learning system for improving the diagnosis of primary aldosteronism

N. Lazzarini, L. Nanni, C. Fantozzi, A. Pietracaprina, G. Pucci, M.T. Seccia and G.P. Rossi
Journal Paper Pattern Recognition Letters, July 2015

Abstract

We develop a novel classifier for the diagnosis of Aldosterone-Producing Adenoma (APA), which induces Primary Aldosteronism, the most common endocrine cause of curable hypertension. The classifier considerably improves upon the state of the art, and it is devised and tested on a large dataset of patients, each described by several demographic and biochemical features. As customary in real-world datasets, ours is affected by feature correlation, missing values, and class imbalance. We make explicit provisions for dealing with all of these issues through an ensemble of ensembles, that is, a multilevel fusion of different component classifiers. Using the Wilcoxon signed-rank test at 0.05 significance level, we show that our ensemble significantly outperforms the state-of-the-art classifier and any individual component in the ensemble. Our experiments employ a “leave-one-out-clinical” cross validation as patients were treated in 15 different specialized centers for hypertension; in each fold, 14 centers are used for training and 1 as the test set. Our classifier is available at http://www.dei.unipd.it/node/2357 (MATLAB code).
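
The center-wise cross validation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function name `leave_one_center_out` and the toy center labels are assumptions made purely for illustration. Each fold holds out every patient from one center and trains on the patients from all remaining centers.

```python
from collections import defaultdict


def leave_one_center_out(centers):
    """Yield (held_out_center, train_idx, test_idx) once per distinct center.

    `centers` is a sequence giving the center label of each patient
    (hypothetical representation, chosen for this sketch).
    """
    # Group patient indices by the center that treated them.
    by_center = defaultdict(list)
    for idx, center in enumerate(centers):
        by_center[center].append(idx)

    # One fold per center: that center is the test set, the rest is training.
    for held_out in sorted(by_center):
        test_idx = by_center[held_out]
        train_idx = sorted(
            i for c, members in by_center.items() if c != held_out for i in members
        )
        yield held_out, train_idx, test_idx


# Toy example: 6 patients from 3 hypothetical centers.
centers = ["A", "B", "A", "C", "B", "C"]
folds = list(leave_one_center_out(centers))
```

With 15 centers, as in the abstract, the same function would produce 15 folds, each trained on 14 centers and tested on the remaining one.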

Functional networks inference from rule-based machine learning models

N. Lazzarini, P. Widera, S. Williamson, R. Heer, N. Krasnogor and J. Bacardit
Journal Paper BioData Mining, Volume 9, Issue 28, September 2016

Abstract

Background: Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes that are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables that are often different from, or complementary to, those found by similarity-based methods.
Results: We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes that genes used together within a rule-based machine learning model to classify the samples might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in the context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process.
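
The central assumption stated in this abstract, that genes used together in a rule may be functionally related, can be illustrated with a small Python sketch. This is not the FuNeL implementation: the rule representation (each rule reduced to the set of gene names it tests), the function name `cooccurrence_network`, and the gene identifiers are hypothetical, and the published protocol involves further steps that are not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(rules):
    """Count, for every pair of genes, how many rules use both of them.

    `rules` is an iterable of collections of gene names (an assumed,
    simplified stand-in for a rule-based model). The result maps an
    alphabetically ordered gene pair to its co-occurrence count, which
    can be read as a weighted, undirected edge list.
    """
    edges = Counter()
    for rule in rules:
        # Every unordered pair of distinct genes in the same rule adds one edge count.
        for gene_a, gene_b in combinations(sorted(set(rule)), 2):
            edges[(gene_a, gene_b)] += 1
    return edges


# Toy rule set: two rules share G1 and G2; the last rule uses a single gene.
rules = [{"G1", "G2"}, {"G2", "G3", "G1"}, {"G4"}]
network = cooccurrence_network(rules)
```

A rule that tests a single gene contributes no edges, so such a gene would appear in the network only if another rule pairs it with a second gene.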

Prediction of human population responses to toxic compounds by a collaborative competition

F. Eduati, L. Mangravite, T. Wang, H. Tang, J.C. Bare, R. Huang, T. Norman, M. Kellen, M.P. Menden, J. Yang, X. Zhan, R. Zhong, G. Xiao, M. Xia, N. Abdo, O. Kosyk, F. Eduati, S. Friend, G. Stolovitzky, A. Dearry et al.
Journal Paper Nature Biotechnology, August 2015

Abstract

The ability to computationally predict the effects of toxic compounds on humans could help address the deficiencies of current chemical safety testing. Here, we report the results from a community-based DREAM challenge to predict toxicities of environmental compounds with potential adverse health effects for human populations. We measured the cytotoxicity of 156 compounds in 884 lymphoblastoid cell lines for which genotype and transcriptional data are available as part of the Tox21 1000 Genomes Project. The challenge participants developed algorithms to predict interindividual variability of toxic response from genomic profiles and population-level cytotoxicity data from structural attributes of the compounds. 179 submitted predictions were evaluated against an experimental data set to which participants were blinded. Individual cytotoxicity predictions were better than random, with modest correlations (Pearson's r < 0.28), consistent with complex trait genomic prediction. In contrast, predictions of population-level response to different compounds were higher (r < 0.66). The results highlight the possibility of predicting health risks associated with unknown compounds, although risk estimation accuracy remains suboptimal.

Current Teaching

  • Present 1995

    Preclinical Endodontics

  • Present 2003

    SELC 8160 Molar Endodontic Selective

  • Present 2010

    Endodontics Postdoctoral AEGD Program

Teaching History

  • 1997 1995

    Preclinical Endodontics

  • 2005 2003

    SELC 8160 Molar Endodontic Selective

  • 2011 2010

    Endodontics Postdoctoral AEGD Program
