In the field of immunology, we are just beginning to explore repurposing public datasets to build our knowledge base, gain insight into new discoveries, and generate data-driven hypotheses that were not formulated in the original studies. With increasing awareness of the importance of sharing research data and findings, however, comes the need to show how best to leverage shared datasets across research domains. This course will showcase major efforts in the meta-analysis of open immunological data. Participants will gain a clear understanding of recent trends in conducting data-driven science.
There are no strict prerequisites; some familiarity with clinical trials and high-throughput technologies, and an interest in ‘big data’ analysis, will be helpful.
Sanchita Bhattacharya is a Bioinformatics project team leader at the Bakar Institute of Computational Health Sciences at the University of California, San Francisco, and scientific program director for ImmPort, a National Institute of Allergy and Infectious Diseases Division of Allergy, Immunology, and Transplantation (NIAID-DAIT) sponsored shared data repository for subject-level human immunology study data and protocols. She is currently leading efforts to leverage open-access clinical trials and immunology research studies. Her projects include “The 10,000 Immunomes Project”, a diverse human immunology reference derived from over 44,000 individuals across 291 studies. Sanchita brings twenty years of work experience as a data scientist at academic institutions such as Stanford School of Medicine, Lawrence Berkeley National Laboratory, and MIT. Her formal training in bioinformatics, coupled with expertise in computational modeling and immunology, has led to a number of publications demonstrating the repurposing of big data in immunology and other research areas to facilitate translational research. Recently, her team has embarked on a deep learning project to better understand high-throughput single-cell cytometry data using convolutional neural networks for applications in clinical immunology.
Learning from imbalanced data is pervasive across applications: the class(es) of interest have far fewer instances than the others, and this under-representation presents challenges from learning through evaluation. Standard learning algorithms can be biased towards the larger classes and thus under-perform on the actual class of interest. For example, consider a fraud detection problem where non-fraudulent instances make up 99% of the data and fraudulent instances make up 1%. When a classifier is presented with such a skewed data distribution, it tends to favor the larger class and becomes biased towards making non-fraud predictions at the expense of fraud predictions, when the goal is actually to catch fraudulent activity. This also exposes the weakness of accuracy as a metric, since always predicting the non-fraud class produces a classifier that is 99% accurate yet catches no fraud. The problem is further compounded in the presence of streaming data associated with changing distributions, or concept drift. In this tutorial, I will provide an introduction to learning from imbalanced data and concept drift; give an overview of popular methods, from sampling to learning algorithms, for class imbalance as well as for class imbalance in the presence of concept drift; discuss how to effectively evaluate the performance of learning algorithms; and cover the challenges associated with big data and the interface of deep learning and class imbalance. I will also include a perspective on different applications.
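The fraud example can be made concrete with a few lines of Python. The following is a minimal sketch, not material from the tutorial: the data are synthetic (990 non-fraud, 10 fraud), the helper names are hypothetical, and balanced accuracy is used here only as one commonly cited alternative to plain accuracy.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives (fraud cases) that were caught."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of the per-class recalls, so each class counts equally."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


# Synthetic, illustrative data: 99% non-fraud (0) and 1% fraud (1).
y_true = [0] * 990 + [1] * 10
# A "classifier" that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f} balanced_accuracy={bal:.2f}")
```

The majority-class predictor scores high on accuracy while its recall on the fraud class is zero, which is why class-sensitive metrics are usually preferred for imbalanced problems.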
Introductory course in machine learning or data science or data mining
Nitesh Chawla is the Frank M. Freimann Professor of Computer Science and Engineering, and Director of the Center on Network and Data Sciences (CNDS) at the University of Notre Dame. His research is focused on machine learning, AI and network science fundamentals and interdisciplinary applications that advance the common good. He is also the recipient of several awards and honors including the Best Paper Award and Nominations, Outstanding Dissertation Award, NIPS Classification Challenge Award, IEEE CIS Outstanding Early Career Award, the IBM Watson Faculty Award, the IBM Big Data and Analytics Faculty Award, the National Academy of Engineering New Faculty Fellowship, and 1st Source Bank Technology Commercialization Award. In recognition of the societal impact of his research, he was recognized with the Rodney F. Ganey Award. He is co-founder of Aunalytics, a data science software and solutions company.
The rise of Bitcoin and other peer-to-peer cryptocurrencies has opened many interesting and challenging problems in cryptography, distributed systems, and databases. The main underlying data structure is blockchain, a scalable fully replicated structure that is shared among all participants and guarantees a consistent view of all user transactions by all participants in the system. In this course, we discuss the basic protocols used in blockchain, and elaborate on its main advantages and limitations. To overcome these limitations, we provide the necessary distributed systems background in managing large scale fully replicated ledgers, using Byzantine Agreement protocols to solve the consensus problem. Finally, we expound on some of the most recent proposals to design scalable and efficient blockchains in both permissionless and permissioned settings. The focus of the tutorial is on the distributed systems and database aspects of the recent innovations in blockchains.
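To illustrate the underlying data structure, here is a minimal sketch of a hash-linked ledger in Python. It is an assumption-laden toy, not any production protocol: the block layout, the function names, and the transaction strings are all hypothetical, and it deliberately omits replication, consensus (for example Byzantine agreement or proof of work), and signatures, which are the topics the course itself addresses.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, transactions):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "transactions": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transactions):
    """Append a block that commits to the hash of its predecessor."""
    index = len(chain)
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    block = {
        "index": index,
        "prev_hash": prev_hash,
        "transactions": list(transactions),
        "hash": block_hash(index, prev_hash, list(transactions)),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Check that every block's hash and back-pointer are consistent."""
    expected_prev = GENESIS_PREV
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["prev_hash"], block["transactions"])
        if block["hash"] != recomputed:
            return False
        expected_prev = block["hash"]
    return True


ledger = []
append_block(ledger, ["alice->bob:5"])
append_block(ledger, ["bob->carol:2"])
valid_before = is_valid(ledger)

# Tampering with an early block breaks the chain of hashes.
ledger[0]["transactions"] = ["alice->bob:500"]
valid_after = is_valid(ledger)
print(valid_before, valid_after)
```

Because each block commits to the hash of the previous one, changing any earlier transaction invalidates the chain from that point on; agreeing on which valid chain is authoritative is the separate consensus problem.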
Basic knowledge of data structures and operating systems.
Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. His research interests are in the fields of fault-tolerant distributed systems and databases, focusing recently on cloud data management and blockchain-based systems. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including The VLDB Journal, IEEE Transactions on Computers, and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student Sudipto Das received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.
These three lectures will explain the essential theory and practice of neural networks as used in current applications.
The lectures will be less mathematical than https://mitpress.mit.edu/books/deep-learning but more mathematical than https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/. Both these books are highly recommended.
Mathematics at the level of an undergraduate degree in computer science: basic multivariate calculus, probability theory, and linear algebra.
Charles Elkan is the global head of machine learning and a managing director at Goldman Sachs in New York. He is also an adjunct professor of computer science at the University of California, San Diego (UCSD). From 2014 to 2018 he was the first Amazon Fellow, leading a team of over 30 scientists and engineers in Seattle, Palo Alto, and New York doing research and development in applied machine learning in both e-commerce and cloud computing. Before joining Amazon, he was a full-time professor of computer science at UCSD. His Ph.D. is from Cornell in computer science, and his undergraduate degree is from Cambridge in mathematics. For publications, see https://scholar.google.com/citations?user=im5aMngAAAAJ&hl=en
Real-world big data are largely unstructured, interconnected, and dynamic, in the form of natural language text. It is highly desirable to transform such massive unstructured data into structured knowledge. Many researchers rely on labor-intensive labeling and curation to extract knowledge from such data. However, such approaches may not be scalable, especially considering that many text corpora are highly dynamic and domain-specific. We argue that massive text data itself may disclose a large body of hidden patterns, structures, and knowledge. Equipped with domain-independent and domain-dependent knowledge bases, we should explore the power of massive data itself for turning unstructured data into structured knowledge. Moreover, by organizing massive text documents into multidimensional text cubes, we show that structured knowledge can be extracted and used effectively. In this talk, we introduce a set of methods developed recently in our group for such an exploration, including mining quality phrases, entity recognition and typing, multi-faceted taxonomy construction, and construction and exploration of multidimensional text cubes. We show that a data-driven approach could be a promising direction for transforming massive text data into structured knowledge.
• Familiarity with elementary knowledge about machine learning, data mining and natural language processing.
Jiawei Han is the Michael Aiken Chair Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research covers data mining, information network analysis, database systems, and data warehousing, with over 900 journal and conference publications. He has chaired or served on many program committees of international data mining and database conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (2008-2011), the Director of Information Network Academic Research Center supported by U.S. Army Research Lab (2009-2016), and the co-Director of KnowEnG, an NIH funded Center of Excellence in Big Data Computing since 2014. He is a Fellow of ACM and a Fellow of IEEE. He received the ACM SIGKDD Innovations Award (2004), IEEE Computer Society Technical Achievement Award (2005), IEEE W. Wallace McDowell Award (2009), and Japan's Funai Achievement Award (2018), and was named Michael Aiken Chair Professor at the University of Illinois (2019).
Microorganisms are abundant in nature and play a vital role in ecosystems. The human body, as one such host, harbors thousands of microbial populations. In the past few years, biological sequencing technologies have matured and many sequencing experiments have been conducted. The resulting mass of high-throughput sequencing data poses a challenge for identifying the crucial roles of microbes in their hosts. In this course, we will introduce the background of the microbiome and its implications. Then we will introduce several aspects of big microbiome data mining and analysis. First, we will talk about computational methods for visualizing microbiome data, for example dimension reduction and visualization methods including t-SNE and principal coordinate analysis. Furthermore, we will discuss computational methods for the classification of microbiome samples. These include nonnegative matrix factorization, random forests, and deep learning methods, as well as feature selection and biomarker discovery for microbe-related diseases. We will also discuss computational techniques for constructing and analyzing microbial interactions from microbial profile data.
Included in the slides (PPTs) presented at the lecture.
Xiaohua Tony Hu (Ph.D., 1995) is a full professor and the founding director of the data mining and bioinformatics lab at the College of Computing and Informatics (the former College of Information Science and Technology, one of the best information science schools in the USA, ranked #1 in 1999 and #6 in 2010 in information systems by U.S. News & World Report). He is also serving as the founding Co-Director of the NSF Center (I/U CRC) on Visual and Decision Informatics (NSF CVDI), IEEE Computer Society Bioinformatics and Biomedicine Steering Committee Chair, and IEEE Computer Society Big Data Steering Committee Chair. Tony is a scientist, teacher and entrepreneur. He joined Drexel University in 2002. He founded the International Journal of Data Mining and Bioinformatics (SCI indexed) in 2006. Earlier, he worked as a research scientist in world-leading R&D centers such as Nortel Research Center and Verizon Lab (the former GTE labs). In 2001, he founded DMW Software in Silicon Valley, California. He has extensive experience and expertise in converting original ideas into research prototypes and eventually into commercial products; many of his research ideas have been integrated into commercial products and applications in data mining, fraud detection, and database marketing.
Tony’s current research interests are in big data, data/text/web mining, bioinformatics, information retrieval and information extraction, social network analysis, and healthcare informatics. He has published more than 280 peer-reviewed research papers (with more than 20,000 citations on Google Scholar) in various journals, conferences and books such as various IEEE/ACM Transactions (IEEE/ACM TCBB, IEEE TFS, IEEE TKDE, IEEE TITB, IEEE SMC, IEEE Computer, IEEE NanoBioScience, IEEE Intelligent Systems), JIS, KAIS, CI, DKE, IJBRA, SIG KDD, IEEE ICDM, IEEE ICDE, SIGIR, ACM CIKM, IEEE BIBE, and IEEE CICBC, and has co-edited 20 books/proceedings. He has received several prestigious awards including the 2005 National Science Foundation (NSF) Career award, the best paper award at the 2007 International Conference on Artificial Intelligence, the best paper award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, the 2010 IEEE Granular Computing Outstanding Contribution Award, the 2007 IEEE Bioinformatics and Bioengineering Outstanding Contribution Award, the 2006 IEEE Granular Computing Outstanding Service Award, and the 2001 IEEE Data Mining Outstanding Service Award. He has also served as a program co-chair/conference co-chair of 14 international conferences/workshops and a program committee member in more than 80 international conferences in the above areas. He is the founding editor-in-chief of the International Journal of Data Mining and Bioinformatics (SCI indexed) and the International Journal of Granular Computing, Rough Sets and Intelligent Systems, and an associate editor/editorial board member of four international journals (KAIS, IJDWM, IJSOI and JCIB). His research projects are funded by the National Science Foundation (NSF), the US Dept. of Education, the PA Dept. of Health, and industry labs. He has obtained more than US$9.0 million in research grants in the past 12 years as PI or Co-PI (PI of 9 NSF grants, PI of 1 IMLS grant in the last 10 years).
He has graduated 24 Ph.D. students from 2006 to 2019 and is currently supervising 6 Ph.D. students.
Tony is the founding Co-Director of the NSF Center for Visual and Decision Informatics (NSF CVDI). There are about 60 such centers nationwide supported by the NSF Industry/University Cooperative Research Center program across all the research disciplines covered by NSF. The CVDI is the “National Center of Excellence” for dealing with Big Data challenges. The center is funded by NSF, its members from industry and government, and university matching funds. The current industry members and government agencies associated with Drexel University are: Children's Hospital of Philadelphia, Elsevier, Institute of Museum and Library Services, Johnson & Johnson, Microsoft Research, Penn Dept. of Health, Thomson Reuters, SunGuard LLP, Lockheed Martin, IMS Healthcare and SOI. The CVDI serves to drive continuous innovation through knowledge sharing among partners, leading to the invention and commercialization of information and knowledge engineering technologies for decision support. It researches and develops next-generation data mining, visual and decision support tools and techniques that enable decision makers in government and industry to fundamentally improve the way their organization’s information is interpreted and analyzed. Currently, about 30 faculty and staff from Drexel University and the University of Louisiana at Lafayette, and 12 industry companies/government agencies, are participating in various research projects.
Tony has 8 years of solid industry R&D experience and has converted many original research ideas into research prototype systems and eventually into commercial products. In his Ph.D. thesis (1995, University of Regina) entitled "Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach", he introduced rough set theory to data mining research, developed an attribute-oriented rough set approach for data mining, and designed a research prototype system, DBROUGH, which was later successfully transferred to industry in Canada. From 1994 to 1998, he was a research scientist in data mining at Nortel Network Research Center, GTE Labs (Verizon Labs) etc. He worked on many data mining related projects for real-time telephone switch system diagnosis, data management, and wireless churn prediction. Among them, the CHAMP (CHurn Analysis, Modeling and Prediction) project was nominated for GTE’s highest technical achievement award in 1997. From 1998 to 2002, he designed and developed commercial data mining software in various start-up companies (KSP, Blue Martini Software); KSP was acquired by Exchange Applications for $52 million in April 2000. He has successfully deployed several data mining products/systems to Fortune 100 companies such as Chase, Citibank, and Sprint for credit fraud detection, e-personalization and customer management systems.
There is a tremendous amount of data spread across the web and stored in databases that can be turned into an integrated semantic network of data, called a knowledge graph. Knowledge graphs have been applied to a variety of challenging real-world problems including combating human trafficking, finding illegal arms sales in online marketplaces, and identifying threats in space. However, exploiting the available data to build knowledge graphs is difficult due to the heterogeneity of the sources, the scale of the data, and noise in the data. In this course I will review the techniques for building knowledge graphs, including extracting data from online sources, aligning the data to a common terminology, linking the data across sources, and representing knowledge graphs and querying them at scale.
►Part 1: Knowledge graphs
Web data extraction
►Part 2: Source alignment
►Part 3: Representing and querying knowledge graphs
Existing knowledge graphs to reuse
Background in computer science and some basic knowledge of AI, machine learning, and databases will be helpful, but not required.
Craig Knoblock is a Research Professor of both Computer Science and Spatial Sciences at the University of Southern California (USC), Keston Executive Director of the USC Information Sciences Institute, and Director of the Data Science Program at USC. He received his Bachelor of Science degree from Syracuse University and his Master’s and Ph.D. from Carnegie Mellon University in computer science. His research focuses on techniques for describing, acquiring, and exploiting the semantics of data. He has worked extensively on source modeling, schema and ontology alignment, entity and record linkage, data cleaning and normalization, extracting data from the Web, and combining all of these techniques to build knowledge graphs. He has published more than 300 journal articles, book chapters, and conference papers on these topics and has received 7 best paper awards on this work. Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Fellow of the Association for Computing Machinery (ACM), past President and Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and winner of the 2014 Robert S. Engelmore Award.
Harrison Distinguished Professor of Molecular Physiology and Biological Physics, University of Virginia. Development of methods for structural biology, in particular macromolecular structure determination by protein crystallography. Data management in structural biology, data mining as applied to drug discovery, bioinformatics. Member of Center of Structural Genomics of Infectious Diseases. Former Member of Midwest Center for Structural Genomics, New York Center for Structural Genomics and Enzyme Function Initiative.
At present, we are going through the fourth industrial revolution, fueled by data, which has been termed the 'new oil'. It is easy to find examples where insight drawn from large data sets is driving growth in different areas of industry. For example, large corpora of text available in many different languages have led to breakthroughs in Natural Language Processing tasks such as neural machine translation, sentiment analysis, text summarization, question answering, and picture and video captioning. Like these huge text corpora, we now have many software repositories with billions of lines of code easily available on platforms such as GitHub and Bitbucket. We can call this data 'Big Code', since it has all four Vs (volume, variety, velocity, and veracity) of Big Data. These software repositories not only contain billions of lines of code but also associated natural-language text such as class, method and variable names, comments, docstrings, version history, and bug reports. This huge volume of data makes a good case for applying machine learning to source code and drawing insight for better software development, although it also brings unique challenges, such as how to model source code. In this course, I will discuss a framework for machine learning on source code. I will also present a set of use cases where this has been applied. I will introduce a set of tools and techniques to build an end-to-end inference pipeline for applying machine learning to source code, from data acquisition, cleaning, and feature engineering to modelling and inference.
Dr. Jayanti Prasad received his Ph.D. in Physics (Astrophysics) from the Harish-Chandra Research Institute, Allahabad, India, and was a Post Doctoral Fellow at the National Centre for Radio Astrophysics (NCRA), Pune, India, and the Inter-University Center for Astronomy and Astrophysics (IUCAA), Pune, India. Dr. Prasad has been a recipient of international and national research grants and has published more than 100 research papers, both in small groups and in large collaborations. He was a member of the LIGO Scientific Collaboration for six years, during which the collaboration made major discoveries such as the first detection of gravitational waves. Dr. Prasad also worked as a consultant for the LIGO Data Grid Center at the IUCAA. Dr. Prasad has worked on computing- and data-intensive problems in astronomy and astrophysics such as galaxy formation and clustering, cosmological N-body simulations, radio astronomy data processing, cosmic microwave background radiation, and gravitational waves. Some of his work on problems such as Particle Swarm Optimization and Maximum Entropy Deconvolution has been well received outside the astronomy community as well. Recently Dr. Prasad has become very interested in data science, machine learning and software engineering, and is working as a data scientist for a software company in India.
Recommender systems learn the preferences of their users and predict items that they will like. Recommender engines are deployed in many services today across various domains and are aimed at helping users locate the exact items they need or like.
In this course we will cover the fundamental algorithms in recommender systems, from collaborative filtering, content-based methods, and matrix factorization through to advanced deep learning techniques that were recently developed to enhance these basic approaches. We will discuss the challenges that the field is facing, such as sparsity, cold start, scalability, and reliable evaluation and measurement of recommender performance, and show how these challenges may be addressed with recent developments.
• Types of recommenders
• Applying deep learning to recommendation
Applying deep learning algorithms to the different types of recommenders
• Real-world challenges and solutions in recommender systems
• Evaluation and measurement of recommender systems
• Hands-on practice in building recommender systems
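As a flavor of the simplest technique named in the course description, the following is a minimal sketch of user-based collaborative filtering in Python. It is an illustration under stated assumptions rather than the course's own material: the ratings are invented toy data, the function names are hypothetical, and cosine similarity with a similarity-weighted average is only one of several common choices.

```python
import math

# Toy, invented ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "D": 2},
    "cho": {"A": 1, "C": 5, "D": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, data):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item, which is a
    small-scale view of the cold-start problem.
    """
    num = 0.0
    den = 0.0
    for other, other_ratings in data.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(data[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


pred_d = predict("ana", "D", ratings)   # item rated by ben and cho
pred_z = predict("ana", "Z", ratings)   # item nobody has rated
print(pred_d, pred_z)
```

Here "ana" rates items much like "ben" does, so the prediction for item D is pulled towards ben's low rating rather than cho's high one. With real data most user pairs share few or no items, which is the sparsity challenge mentioned in the abstract.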
Bracha Shapira is a professor at the Software and Information Systems Engineering Department at Ben-Gurion University of the Negev. She is currently the deputy dean of research at the Faculty of Engineering Sciences at BGU. She established the Information Retrieval Lab there and leads numerous research projects related to personalization, recommender systems, user profiling, privacy, and application of machine learning methods to cyber security. She has published more than 200 papers in major scientific conferences and journals and has several registered patents. Prof. Shapira is also the editor of "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015).
Lior Rokach is a professor of data science at the Ben-Gurion University of the Negev, where he currently serves as the chair of the Department of Software and Information Systems Engineering. His research interests lie in the areas of Machine Learning, Big Data, Deep Learning and Data Mining and their applications. Prof. Rokach is the author of over 300 peer-reviewed papers in leading journals and conference proceedings. Rokach has authored several popular books in data science, including Data Mining with Decision Trees (1st edition, World Scientific Publishing, 2007; 2nd edition, World Scientific Publishing, 2015). He is also the editor of "The Data Mining and Knowledge Discovery Handbook" (1st edition, Springer, 2005; 2nd edition, 2010) and "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015). He currently serves as an editorial board member of ACM Transactions on Intelligent Systems and Technology (ACM TIST) and an area editor for Information Fusion (Elsevier).
Real data often contain anomalous cases, also known as outliers. Depending on the situation, outliers may be (a) undesirable errors, which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from that fit. We present an overview of several robust methods and the resulting graphical outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimating location and scatter, linear regression, and principal component analysis. The emerging topic of cellwise outliers is also introduced.
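The idea of fitting the majority first and then flagging deviations can be sketched for the univariate case. The example below is a minimal illustration, not the specific methods of the lectures: it assumes the median as the robust location estimate and the median absolute deviation (MAD, scaled by the usual 1.4826 consistency factor for normal data) as the robust scale estimate, and the data values, function names, and cutoff of 3 are illustrative choices.

```python
import statistics


def robust_outliers(values, cutoff=3.0):
    """Flag indices whose robust z-score |x - median| / (1.4826 * MAD) exceeds cutoff."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return []  # more than half the data are identical; no scale available
    return [i for i, x in enumerate(values) if abs(x - med) / scale > cutoff]


def classical_outliers(values, cutoff=3.0):
    """Same rule, but with the non-robust mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, x in enumerate(values) if abs(x - mean) / sd > cutoff]


# Illustrative measurements with one gross error at index 6.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]

flagged_robust = robust_outliers(data)
flagged_classical = classical_outliers(data)
print("robust:", flagged_robust, "classical:", flagged_classical)
```

In this small sample the gross error inflates the mean and standard deviation enough to hide itself from the classical rule, while the median and MAD are barely affected and the robust rule flags it.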
Some basic knowledge of linear regression and principal components is helpful.
Peter Rousseeuw obtained his PhD at ETH Zurich, Switzerland and afterward became a professor in universities in The Netherlands, Switzerland and Belgium. For over a decade he worked full-time at Renaissance Technologies in the US. Currently he is Professor at KU Leuven in Belgium. His main research topics are cluster analysis (unsupervised classification) and anomaly detection by robust fitting, always with a focus on methodology as well as efficient algorithms and practical implementation. His work has been cited over 70,000 times. For more information see https://en.wikipedia.org/wiki/Peter_Rousseeuw and https://scholar.google.com/citations?user=5LMM6rsAAAAJ&hl=en .
Deep learning owes its success to the advent of massively parallel computing enabled by FPGAs (Field Programmable Gate Arrays), GPUs (Graphics Processing Units) and other special processors. However, many other neural network architectures can exploit such massively parallel computing. In this course, I will introduce the basic concepts and architectures of heterogeneous computing using FPGAs and GPUs. There are two basic languages for programming such hardware – OpenCL for FPGAs (from Intel, Xilinx and others) and CUDA for Nvidia GPUs. I will introduce the basic features of these languages and show how to implement parallel computations in these languages.
In the second part of this course, I will show how to implement some basic neural architectures on this kind of hardware. In addition, we can do much more with such hardware including feature selection, hyperparameter tuning and finding a good neural architecture. Finding the best combination of features, the best neural network design and the best hyperparameters is critical to neural networks. With the availability of massive parallelism, it is relatively easy now to explore, in parallel, many different combinations of features, neural network designs and hyperparameters.
In the last part of the course, I will discuss why it is becoming important that machine learning for IoT be performed at the edge of IoT instead of in the cloud, and how FPGAs and GPUs can facilitate that. Beyond IoT, in a wide range of application domains, from robotics to remote patient monitoring, localized machine learning from streaming sensor data is becoming increasingly important. GPUs, in particular, are available in a wide range of capabilities and prices, and one can use them in many such applications where localized machine learning is desirable.
►Lecture 1: Massively parallel, heterogeneous computing using FPGAs and GPUs – heterogeneous computing concepts and architectures; comparison of FPGAs and GPUs; programming languages for parallel computing (OpenCL, CUDA)
►Lecture 2: Implementation of basic neural network algorithms on FPGAs and GPUs exploiting massive parallelism; exploiting massive parallelism to explore different feature combinations and neural network designs and for hyperparameter tuning
►Lecture 3: Machine learning at the edge of IoT in real-time from streaming sensor data using FPGAs and GPUs – classification, function approximation, clustering, anomaly detection
Fundamentals of computer science, basic knowledge of neural networks
Asim Roy is a professor of information systems at Arizona State University. He earned his bachelor's degree from Calcutta University, his master's degree from Case Western Reserve University, and his doctorate from the University of Texas at Austin. He has been a visiting scholar at Stanford University and a visiting scientist at the Robotics and Intelligent Systems Group at Oak Ridge National Laboratory, Tennessee. Professor Roy serves on the Governing Board of the International Neural Network Society (INNS) and is currently its VP of Industrial Relations. He is the founder of two INNS Sections, one on Autonomous Machine Learning and the other on Big Data Analytics. He was the Guest Editor-in-Chief of an open access eBook Representation in the Brain of Frontiers in Psychology. He was also the Guest Editor-in-Chief of two special issues of Neural Networks - one on autonomous learning and the other on big data analytics. He is the Senior Editor of Big Data Analytics and serves on the editorial boards of Neural Networks and Cognitive Computation.
He has served on the organizing committees of many scientific conferences. He started the Big Data conference series of INNS and was the General Co-Chair of the first one in San Francisco in 2015. He was the Technical Program Co-Chair of IJCNN 2015 in Ireland and the IJCNN Technical Program Co-Chair for the World Congress on Computational Intelligence 2018 (WCCI 2018) in Rio de Janeiro, Brazil. He is currently the IJCNN General Chair for WCCI 2020 in Glasgow, UK (https://www.wcci2020.org/). He is currently working on hardware-based (GPU, FPGA-based) machine learning for real-time learning from streaming data at the edge of the Internet of Things (IoT). He is also working on Explainable AI.
The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial and spatiotextual databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.
We describe hierarchical representations of points, lines, collections
of small rectangles, regions, surfaces, and volumes. For region data,
we point out the dimension-reduction property of the region quadtree
and octree. We also demonstrate how to use them for both raster and
vector data. For metric data that does not lie in a vector space so
that indexing is based simply on the distance between objects, we
review various representations such as the vp-tree, gh-tree, and
mb-tree. In particular, we demonstrate the close relationship
between these representations and those designed for a vector space.
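As a rough illustration of indexing that relies only on distances, the following is a minimal vp-tree sketch in Python. It is not taken from the course material: the names `VPNode`, `build_vp_tree`, and `nearest` are hypothetical, the vantage point is chosen at random, and string edit distance stands in for an arbitrary metric that does not come from a vector space.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance: a metric on strings with no vector-space embedding."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # items with distance <= radius
        self.outside = outside  # items with distance >= radius


def build_vp_tree(items, dist, rng=None):
    """Recursively split the items around a randomly chosen vantage point."""
    rng = rng or random.Random(0)
    items = list(items)
    if not items:
        return None
    vp = items.pop(rng.randrange(len(items)))
    if not items:
        return VPNode(vp, 0.0, None, None)
    items.sort(key=lambda x: dist(vp, x))
    mid = len(items) // 2
    radius = dist(vp, items[mid])
    return VPNode(vp, radius,
                  build_vp_tree(items[:mid + 1], dist, rng),
                  build_vp_tree(items[mid + 1:], dist, rng))


def nearest(node, q, dist, best=None):
    """Return (distance, item) for the item closest to q."""
    if node is None:
        return best
    d = dist(q, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the more promising side first; visit the other side only if
    # the triangle inequality cannot rule it out.
    if d <= node.radius:
        best = nearest(node.inside, q, dist, best)
        if d + best[0] >= node.radius:
            best = nearest(node.outside, q, dist, best)
    else:
        best = nearest(node.outside, q, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, q, dist, best)
    return best


# Hypothetical example data.
words = ["london", "londen", "boston", "lisbon", "dublin", "berlin", "madrid"]
tree = build_vp_tree(words, edit_distance)
closest = nearest(tree, "lonndon", edit_distance)
```

The only operation the index ever applies to the data is `dist`, which is why the same code works for strings, documents, or any other objects for which a metric is defined.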
For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO Java applet, which illustrates these methods, is presented (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). These methods are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava).
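The idea of incremental nearest-object computation can be sketched with a small bucket quadtree and a priority queue. This is only an illustrative Python sketch under simplifying assumptions (two-dimensional points that all lie inside the root square, Euclidean distance); the class and function names are hypothetical and are not those used in VASCO or SAND.

```python
import heapq
import itertools
import math


class QuadTree:
    """Bucket quadtree over the square [x0, x0 + size] x [y0, y0 + size]."""

    def __init__(self, x0, y0, size, capacity=4):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []      # points held while this node is a leaf
        self.children = None  # four sub-quadrants after a split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            # Split an overfull leaf; very small cells are not split so that
            # duplicate points cannot cause unbounded recursion.
            if len(self.points) > self.capacity and self.size > 1e-9:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        h = self.size / 2
        self.children = [
            QuadTree(self.x0 + dx * h, self.y0 + dy * h, h, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        ix = 1 if p[0] >= self.x0 + h else 0
        iy = 1 if p[1] >= self.y0 + h else 0
        return self.children[2 * iy + ix]

    def min_dist(self, q):
        """Lower bound on the distance from q to any point in this cell."""
        dx = max(self.x0 - q[0], 0, q[0] - (self.x0 + self.size))
        dy = max(self.y0 - q[1], 0, q[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)


def incremental_nearest(tree, q):
    """Lazily yield (distance, point) pairs in increasing distance from q."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(tree.min_dist(q), next(tie), tree)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, QuadTree):
            if item.children is None:
                for p in item.points:
                    heapq.heappush(heap, (math.dist(p, q), next(tie), p))
            else:
                for c in item.children:
                    heapq.heappush(heap, (c.min_dist(q), next(tie), c))
        else:
            yield d, item


# Hypothetical example data.
tree = QuadTree(0, 0, 100)
for p in [(10, 10), (20, 80), (55, 45), (60, 40), (90, 90), (52, 52)]:
    tree.insert(p)
neighbors = incremental_nearest(tree, (50, 50))
first = next(neighbors)   # closest point
second = next(neighbors)  # next closest, computed only when requested
```

Because the generator does work only when the next neighbor is requested, the caller can stop after any number of results without having fixed that number in advance.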
The above has been in the context of the traditional geometric
representation of spatial data, while in the final part we review the
more recent textual representation which is used in location-based
services where the key issue is that of resolving ambiguities.
For example, does "London" correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of "London" is it? The NewsStand system at
newsstand.umiacs.umd.edu and the TwitterStand system at
TwitterStand.umiacs.umd.edu are examples. See also the
cover article of the October 2014 issue of Communications of the ACM
at http://tinyurl.com/newsstand-cacm or a cached version at
http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the
accompanying video at https://vimeo.com/106352925
Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.
Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished
University Professor of Computer Science at the University of
Maryland, College Park and is a member of the Institute for Computer
Studies. He is also a member
of the Computer Vision Laboratory at the Center for Automation Research
where he leads a number of research projects on the use of hierarchical
data structures for database applications, geographic information
systems, computer graphics, computer vision, image processing, games,
robotics, and search. He received the B.S. degree in engineering from
UCLA, and the M.S. degree in operations research and the M.S. and
Ph.D. degrees in computer science from Stanford University. His
doctoral dissertation dealt with proving the correctness of
translations of LISP programs which was the first work in translation
validation and the related concept of proof-carrying code.
He is the author of the recent book
"Foundations of Multidimensional and Metric Data Structures" (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf) published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, "Design and Analysis of Spatial Data
Structures" and "Applications of Spatial Data Structures: Computer
Graphics, Image Processing, and GIS," both published by Addison-Wesley
in 1990. He is the Founding Editor-In-Chief of the ACM Transactions
on Spatial Algorithms and Systems (TSAS), the founding chair of ACM
SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI)
Walton Visitor Award at the Centre for Geocomputation at the National
University of Ireland at Maynooth (NUIM), 2009 UCGIS Research
Award, 2010 CMPS Board of Visitors Award at the University of Maryland,
2011 ACM Paris Kanellakis Theory and Practice Award, 2014 IEEE
Computer Society Wallace McDowell Award, and a Fellow of the ACM,
IEEE, AAAS, IAPR (International Association for Pattern Recognition),
and UCGIS (University Consortium for Geographic Information Science). He was
recently elected to the SIGGRAPH Academy. He received best paper
awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD
and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS
Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo
award at the 2011 SIGSPATIAL ACMGIS'11 Conference. The 2008 ACM
SIGSPATIAL ACMGIS best paper award winner also received the SIGSPATIAL
10-Year Impact Award. His paper at the 2009 IEEE International
Conference on Data Engineering (ICDE) was selected as one of the best
papers for publication in the IEEE Transactions on Knowledge and Data
Engineering. He was elected to the ACM Council as the Capitol Region
Representative for the term 1989-1991, and is an ACM Distinguished Speaker.
What is the statistically optimal way to detect and extract information from signals in noisy data? After detecting ensembles of signals, what can we learn about the population of all the signals? This course will address these questions using the language of Bayesian inference. After reviewing the basics of Bayes theorem, we will frame the problem of signal detection in terms of hypothesis testing and model selection. Extracting information from signals will be cast in terms of computing posterior density functions of signal parameters. After reviewing model selection and parameter estimation, the course will focus on practical methods. Specifically, we will implement sampling algorithms which we will use to perform model selection and parameter estimation on signals in synthetic data sets. Finally, we will ask what can be learned about the population properties of an ensemble of signals. This population-level inference will be studied as a hierarchical inference problem.
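To give a flavor of the sampling-based parameter estimation described above, here is a minimal random-walk Metropolis sketch in Python. It is an illustration only, not the course's own code: it assumes a sinusoidal signal of known frequency in white Gaussian noise of known standard deviation, a flat prior on the amplitude, and hypothetical function names throughout.

```python
import math
import random


def make_data(amplitude, freq, sigma, n, dt, rng):
    """Synthetic data: a sinusoid of known frequency plus white Gaussian noise."""
    template = [math.sin(2 * math.pi * freq * i * dt) for i in range(n)]
    data = [amplitude * s + rng.gauss(0.0, sigma) for s in template]
    return template, data


def make_log_posterior(template, data, sigma, a_max=10.0):
    """Log posterior of the amplitude: flat prior on [0, a_max], Gaussian likelihood."""
    def log_post(a):
        if not 0.0 <= a <= a_max:
            return -math.inf
        return -0.5 * sum(((d - a * s) / sigma) ** 2
                          for s, d in zip(template, data))
    return log_post


def metropolis(log_post, start, step, n_samples, rng):
    """Random-walk Metropolis sampler for a one-dimensional posterior."""
    x, lp = start, log_post(start)
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = proposal, lp_prop
        chain.append(x)
    return chain


rng = random.Random(42)
template, data = make_data(amplitude=1.5, freq=2.0, sigma=1.0,
                           n=200, dt=0.01, rng=rng)
log_post = make_log_posterior(template, data, sigma=1.0)
chain = metropolis(log_post, start=1.0, step=0.1, n_samples=5000, rng=rng)

samples = chain[1000:]  # discard burn-in
post_mean = sum(samples) / len(samples)
post_std = math.sqrt(sum((a - post_mean) ** 2 for a in samples) / len(samples))
```

The retained samples approximate the posterior density of the amplitude, so summaries such as `post_mean` and `post_std` are simple averages over the chain. Model selection between a signal model and a noise-only model would additionally require the evidence, which this sketch does not compute.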
Basic probability theory, sampling, Python, and JupyterHub
Dr. Rory Smith is a lecturer in physics at Monash University in Melbourne, Australia. From 2013 to 2017, he was a senior postdoctoral fellow at the California Institute of Technology, where he worked on searches for gravitational waves. Dr. Smith participated in the landmark first detection of gravitational waves, for which the 2017 Nobel Prize in physics was awarded. Dr. Smith’s research focuses on detecting astrophysical gravitational-wave signals from black holes and neutron stars, and extracting the rich astrophysical information encoded within to study the fundamental nature of spacetime.
Social Computing is an emerging discipline, and like any discipline at a nascent stage it can often mean different things to different people. However, three distinct threads are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and multiplayer online games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The goal of this course is to discuss, in a tutorial manner, through case studies, and through discussion, what Social Computing is, where it is headed, and where it is taking us.
• Module 1: Introduction
• Module 2: Science
• Module 3: Applications
This course is intended primarily for graduate students. Following are the potential audiences:
Will provide later.
Jaideep Srivastava (https://www.linkedin.com/in/jaideep-srivastava-50230/) is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of machine learning and data mining. He has supervised 39 PhD dissertations, and over 65 MS theses. He has also mentored a number of post-doctoral fellows and junior scientists in the industry and academia. He has authored or co-authored over 420 papers in journals and conferences, and filed 8 patents. Seven of his papers have won best paper awards, and he has a Google Scholar citation count of over 25,649 and an h-index of 59 (https://scholar.google.com/citations?user=Y4J5SOwAAAAJ&hl=en&oi=ao).
Dr. Srivastava’s research has been supported by a broad range of government agencies, including NSF, NASA, ARDA, DARPA, IARPA, NIH, CDC, US Army, US Air Force, and MNDoT; and industries, including IBM, United Technologies, Eaton, Honeywell, Cargill, Allina and Huawei. He is a regular participant in the evaluation committees of various US and international funding agencies, on the organizing and steering committees of various international scientific forums, and on the editorial boards of a number of journals.
Dr. Srivastava has significant experience in the industry, in both consulting and executive roles. Most recently he was the Chief Scientist for Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he was the data mining architect for Amazon.com (www.amazon.com), built a data analytics department at Yodlee (www.yodlee.com), and served as the Chief Technology Officer for Persistent Systems (www.persistentsys.com). He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics (www.ninjametrics.com), based on his research in behavioral analytics. He was advisor and Chief Scientist for CogCubed (www.cogcubed.com), an innovative company with the goal to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc (https://www.teladoc.com/), and for Jornaya (https://www.jornaya.com/). He is presently a technology advisor to a number of startups at various stages, including Kipsu (http://kipsu.com/), which provides an innovative solution to improving service quality in the hospitality industry, and G2lytics (https://g2lytics.com/), an organization that uses machine learning to identify tax compliance problems.
Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is an advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.25+ billion citizens of India.
Dr. Srivastava has delivered over 170 invited talks in over 35 countries, including more than a dozen keynote addresses at major international conferences. He has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and an MS and a PhD from the University of California, Berkeley.
Meta-analysis plays an important role in summarizing and synthesizing scientific evidence derived from multiple studies. By combining multiple data sources, higher statistical power is achieved, leading to more accurate effect estimates and greater reproducibility. While the wealth of omics data in public repositories offers great opportunities to understand molecular mechanisms and identify biomarkers of human diseases, differences in design and methodology across studies often translate into poor reproducibility. Meta-analytical approaches are therefore important for making more robust inferences from this type of data.
In this course, we will learn the most common meta-analysis strategies for integrating high-throughput biological data and how to implement them in R. All practical exercises will be conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
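The core arithmetic behind combining effect estimates across studies can be sketched in a few lines. The course exercises themselves are in R; the sketch below uses Python only for illustration, with hypothetical function names and made-up numbers. It shows inverse-variance (fixed-effect) pooling and the DerSimonian-Laird random-effects estimate, two standard approaches that may or may not be the ones emphasized in the course.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def random_effects_dl(effects, variances):
    """DerSimonian-Laird random-effects pooled effect, standard error, and tau^2."""
    weights = [1.0 / v for v in variances]
    pooled_fe, _ = fixed_effect(effects, variances)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - pooled_fe) ** 2 for w, y in zip(weights, effects))
    k = len(effects)
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2


# Hypothetical log fold changes of one gene in three studies,
# with the squared standard errors reported by each study.
effects = [0.80, 0.35, 0.60]
variances = [0.04, 0.02, 0.09]

fe_estimate, fe_se = fixed_effect(effects, variances)
re_estimate, re_se, tau2 = random_effects_dl(effects, variances)
```

When the studies disagree more than their standard errors would suggest, `tau2` becomes positive and the random-effects standard error grows, which reflects the extra uncertainty introduced by differences in design and methodology.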
Students must be proficient in R and familiar with the analysis of some of the omics data types. Familiarity with statistical concepts and a basic understanding of regression and ANOVA are also expected. An installed version of R (https://cran.r-project.org/) and RStudio (https://www.rstudio.com/) on a laptop is required for completing exercises.
Mayte Suarez-Farinas, PhD, is currently an Associate Professor at the Center for Biostatistics and The Department of Genetics and Genomics Science of the Icahn School of Medicine at Mount Sinai, New York. She received an MSc in mathematics from the University of Havana, Cuba, in 1995, and a Ph.D. in quantitative analysis from the Pontifical Catholic University of Rio de Janeiro, Brazil, in 2003. Prior to joining Mount Sinai, she was co-director of Biostatistics at the Center for Clinical and Translational Science at the Rockefeller University, where she developed methodologies for data integration across omics studies and a framework to evaluate drug response at the molecular level in proof-of-concept studies in inflammatory skin diseases using mixed-effect models and machine learning. Her long-term goals are to develop robust statistical techniques to mine and integrate complex high-throughput data, with an emphasis on immunological diseases, and to develop precision medicine algorithms to predict treatment response and phenotype.
We shall study algorithms that have been found useful in querying large datasets. The emphasis is on algorithms that cannot be considered "machine learning."
A course in algorithms at the advanced-undergraduate level is important.
We will be covering (parts of) Chapters 3, 4, and 10 of the free text: Mining of Massive Datasets (third edition) by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org
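Chapter 3 of that text deals with finding similar items. As one illustration of the kind of algorithm involved, and not necessarily the material selected for this course, the following Python sketch estimates the Jaccard similarity of two documents with minhash signatures. The function names, the shingle length, and the use of CRC-32 plus random linear hash functions are assumptions made for this example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any 32-bit CRC value


def shingles(text, k=5):
    """Set of all length-k substrings (character shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def make_hashes(n, seed=0):
    """n random linear hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(items, hashes):
    """Minhash signature: for each hash function, the minimum hash over the set."""
    base = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in hashes]


def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


# Hypothetical example documents.
doc1 = "we shall study algorithms that have been found useful in querying large datasets"
doc2 = "we shall study methods that have been found useful in mining large datasets"

hashes = make_hashes(200)
s1, s2 = shingles(doc1), shingles(doc2)
exact = jaccard(s1, s2)
approx = estimate_similarity(signature(s1, hashes), signature(s2, hashes))
```

The signatures have a fixed length regardless of document size, so comparing two documents costs the same no matter how many shingles they contain; the price is that `approx` is only an estimate of `exact`.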
A brief on-line bio is available at i.stanford.edu/~ullman/pub/opb.txt
Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. The use of process mining is rapidly increasing and there are over 30 commercial vendors of process mining software. Through concrete data sets and easy-to-use software, the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.
The course explains the key analysis techniques in process mining. Participants will learn various process discovery algorithms. These can be used to automatically learn process models from raw event data. Various other process analysis techniques that use event data will be presented. Moreover, the course will provide easy-to-use software, real-life data sets, and practical skills to directly apply the theory in a variety of application domains.
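As a small illustration of learning from raw event data, the sketch below builds a directly-follows graph, a simple structure that many discovery algorithms start from. It is a hedged Python example rather than the course software: the event-log layout (case identifier, activity, timestamp), the activity names, and the function name `directly_follows` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count how often activity a is directly followed by activity b within a case.

    events: iterable of (case_id, activity, timestamp) tuples in any order.
    Returns (dfg, starts, ends): edge counts, start activities, end activities.
    """
    traces = defaultdict(list)
    # Order events by case and then by time to reconstruct each trace.
    for case_id, activity, _ in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)

    dfg, starts, ends = Counter(), Counter(), Counter()
    for activities in traces.values():
        starts[activities[0]] += 1
        ends[activities[-1]] += 1
        dfg.update(zip(activities, activities[1:]))
    return dfg, starts, ends


# Hypothetical event log with three cases.
log = [
    ("order-1", "register", 1), ("order-1", "check stock", 2),
    ("order-1", "ship", 3), ("order-1", "invoice", 4),
    ("order-2", "register", 1), ("order-2", "check stock", 2),
    ("order-2", "reject", 3),
    ("order-3", "register", 2), ("order-3", "check stock", 5),
    ("order-3", "invoice", 6), ("order-3", "ship", 7),
]

dfg, starts, ends = directly_follows(log)
for (a, b), count in sorted(dfg.items()):
    print(f"{a} -> {b}: {count}")
```

The edge counts already reveal, for instance, that shipping and invoicing occur in either order, which a discovery algorithm could interpret as concurrency when constructing a process model.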
Process mining provides not only a bridge between data mining and business process management; it also helps to address the classical divide between "business" and "IT". Evidence-based business process management based on process mining helps to create a common ground for business process improvement and information systems development.
Note that Gartner identified process-mining software as a new and important class of software. One can witness the rapid uptake by looking at the successful vendors (e.g., Celonis, Disco, ProcessGold, myInvenio, PAFnow, Minit, QPR, Mehrwerk, Puzzledata, LanaLabs, StereoLogic, Everflow, TimelinePI, Signavio, and Logpickr) and the organizations applying process mining at a large scale with thousands of users (e.g., Siemens and BMW). Yet many traditional mainstream-oriented data scientists (machine learners and data miners) are not aware of this. This explains the relevance of the course for BigDat 2020 participants.
The course focuses on process mining as the bridge between data science and process science. The course will introduce the three main types of process mining.
The course uses many examples based on real-life event logs to illustrate the concepts and algorithms. After taking this course, participants will be able to run process mining projects and will have a good understanding of the Business Process Intelligence field.
W.M.P. van der Aalst. Process Mining: Data Science in Action. Springer-Verlag, Berlin, 2016. (The course will also provide access to slides, several articles, software tools, and data sets.)
This course is aimed at both students (Master or PhD level) and professionals. A basic understanding of logic, sets, and statistics (at the undergraduate level) is assumed. Basic computer skills are required to use the software provided with the course (but no programming experience is needed). Participants are also expected to have an interest in process modeling and data mining but no specific prior knowledge is assumed as these concepts are introduced in the course.
Prof.dr.ir. Wil van der Aalst is a full professor at RWTH Aachen University leading the Process and Data Science (PADS) group. He is also part-time affiliated with the Fraunhofer-Institut für Angewandte Informationstechnik (FIT), where he leads FIT's Process Mining group, and the Technische Universiteit Eindhoven (TU/e). Until December 2017, he was the scientific director of the Data Science Center Eindhoven (DSC/e) and led the Architecture of Information Systems group at TU/e. Since 2003, he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a distinguished fellow of Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 220 journal papers, 20 books (as author or editor), 500 refereed conference/workshop publications, and 75 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an H-index of 144 and has been cited over 96,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. In addition to serving on the editorial boards of over ten scientific journals, he also plays an advisory role for several companies, including Fluxicon, Celonis, Processgold, and Bright Cape. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. In 2018, he was awarded an Alexander-von-Humboldt Professorship.