Course Description

Courses



  Sanchita Bhattacharya
    (University of California, San Francisco) [introductory/advanced]
    Big Data in Immunology: Sharing, Dissemination, and Repurposing


    Diego Calvanese
    (Free University of Bozen-Bolzano) [introductory]
    Virtual Knowledge Graphs for Data Integration


    Sheelagh Carpendale
    (University of Calgary) [introductory]
    Data Visualization


    Nitesh Chawla
    (University of Notre Dame) [intermediate/advanced]
    Learning from Imbalanced Data


    Amr El Abbadi
    (University of California, Santa Barbara) [introductory/intermediate]
    An Introduction to Blockchain

    Summary

    The rise of Bitcoin and other peer-to-peer cryptocurrencies has opened many interesting and challenging problems in cryptography, distributed systems, and databases. The main underlying data structure is the blockchain, a scalable, fully replicated structure that is shared among all participants and guarantees a consistent view of all user transactions by all participants in the system. In this course, we discuss the basic protocols used in blockchain and elaborate on its main advantages and limitations. To overcome these limitations, we provide the necessary distributed systems background in managing large-scale fully replicated ledgers, using Byzantine Agreement protocols to solve the consensus problem. Finally, we expound on some of the most recent proposals to design scalable and efficient blockchains in both permissionless and permissioned settings. The focus of the tutorial is on the distributed systems and database aspects of the recent innovations in blockchains.

    Syllabus

    • Introduction to Bitcoin and Blockchain
    • Fundamentals of digital signatures
    • Basics of hashing
    • Cryptocurrencies
    • Fundamentals of consensus
    • Mining and Proof of Work (see the sketch after this list)
    • The challenge of double spending and how to overcome it
    • The challenge of scalability in blockchain
    • A brief summary of permissioned blockchains
    • Atomic swaps between different blockchains
    • Lightning Networks and off-chain transactions
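
    To make the Proof of Work item concrete, the following is a minimal sketch in Python (an illustration only, not course material): a miner searches for a nonce that makes the SHA-256 hash of a block header start with a required number of zero hex digits, a simplification of Bitcoin's full target arithmetic.

        import hashlib

        def mine_block(prev_hash: str, transactions: str, difficulty: int = 4):
            """Search for a nonce whose block hash starts with `difficulty` zero hex digits."""
            target = "0" * difficulty
            nonce = 0
            while True:
                header = f"{prev_hash}|{transactions}|{nonce}".encode()
                digest = hashlib.sha256(header).hexdigest()
                if digest.startswith(target):
                    return nonce, digest
                nonce += 1

        # Example: mine a toy block; raising `difficulty` makes the search exponentially harder.
        nonce, block_hash = mine_block("00ab12...", "alice->bob:5", difficulty=4)
        print(f"nonce={nonce} hash={block_hash}")

    Because each block header includes the previous block's hash, rewriting history would require redoing the Proof of Work for every subsequent block, which is what makes the chain tamper-evident.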

    References

    1. Satoshi Nakamoto. 2008. Bitcoin: A peer-to-peer electronic cash system.
    2. Tier Nolan. 2013. Alt chains and atomic transfers. https://bitcointalk.org/index.php?topic=193281.msg2224949#msg2224949.
    3. Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, et al. 2018. Hyperledger Fabric: A distributed operating system for permissioned blockchains. arXiv preprint arXiv:1801.10228.
    4. Miguel Castro, Barbara Liskov, et al. 1999. Practical Byzantine fault tolerance. In OSDI, Vol. 99. 173–186.
    5. Joseph Poon and Thaddeus Dryja. 2016. The Bitcoin Lightning Network: Scalable off-chain instant payments. https://lightning.network/lightning-network-paper.pdf.
    6. Leslie Lamport et al. 2001. Paxos made simple. ACM SIGACT News 32, 4 (2001), 18–25.
    7. Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS) 4, 3 (1982), 382–401.
    8. Maurice Herlihy. 2018. Atomic cross-chain swaps. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing. ACM, 245–254.
    9. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM) 32, 2 (1985), 374–382.

    Pre-requisites

    Basic knowledge of data structures and operating systems.

    Short Bio

    Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B.Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. His research interests are in the fields of fault-tolerant distributed systems and databases, focusing recently on cloud data management and blockchain-based systems. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including The VLDB Journal, IEEE Transactions on Computers, and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student, Sudipto Das, received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.



    Charles Elkan
    (University of California, San Diego) [intermediate]
    A Rapid Introduction to Modern Deep Learning

    Summary

    These three lectures will explain the essential theory and practice of neural networks as used in current applications.

    Syllabus

    • 1: The simplest neural network: logistic regression from the perspective of conditional maximum likelihood and stochastic gradient descent (see the sketch after this list).
    • 2: The core of deep learning: loss functions, backpropagation, adaptive gradient descent.
    • 3: Modern deep learning: dropout, recurrent and residual and convolutional networks.
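
    As a concrete companion to item 1, here is a minimal NumPy sketch (an illustration, not course material) of logistic regression trained by stochastic gradient descent on the negative conditional log-likelihood:

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy data: two Gaussian blobs in 2-D, labels 0 and 1.
        X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
        y = np.array([0] * 100 + [1] * 100)

        w, b, lr = np.zeros(2), 0.0, 0.1
        for epoch in range(100):
            for i in rng.permutation(len(X)):
                p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # sigmoid: the model's P(y=1 | x)
                g = p - y[i]                               # gradient of the log-loss w.r.t. the logit
                w -= lr * g * X[i]                         # stochastic gradient step
                b -= lr * g

        acc = np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y)
        print(f"training accuracy: {acc:.2f}")

    The same update rule, applied layer by layer via backpropagation, is the core of the deep learning methods covered in items 2 and 3.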

    References

    The lectures will be less mathematical than https://mitpress.mit.edu/books/deep-learning but more mathematical than https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/. Both these books are highly recommended.

    Pre-requisites

    Mathematics at the level of an undergraduate degree in computer science: basic multivariate calculus, probability theory, and linear algebra.

    Short Bio

    Charles Elkan is the global head of machine learning and a managing director at Goldman Sachs in New York. He is also an adjunct professor of computer science at the University of California, San Diego (UCSD). From 2014 to 2018 he was the first Amazon Fellow, leading a team of over 30 scientists and engineers in Seattle, Palo Alto, and New York doing research and development in applied machine learning in both e-commerce and cloud computing. Before joining Amazon, he was a full-time professor of computer science at UCSD. His Ph.D. is from Cornell in computer science, and his undergraduate degree is from Cambridge in mathematics. For publications, see https://scholar.google.com/citations?user=im5aMngAAAAJ&hl=en



    Minos Garofalakis
    (Technical University of Crete) [intermediate/advanced]
    Private Data Analytics at Scale


    Jiawei Han
    (University of Illinois, Urbana-Champaign) [intermediate/advanced]
    From Unstructured Text to TextCube: Automated Construction and Multidimensional Exploration

    Summary

    Real-world big data is largely unstructured, interconnected, and dynamic, often in the form of natural language text. It is highly desirable to transform such massive unstructured data into structured knowledge. Many researchers rely on labor-intensive labeling and curation to extract knowledge from such data. However, such approaches may not be scalable, especially since many text corpora are highly dynamic and domain-specific. We argue that massive text data itself may disclose a large body of hidden patterns, structures, and knowledge. Equipped with domain-independent and domain-dependent knowledge bases, we should explore the power of massive data itself to turn unstructured data into structured knowledge. Moreover, by organizing massive text documents into multidimensional text cubes, we show that structured knowledge can be extracted and used effectively. In this talk, we introduce a set of methods developed recently in our group for such an exploration, including mining quality phrases, entity recognition and typing, multi-faceted taxonomy construction, and the construction and exploration of multidimensional text cubes. We show that a data-driven approach can be a promising direction for transforming massive text data into structured knowledge.

    Syllabus

    • Part 1: Introduction
    • Part 2: Automated Phrase Mining (see the sketch after this list)
    • Part 3: Automated Entity/Relation Recognition
    • Part 4: Embedding Methods and Taxonomy Mining
    • Part 5: Text Classification, Text Cube Construction and Exploration
    • Part 6: Looking Forward
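
    As a toy illustration of the phrase mining idea in Part 2 (a simplified sketch, not the course's actual algorithms), one can score adjacent word pairs by pointwise mutual information (PMI): pairs that co-occur far more often than chance are candidate quality phrases.

        import math
        from collections import Counter

        docs = ["data mining and text mining on massive text corpora",
                "phrase mining extracts quality phrases from text corpora"]

        unigrams, bigrams, n = Counter(), Counter(), 0
        for d in docs:
            tokens = d.split()
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
            n += len(tokens)

        def pmi(a: str, b: str) -> float:
            """PMI of the adjacent pair (a, b): log of p(a,b) / (p(a) * p(b))."""
            return math.log((bigrams[(a, b)] / n) / ((unigrams[a] / n) * (unigrams[b] / n)))

        # Rank pairs seen more than once; "text corpora" surfaces as a phrase candidate.
        for (a, b), count in bigrams.items():
            if count > 1:
                print(f"{pmi(a, b):5.2f}  {a} {b}")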

    References

    • Jialu Liu, Jingbo Shang, and Jiawei Han, Phrase Mining from Massive Text and Its Applications, Morgan & Claypool Publishers, 2017
    • Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, “ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering”, in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
    • Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M. Kaplan, Timothy P. Hanratty, and Jiawei Han, “MetaPAD: Meta Pattern Discovery from Massive Text Corpora”, KDD'17, Aug. 2017
    • Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019
    • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality”, NIPS 2013
    • Chao Zhang and Jiawei Han, Multidimensional Mining of Massive Text Data, Morgan & Claypool Publishers, 2019
    • Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, and Jiawei Han, “Spherical Text Embedding”, in Proc. 2019 Conf. on Neural Information Processing Systems (NeurIPS'19), Vancouver, Canada, Dec. 2019

    Pre-requisites

    • Elementary knowledge of machine learning, data mining, and natural language processing.

    Short Bio

    Jiawei Han is Michael Aiken Chair Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been researching data mining, information network analysis, database systems, and data warehousing, with over 900 journal and conference publications. He has chaired or served on the program committees of most major data mining and database conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (2008-2011), the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab (2009-2016), and the co-Director of KnowEnG, an NIH-funded Center of Excellence in Big Data Computing, since 2014. He is a Fellow of ACM and a Fellow of IEEE. He received the ACM SIGKDD Innovations Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE W. Wallace McDowell Award (2009), and Japan's Funai Achievement Award (2018), and was named Michael Aiken Chair Professor at the University of Illinois (2019).



    Xiaohua Tony Hu
    (Drexel University) [introductory/advanced]
    Machine Learning Methods for Big Microbiome Data Analysis

    Summary

    Microorganisms are abundant in nature and play a vital role in ecosystems. The human body, as one of their hosts, harbors thousands of microbial populations. In the past few years, high-throughput biological sequencing technologies and experiments have advanced rapidly, and the resulting mass of sequencing data makes it challenging to identify the crucial roles that microbes play in their hosts. In this course, we will introduce the background of the microbiome and its implications. We will then introduce several aspects of big microbiome data mining and analysis. First, we will discuss computational methods for visualizing microbiome data, including dimension-reduction and visualization methods such as t-SNE and principal coordinate analysis. Next, we will discuss computational methods for the classification of microbiome samples, including nonnegative matrix factorization, random forests, and deep learning methods, as well as feature selection and biomarker discovery for microbe-related diseases. We will also discuss computational techniques for constructing and analyzing microbial interaction networks from microbial profile data.

    Syllabus

    • 1. Introduction to big microbiome data mining and analysis
    • 1.1 What is the microbiome and its implications
    • 1.2 Overview of microbiome data mining and analysis
    • 2. Data visualization for microbiome data
    • 2.1 Introduction to data visualization for microbiome data
    • 2.2 Principal Coordinate Analysis and phylogenetic trees
    • 2.3 t-SNE and more
    • 3. Classification of microbiome samples
    • 3.1 NMF and Random Forest (see the sketch after this list)
    • 3.2 Deep learning methods for classification
    • 3.3 Feature selection and biomarker discovery
    • 4. Predicting microbial interaction networks
    • 4.1 Correlation and similarity analysis
    • 4.2 Inferring high-order relationships
    • 4.3 Network-based association prediction
    • 4.4 Microbial interaction network analysis
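
    As a small illustration of topics 2 and 3 (a hypothetical scikit-learn sketch on synthetic data, not a real microbiome study), the following embeds a taxon-abundance matrix in 2-D with t-SNE and trains a random forest whose feature importances suggest candidate biomarker taxa:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.manifold import TSNE
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)

        # Synthetic relative-abundance matrix: 60 samples x 50 taxa,
        # with taxon 0 artificially enriched in the "disease" group.
        X = rng.dirichlet(np.ones(50), size=60)
        y = np.array([0] * 30 + [1] * 30)
        X[y == 1, 0] += 0.2
        X /= X.sum(axis=1, keepdims=True)          # renormalize rows to relative abundances

        # 2-D embedding of the samples for visualization.
        embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

        # Classification and biomarker (feature) ranking.
        rf = RandomForestClassifier(n_estimators=200, random_state=0)
        print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
        rf.fit(X, y)
        print("top taxa by importance:", np.argsort(rf.feature_importances_)[::-1][:5])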

    References

    Included in the slides provided at the lecture.

    Pre-requisites

    • Introductory-level knowledge of machine learning and bioinformatics.

    Short Bio

    Xiaohua Tony Hu (Ph.D., 1995) is a full professor and the founding director of the data mining and bioinformatics lab at the College of Computing and Informatics (the former College of Information Science and Technology, one of the best information science schools in the USA, ranked #1 in 1999 and #6 in 2010 in information systems by U.S. News & World Report). He is also serving as the founding Co-Director of the NSF Center (I/UCRC) on Visual and Decision Informatics (NSF CVDI), IEEE Computer Society Bioinformatics and Biomedicine Steering Committee Chair, and IEEE Computer Society Big Data Steering Committee Chair. Tony is a scientist, teacher, and entrepreneur. He joined Drexel University in 2002. He founded the International Journal of Data Mining and Bioinformatics (SCI indexed) in 2006. Earlier, he worked as a research scientist at world-leading R&D centers such as the Nortel Research Center and Verizon Labs (the former GTE Labs). In 2001, he founded DMW Software in Silicon Valley, California. He has extensive experience and expertise in converting original ideas into research prototypes and eventually into commercial products; many of his research ideas have been integrated into commercial products and applications in data mining, fraud detection, and database marketing.

    Tony’s current research interests are in big data, data/text/web mining, bioinformatics, information retrieval and information extraction, social network analysis, and healthcare informatics. He has published more than 280 peer-reviewed research papers (with more than 20,000 Google Scholar citations) in various journals, conferences, and books, such as various IEEE/ACM Transactions (IEEE/ACM TCBB, IEEE TFS, IEEE TKDE, IEEE TITB, IEEE SMC, IEEE Computer, IEEE NanoBioScience, IEEE Intelligent Systems), JIS, KAIS, CI, DKE, IJBRA, ACM SIGKDD, IEEE ICDM, IEEE ICDE, SIGIR, ACM CIKM, IEEE BIBE, IEEE CICBC, etc., and has co-edited 20 books/proceedings. He has received several prestigious awards, including the 2005 National Science Foundation (NSF) Career award, the best paper award at the 2007 International Conference on Artificial Intelligence, the best paper award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, the 2010 IEEE Granular Computing Outstanding Contribution Award, the 2007 IEEE Bioinformatics and Bioengineering Outstanding Contribution Award, the 2006 IEEE Granular Computing Outstanding Service Award, and the 2001 IEEE Data Mining Outstanding Service Award. He has also served as a program co-chair/conference co-chair of 14 international conferences/workshops and as a program committee member of more than 80 international conferences in the above areas. He is the founding editor-in-chief of the International Journal of Data Mining and Bioinformatics (SCI indexed) and of the International Journal of Granular Computing, Rough Sets and Intelligent Systems, and an associate editor/editorial board member of four international journals (KAIS, IJDWM, IJSOI, and JCIB). His research projects are funded by the National Science Foundation (NSF), the US Dept. of Education, the PA Dept. of Health, and industry labs. He has obtained more than US$9.0 million in research grants in the past 12 years as PI or Co-PI (PI of 9 NSF grants and 1 IMLS grant in the last 10 years). He graduated 24 Ph.D. students from 2006 to 2019 and is currently supervising 6 Ph.D. students.

    Tony is the founding Co-Director of the NSF Center for Visual and Decision Informatics (NSF CVDI); there are about 60 such centers nationwide, supported by the NSF Industry/University Cooperative Research Center program across all the research disciplines covered by NSF. The CVDI is a “National Center of Excellence” dealing with big data challenges. The center is funded by NSF, by members from industry and government, and by university matching funds. The current industry members and government agencies associated with Drexel University are: Children's Hospital of Philadelphia, Elsevier, the Institute of Museum and Library Services, Johnson & Johnson, Microsoft Research, the Penn Dept. of Health, Thomson Reuters, SunGard LLP, Lockheed Martin, IMS Healthcare, and SOI. The CVDI serves to drive continuous innovation through knowledge sharing among partners, leading to the invention and commercialization of information and knowledge engineering technologies for decision support, and to research and develop next-generation data mining, visual, and decision support tools and techniques that enable decision makers in government and industry to fundamentally improve the way their organization’s information is interpreted and analyzed. Currently, about 30 faculty and staff from Drexel University and the University of Louisiana at Lafayette, and 12 industry companies/government agencies, participate in various research projects.

    Tony has 8 years of solid industry R&D experience and has converted many original research ideas into research prototype systems and eventually into commercial products. In his Ph.D. thesis (1995, University of Regina), entitled "Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach", he introduced rough set theory to data mining research, developed an attribute-oriented rough set approach for data mining, and designed a research prototype system, DBROUGH, which was later successfully transferred to industry in Canada. From 1994 to 1998, he was a research scientist in data mining at the Nortel Network Research Center, GTE Labs (now Verizon Labs), and elsewhere. He worked on many data mining projects for real-time telephone switch diagnosis, data management, and wireless churn prediction. Among them, the CHAMP (CHurn Analysis, Modeling and Prediction) project was nominated for GTE’s highest technical achievement award in 1997. From 1998 to 2002, he designed and developed commercial data mining software at various start-up companies (KSP, Blue Martini Software); KSP was acquired by Exchange Applications for $52 million in April 2000. He has successfully deployed data mining products/systems at Fortune 100 companies such as Chase, Citibank, and Sprint for credit fraud detection, e-personalization, and customer management systems.



    Craig Knoblock
    (University of Southern California) [intermediate/advanced]
    Building Knowledge Graphs

    Summary

    There is a tremendous amount of data spread across the web and stored in databases that can be turned into an integrated semantic network of data, called a knowledge graph. Knowledge graphs have been applied to a variety of challenging real-world problems, including combating human trafficking, finding illegal arms sales in online marketplaces, and identifying threats in space. However, exploiting the available data to build knowledge graphs is difficult due to the heterogeneity of the sources, the scale of the data, and the noise in the data. In this course, I will review the techniques for building knowledge graphs, including extracting data from online sources, aligning the data to a common terminology, linking the data across sources, and representing knowledge graphs and querying them at scale.

    Syllabus

    • Part 1: Knowledge graphs

    • Web data extraction

    • Part 2: Source alignment

    • Entity linking

    • Part 3: Representing and querying knowledge graphs (see the sketch after this list)

    • Existing knowledge graphs to reuse
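
    To make Part 3 concrete, here is a minimal sketch (illustrative only, with made-up entities; real systems use RDF stores and SPARQL) of a knowledge graph as a set of subject-predicate-object triples with a wildcard pattern query:

        # A toy knowledge graph; entity and relation names are hypothetical.
        triples = {
            ("USC", "locatedIn", "Los Angeles"),
            ("Los Angeles", "locatedIn", "California"),
            ("Craig Knoblock", "worksAt", "USC"),
        }

        def query(s=None, p=None, o=None):
            """Return all triples matching the pattern; None acts as a wildcard."""
            return [t for t in triples
                    if (s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)]

        print(query(p="locatedIn"))        # every locatedIn edge in the graph
        print(query(s="Craig Knoblock"))   # all facts about one entity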

    Pre-requisites

    Background in computer science and some basic knowledge of AI, machine learning, and databases will be helpful, but not required.

    Short Bio

    Craig Knoblock is a Research Professor of both Computer Science and Spatial Sciences at the University of Southern California (USC), Keston Executive Director of the USC Information Sciences Institute, and Director of the Data Science Program at USC. He received his Bachelor of Science degree from Syracuse University and his Master’s and Ph.D. from Carnegie Mellon University in computer science. His research focuses on techniques for describing, acquiring, and exploiting the semantics of data. He has worked extensively on source modeling, schema and ontology alignment, entity and record linkage, data cleaning and normalization, extracting data from the Web, and combining all of these techniques to build knowledge graphs. He has published more than 300 journal articles, book chapters, and conference papers on these topics and has received 7 best paper awards on this work. Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Fellow of the Association of Computing Machinery (ACM), past President and Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and winner of the 2014 Robert S. Engelmore Award.



    Wladek Minor
    (University of Virginia) [introductory/advanced]
    Big Data in Biomedical Sciences

    Syllabus

    • Big Data and Big Data in Biomedical Sciences
    • Why Big Data is perceived as a big problem - technological considerations
    • Data reduction - should we preserve unreduced (raw) data?
    • Databases and databanks
    • Data mining with the use of raw data
    • Data mining in databanks and databases
    • Data Integration
    • Automatic and semi-automatic curation of large amounts of data
    • Conversion of databanks into databases and Advanced Information Systems
    • Experimental results and knowledge
    • Database priorities – content and design
    • Interaction between databases
    • Modern data management in biomedical sciences – necessity or luxury
    • Automatic data harvesting – close reality or still on the horizon
    • Reproducibility of biomedical experiments - drug discovery considerations
    • Artificial Intelligence and machine learning in drug discovery
    • Big data in medicine - new possibilities
    • Personalized medicine
    • Future considerations

    Short Bio

    Wladek Minor is Harrison Distinguished Professor of Molecular Physiology and Biological Physics at the University of Virginia. His work focuses on the development of methods for structural biology, in particular macromolecular structure determination by protein crystallography, as well as data management in structural biology, data mining as applied to drug discovery, and bioinformatics. He is a member of the Center for Structural Genomics of Infectious Diseases and a former member of the Midwest Center for Structural Genomics, the New York Center for Structural Genomics, and the Enzyme Function Initiative.



    Bamshad Mobasher
    (DePaul University) [intermediate]
    Context-aware Recommender Systems


    Jayanti Prasad
    (Embold Technologies) [introductory/intermediate]
    Big Code

    Summary

    At present, we are going through a fourth industrial revolution fueled by data, which has been termed the 'new oil'. We can easily find examples where insight drawn from large data sets is driving growth in different areas of industry. For example, large corpora of text available in many different languages have led to breakthroughs in Natural Language Processing tasks such as neural machine translation, sentiment analysis, text summarization, question answering, and picture and video captioning. Like these huge text corpora, we now have many software repositories with billions of lines of code easily available on platforms such as GitHub, Bitbucket, etc. We can call this data 'Big Code', since it has all four Vs (volume, variety, velocity, and veracity) of Big Data. These software repositories not only contain billions of lines of code, they also have associated natural language text such as class, method, and variable names, comments, doc strings, version history, bug reports, etc. This huge volume of data makes a good case for applying machine learning to source code and drawing insights for better software development, although it also brings unique challenges, such as how to model source code. In this course, I will discuss a framework for machine learning on source code and present a set of use cases where it has been applied. I will introduce a set of tools and techniques for building an end-to-end inference pipeline, from data acquisition, cleaning, and feature engineering to modelling and inference, for applying machine learning to source code.

    Syllabus

    • Introduction to programming languages, intermediate representation, and compilation
    • Tools and techniques for processing text data
    • Source code modelling and Abstract Syntax Trees (ASTs) (see the sketch after this list)
    • Language models of natural and programming languages
    • Machine learning on source code
    • LSTM-based neural translation models
    • Use cases
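
    As a small illustration of the AST topic above (a sketch using Python's standard `ast` module; the course may use other parsers, such as bblfshd), the following extracts function names, argument names, and identifiers, which are the kind of structured features "Big Code" models consume:

        import ast
        import textwrap

        # Hypothetical snippet of source code to analyze.
        src = textwrap.dedent("""
            def add(a, b):
                return a + b
        """)

        tree = ast.parse(src)

        # Walk the AST and print structured features of the code.
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                print("function:", node.name, "args:", [a.arg for a in node.args.args])
            elif isinstance(node, ast.Name):
                print("identifier:", node.id)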

    References

    1. Programming Languages: Principles and Paradigms (Undergraduate Topics in Computer Science), 2010 edition, by Maurizio Gabbrielli and Simone Martini
    2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)
    3. A Survey of Machine Learning for Big Code and Naturalness (arXiv:1709.06182 [cs.SE]), Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton
    4. bblfshd: A self-hosted server for source code parsing (https://github.com/bblfsh/bblfshd)
    5. Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le (https://arxiv.org/abs/1409.3215?context=cs)
    6. code2vec: Learning Distributed Representations of Code, Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav (arXiv:1803.09473 [cs.LG])

    Pre-requisites

    • Basic knowledge of programming languages and compilation
    • Object-oriented Python programming
    • Basic knowledge of machine learning and neural networks
    • Familiarity with TensorFlow and Keras
    • Elementary concepts of probability, statistics, and linear algebra
    • Software development processes and version control systems

    Short Bio

    Dr. Jayanti Prasad received his Ph.D. in Physics (Astrophysics) from the Harish-Chandra Research Institute, Allahabad, India, and was a Post-Doctoral Fellow at the National Centre for Radio Astrophysics (NCRA), Pune, India, and the Inter-University Centre for Astronomy and Astrophysics (IUCAA), Pune, India. Dr. Prasad has been a recipient of international and national research grants and has published more than 100 research papers, both with small groups and in large collaborations. During his six years as a member of the LIGO Scientific Collaboration, he contributed to major discoveries such as the first detection of gravitational waves. Dr. Prasad also worked as a consultant for the LIGO Data Grid Center at IUCAA. He has worked on computing- and data-intensive problems in astronomy and astrophysics, such as galaxy formation and clustering, cosmological N-body simulations, radio astronomy data processing, the Cosmic Microwave Background Radiation, and gravitational waves. Some of his work on problems such as Particle Swarm Optimization and Maximum Entropy Deconvolution has also been well received outside the astronomy community. Recently, Dr. Prasad has become deeply interested in Data Science, Machine Learning, and Software Engineering, and is working as a data scientist for a software company in India.

    More information at: https://scholar.google.co.in/citations?user=hdn0Ln0AAAAJ&hl=en and https://ui.adsabs.harvard.edu/search/q=author%3A%22Prasad%2C%20Jayanti%22&sort=date%20desc%2C%20bibcode%20desc&p_=0



    Lior Rokach and Bracha Shapira
    (Ben-Gurion University of the Negev) [introductory/intermediate]
    Recommender Systems


    Peter Rousseeuw
    (KU Leuven) [introductory]
    Anomaly Detection by Robust Methods

    Summary

    Real data often contain anomalous cases, also known as outliers. Depending on the situation, outliers may be (a) undesirable errors, which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from it. We present an overview of several robust methods and the resulting graphical outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimating location and scatter, linear regression, and principal component analysis. The emerging topic of cellwise outliers is also introduced.

    Syllabus

    • Introduction to robust methods: concepts of influence function and breakdown value, univariate robust estimators. Includes explicit robust estimators of location, scale, and skewness (see the sketch after this list).
    • Robust estimators of multivariate location and scatter, including Stahel-Donoho and Minimum Covariance Determinant.
    • Robust linear regression, including Least Trimmed Squares, and regression with categorical predictors.
    • Robust principal component analysis (PCA): principles, projection pursuit, spherical PCA, ROBPCA.
    • Cellwise outliers: concepts and detection.
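
    To accompany the first item above, here is a minimal univariate sketch in Python (an illustration only; the course's reference implementations are the R packages robustbase and cellWise) of the "fit the majority, then flag deviations" idea, using the median and the median absolute deviation (MAD):

        import numpy as np

        def robust_flags(x, cutoff=3.0):
            """Flag points whose robust z-score |x - median| / (1.4826 * MAD) exceeds cutoff."""
            med = np.median(x)
            scale = 1.4826 * np.median(np.abs(x - med))   # MAD, rescaled for consistency under normality
            return np.abs(x - med) / scale > cutoff

        rng = np.random.default_rng(0)
        x = np.concatenate([rng.normal(0.0, 1.0, 100), [8.0, -9.0]])  # 100 clean points + 2 outliers
        print(np.where(robust_flags(x))[0])   # indices of the flagged outliers

    Unlike the classical mean and standard deviation, the median and MAD have a 50% breakdown value, so a few planted outliers cannot mask themselves.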

    References

    • Hubert, M., Rousseeuw, P.J., Van den Bossche, W. (2020). MacroPCA: an all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, to appear. Open access from https://www.tandfonline.com/doi/full/10.1080/00401706.2018.1562989 .
    • Maechler, M., Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M., Verbeke, T., Koller, M., Conceicao, E., Di Palma, M. (2019). robustbase: Basic Robust Statistics. R package, https://CRAN.R-project.org/package=robustbase .
    • Raymaekers, J., Rousseeuw, P.J. (2019). A generalized spatial sign covariance matrix. Journal of Multivariate Analysis, 171, 94-111. Open access from https://www.sciencedirect.com/science/article/pii/S0047259X18302410?via%3Dihub .
    • Raymaekers, J., Rousseeuw, P., Van den Bossche, W., Hubert, M. (2019). cellWise: Analyzing Data with Cellwise Outliers. R package, https://CRAN.R-project.org/package=cellWise .
    • Rousseeuw, P.J., Hubert, M. (2018). Anomaly detection by robust statistics. WIREs Data Mining and Knowledge Discovery, 8 (2), e1236. Open access from https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1236 .
    • Rousseeuw, P.J., Leroy, A. (1987). Robust Regression and Outlier Detection. New York: Wiley-Interscience.
    • Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60, 135-145. Open access from https://www.tandfonline.com/doi/full/10.1080/00401706.2017.1340909 .

    Pre-requisites

    Some basic knowledge of linear regression and principal components is helpful.

    Short Bio

    Peter Rousseeuw obtained his PhD at ETH Zurich, Switzerland, and afterward became a professor at universities in the Netherlands, Switzerland, and Belgium. For over a decade he worked full-time at Renaissance Technologies in the US. Currently he is Professor at KU Leuven in Belgium. His main research topics are cluster analysis (unsupervised classification) and anomaly detection by robust fitting, always with a focus on methodology as well as efficient algorithms and practical implementation. His work has been cited over 70,000 times. For more information see https://en.wikipedia.org/wiki/Peter_Rousseeuw and https://scholar.google.com/citations?user=5LMM6rsAAAAJ&hl=en .



    Asim Roy
    (Arizona State University) [intermediate]
    Hardware-based (GPU, FPGA based) Machine Learning That Exploits Massively Parallel Computing – An Overview of Concepts, Architectures and Neural Network Algorithm Implementation

    Summary

    Deep learning owes its success to the advent of massively parallel computing enabled by FPGAs (Field Programmable Gate Arrays), GPUs (Graphical Processing Units) and other special processors. However, many other neural network architectures can exploit such massively parallel computing. In this course, I will introduce the basic concepts and architectures of heterogeneous computing using FPGAs and GPUs. There are two basic languages for programming such hardware – OpenCL for FPGAs (from Intel, Xilinx and others) and CUDA for Nvidia GPUs. I will introduce the basic features of these languages and show how to implement parallel computations in these languages.

    In the second part of this course, I will show how to implement some basic neural architectures on this kind of hardware. In addition, we can do much more with such hardware including feature selection, hyperparameter tuning and finding a good neural architecture. Finding the best combination of features, the best neural network design and the best hyperparameters is critical to neural networks. With the availability of massive parallelism, it is relatively easy now to explore, in parallel, many different combinations of features, neural network designs and hyperparameters.

    In the last part of the course, I will discuss why it is becoming important that machine learning for IoT be at the edge of IoT instead of the cloud and how FPGAs and GPUs can facilitate that. And not just IoT, but in a wide range of application domains, from robotics to remote patient monitoring, localized machine learning from streaming sensor data is becoming increasingly important. GPUs, in particular, are available in a wide range of capabilities and prices and one can use them in many such applications where localized machine learning is desirable.

    Syllabus

    • Lecture 1: Massively parallel, heterogeneous computing using FPGAs and GPUs – heterogeneous computing concepts and architectures; comparison of FPGAs and GPUs; programming languages for parallel computing (OpenCL, CUDA)

    • Lecture 2: Implementation of basic neural network algorithms on FPGAs and GPUs exploiting massive parallelism; exploiting massive parallelism to explore different feature combinations and neural network designs and for hyperparameter tuning (see the sketch after this list)

    • Lecture 3: Machine learning at the edge of IoT in real time from streaming sensor data using FPGAs and GPUs – classification, function approximation, clustering, anomaly detection
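
    As a taste of Lecture 2, here is a hypothetical sketch using Numba's CUDA support in Python (rather than raw CUDA C or OpenCL, which the lectures cover; it assumes an Nvidia GPU with the CUDA toolkit installed). It computes one fully connected neural network layer with one GPU thread per output neuron:

        import math
        import numpy as np
        from numba import cuda

        @cuda.jit
        def dense_layer(x, w, b, out):
            """Each thread computes one neuron: out[j] = tanh(dot(w[j], x) + b[j])."""
            j = cuda.grid(1)                    # global thread index
            if j < out.size:
                acc = b[j]
                for k in range(x.size):
                    acc += w[j, k] * x[k]
                out[j] = math.tanh(acc)

        n_in, n_out = 1024, 4096
        rng = np.random.default_rng(0)
        x = rng.normal(size=n_in).astype(np.float32)
        w = rng.normal(size=(n_out, n_in)).astype(np.float32)
        b = np.zeros(n_out, dtype=np.float32)
        out = np.zeros(n_out, dtype=np.float32)

        threads = 128
        blocks = (n_out + threads - 1) // threads
        dense_layer[blocks, threads](x, w, b, out)   # launch: one thread per output neuron
        print(out[:4])

    The same pattern extends to hyperparameter search: since individual neurons, and at a coarser grain whole candidate networks, are independent, they map naturally onto the massive parallelism of GPUs and FPGAs.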

    References

    1. Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., & Dongarra, J. (2012). From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 38(8), 391-407.
    2. Karimi, K., Dickson, N. G., & Hamze, F. (2010). A performance comparison of CUDA and OpenCL. arXiv preprint arXiv:1005.2581.
    3. Lacey, G., Taylor, G. W., & Areibi, S. (2016). Deep learning on FPGAs: Past, present, and future. arXiv preprint arXiv:1602.04283.
    4. Li, H., Ota, K., & Dong, M. (2018). Learning IoT in edge: Deep learning for the Internet of Things with edge computing. IEEE Network, 32(1), 96-101.
    5. Martinez, G., Gardner, M., & Feng, W. C. (2011, December). CU2CL: A CUDA-to-OpenCL translator for multi- and many-core architectures. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems (pp. 300-307). IEEE.
    6. Misra, J., & Saha, I. (2010). Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing, 74(1-3), 239-255.
    7. Nurvitadhi, E., Venkatesh, G., Sim, J., Marr, D., Huang, R., Ong Gee Hock, J., ... & Boudoukh, G. (2017, February). Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 5-14). ACM.
    8. Oh, K. S., & Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6), 1311-1314.
    9. Omondi, A. R., & Rajapakse, J. C. (Eds.). (2006). FPGA Implementations of Neural Networks (Vol. 365). Dordrecht, The Netherlands: Springer.
    10. Ortega-Zamorano, F., Jerez, J. M., Munoz, D. U., Luque-Baena, R. M., & Franco, L. (2015). Efficient implementation of the backpropagation algorithm in FPGAs and microcontrollers. IEEE Transactions on Neural Networks and Learning Systems, 27(9), 1840-1850.
    11. Verhelst, M., & Moons, B. (2017). Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices. IEEE Solid-State Circuits Magazine, 9(4), 55-65.
    12. Zhu, J., & Sutton, P. (2003, September). FPGA implementations of neural networks - a survey of a decade of progress. In International Conference on Field Programmable Logic and Applications (pp. 1062-1066). Springer, Berlin, Heidelberg.

    Pre-requisites

    Fundamentals of computer science, basic knowledge of neural networks

    Short Bio

    Asim Roy is a professor of information systems at Arizona State University. He earned his bachelor's degree from Calcutta University, his master's degree from Case Western Reserve University, and his doctorate from the University of Texas at Austin. He has been a visiting scholar at Stanford University and a visiting scientist at the Robotics and Intelligent Systems Group at Oak Ridge National Laboratory, Tennessee. Professor Roy serves on the Governing Board of the International Neural Network Society (INNS) and is currently its VP of Industrial Relations. He is the founder of two INNS Sections, one on Autonomous Machine Learning and the other on Big Data Analytics. He was the Guest Editor-in-Chief of Representation in the Brain, an open access eBook of Frontiers in Psychology, and of two special issues of Neural Networks, one on autonomous learning and the other on big data analytics. He is the Senior Editor of Big Data Analytics and serves on the editorial boards of Neural Networks and Cognitive Computation.

    He has served on the organizing committees of many scientific conferences. He started the Big Data conference series of INNS and was the General Co-Chair of the first one in San Francisco in 2015. He was the Technical Program Co-Chair of IJCNN 2015 in Ireland and the IJCNN Technical Program Co-Chair for the World Congress on Computational Intelligence 2018 (WCCI 2018) in Rio de Janeiro, Brazil. He is currently the IJCNN General Chair for WCCI 2020 in Glasgow, UK (https://www.wcci2020.org/). He is currently working on hardware-based (GPU, FPGA-based) machine learning for real-time learning from streaming data at the edge of the Internet of Things (IoT). He is also working on Explainable AI.



    Hanan Samet
    (University of Maryland) [introductory/intermediate]
    Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial and Spatio-textual Databases, Geographic Information Systems (GIS), and Location-based Services

    Summary

    The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial and spatiotextual databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space; in fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.

    We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space.
    For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO JAVA applet is presented that illustrates these methods (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). They are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava).
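
    As a concrete companion to the point representations discussed above, here is a minimal point-quadtree sketch in Python (an illustration under simplifying assumptions, not the VASCO implementation): each stored point splits the plane into four quadrants, so a rectangular window query can prune whole quadrants, which is the sense in which the structure "sorts" space.

        class QuadNode:
            """Point-quadtree node; children are the NE, NW, SW, SE quadrants."""
            def __init__(self, x, y, label):
                self.x, self.y, self.label = x, y, label
                self.children = [None] * 4

            def quadrant(self, x, y):
                if x >= self.x:
                    return 0 if y >= self.y else 3   # NE or SE
                return 1 if y >= self.y else 2       # NW or SW

            def insert(self, x, y, label):
                q = self.quadrant(x, y)
                if self.children[q] is None:
                    self.children[q] = QuadNode(x, y, label)
                else:
                    self.children[q].insert(x, y, label)

            def search(self, x0, y0, x1, y1, hits):
                """Report points inside [x0,x1] x [y0,y1], pruning non-intersecting quadrants."""
                if x0 <= self.x <= x1 and y0 <= self.y <= y1:
                    hits.append(self.label)
                for q, child in enumerate(self.children):
                    if child is None:
                        continue
                    if (q in (0, 3) and x1 < self.x) or (q in (1, 2) and x0 >= self.x):
                        continue   # query window lies entirely on the other side in x
                    if (q in (0, 1) and y1 < self.y) or (q in (2, 3) and y0 >= self.y):
                        continue   # query window lies entirely on the other side in y
                    child.search(x0, y0, x1, y1, hits)

        root = QuadNode(35, 42, "Chicago")
        for px, py, name in [(52, 10, "Mobile"), (27, 35, "Omaha"), (62, 77, "Toronto")]:
            root.insert(px, py, name)
        hits = []
        root.search(20, 30, 60, 80, hits)
        print(hits)   # ['Chicago', 'Omaha']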

    The above has been in the context of the traditional geometric representation of spatial data, while in the final part we review the more recent textual representation, which is used in location-based services where the key issue is that of resolving ambiguities. For example, does ``London'' correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of ``London'' is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM at http://tinyurl.com/newsstand-cacm or a cached version at http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the accompanying video at https://vimeo.com/106352925

    Syllabus

      1. Introduction
    • a. Sample queries
    • b. Spatial Indexing
    • c. Sorting approach
    • d. Minimum bounding rectangles (e.g., R-tree)
    • e. Disjoint cells (e.g., R+-tree, k-d-B-tree)
    • f. Uniform grid
    • g. Location-based queries vs. feature-based queries
    • h. Region quadtree
    • i. Dimension reduction
    • j. Pyramid
    • k. Region quadtrees vs. pyramids
    • l. Space ordering methods
      2. Points
    • a. point quadtree
    • b. MX quadtree
    • c. PR quadtree
    • d. k-d tree
    • e. Bintree
    • f. BSP tree
      3. Lines
    • a. Strip tree
    • b. PM1 quadtree
    • c. PM2 quadtree
    • d. PM3 quadtree
    • e. PMR quadtree
      4. Rectangles and arbitrary objects
    • a. MX-CIF quadtree
    • b. Loose quadtree
    • c. Partition fieldtree
    • d. R-tree
      5. Surfaces and Volumes
    • a. Restricted quadtree
    • b. Region octree
    • c. PM octree
      6. Metric Data
    • a. vp-tree
    • b. gh-tree
    • c. mb-tree
      7. Operations
    • a. Incremental nearest object location
    • b. Boolean set operations
      8. Spatial Database Issues
    • a. General issues
    • b. Specific issues
      9. Indexing for spatiotextual databases and location-based services delivered on platforms such as smart phones and tablets
    • a. Incorporation of spatial synonyms in search engines
    • b. Toponym recognition
    • c. Toponym resolution
    • d. Spatial reader scope
    • e. Incorporation of spatiotemporal data
    • f. System integration issues
      10. Example systems
    • a. SAND internet browser
    • b. JAVA spatial data applets
    • c. STEWARD
    • d. NewsStand on a smartphone
    • e. TwitterStand

    References

      1. H. Samet. ``Foundations of Multidimensional and Metric Data Structures.'' Morgan-Kaufmann, San Francisco, 2006.
      2. H. Samet. ``A sorting approach to indexing spatial data.'' International Journal of Shape Modeling, 14(1):15--37, June 2008.
      3. G. R. Hjaltason and H. Samet. ``Index-driven similarity search in metric spaces.'' ACM Transactions on Database Systems, 28(4):517--580, December 2003.
      4. G. R. Hjaltason and H. Samet. ``Distance browsing in spatial databases.'' ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD.
      5. G. R. Hjaltason and H. Samet. ``Ranking in spatial databases.'' In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., Portland, ME, August 1995, 83--95. Also Springer-Verlag Lecture Notes in Computer Science.
      6. H. Samet. ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.'' Addison-Wesley, Reading, MA, 1990.
      7. H. Samet. ``The Design and Analysis of Spatial Data Structures.'' Addison-Wesley, Reading, MA, 1990.
      8. C. Esperanca and H. Samet. ``Experience with SAND/Tcl: a scripting tool for spatial databases.'' Journal of Visual Languages and Computing, 13(2):229--255, April 2002.
      9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. ``Use of the SAND spatial browser for digital government applications.'' Communications of the ACM, 46(1):63--66, January 2003.
      10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. ``NewsStand: A new view on news.'' Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, November 2008, 144--153. SIGSPATIAL 10-Year Impact Award.
      11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. ``Reading news with maps by exploiting spatial synonyms.'' Communications of the ACM, 57(10):64--77, October 2014.
      12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. ``TwitterStand: News in tweets.'' Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November 2009, 42--51.
      13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. ``Geotagging with local lexicons to build indexes for textually-specified spatial data.'' Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, CA, March 2010, 201--212.
      14. M. D. Lieberman and H. Samet. ``Multifaceted toponym recognition for streaming news.'' Proceedings of the ACM SIGIR Conference, Beijing, July 2011, 843--852.
      15. M. D. Lieberman and H. Samet. ``Adaptive context features for toponym resolution in streaming news.'' Proceedings of the ACM SIGIR Conference, Portland, OR, August 2012, 731--740.
      16. M. D. Lieberman and H. Samet. ``Supporting rapid processing and interactive map-based exploration of streaming news.'' Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, November 2012, 179--188.
      17. H. Samet, B. C. Fruin, and S. Nutanong. ``Duking it out at the smartphone mobile app mapping API corral: Apple, Google, and the competition.'' In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems (MobiGIS 2012), Redondo Beach, CA, November 2012.
      18. H. Samet, S. Nutanong, and B. C. Fruin. ``Dynamic presentation consistency issues in smartphone mapping apps.'' Communications of the ACM, 59(9):58--67, September 2016.
      19. H. Samet, S. Nutanong, and B. C. Fruin. ``Static presentation consistency issues in smartphone mapping apps.'' Communications of the ACM, 59(5):88--98, May 2016.
      20. G. Quercini, H. Samet, J. Sankaranarayanan, and M. D. Lieberman. ``Determining the spatial reader scopes of news sources using local lexicons.'' In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, November 2010, 43--52.
      21. Spatial data structure applets at http://www.cs.umd.edu/~hjs/quadtree/index.html.

    Pre-requisites

    Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

    Short Bio

    Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished University Professor of Computer Science at the University of Maryland, College Park, and is a member of the Institute for Advanced Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research, where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code.
    He is the author of the recent book ``Foundations of Multidimensional and Metric Data Structures'' (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf), published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the American Publishers Association (AAP), and of the first two books on spatial data structures, ``The Design and Analysis of Spatial Data Structures'' and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990. He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He was recently elected to the SIGGRAPH Academy. He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. The 2008 ACM SIGSPATIAL ACMGIS best paper award winner also received the SIGSPATIAL 10-Year Impact Award. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.



    Rory Smith
    (Monash University) [introductory/intermediate]
    Learning from Data, the Bayesian Way

    Summary

    What is the statistically optimal way to detect and extract information from signals in noisy data? After detecting ensembles of signals, what can we learn about the population of all the signals? This course will address these questions using the language of Bayesian inference. After reviewing the basics of Bayes' theorem, we will frame the problem of signal detection in terms of hypothesis testing and model selection. Extracting information from signals will be cast in terms of computing posterior density functions of signal parameters. After reviewing model selection and parameter estimation, the course will focus on practical methods. Specifically, we will implement sampling algorithms which we will use to perform model selection and parameter estimation on signals in synthetic data sets. Finally, we will ask what can be learned about the population properties of an ensemble of signals. This population-level inference will be studied as a hierarchical inference problem.

    Syllabus

    • The basics of Bayesian inference
    • Parameter estimation, hypothesis testing, and model selection
    • Sampling methods: MCMC and Nested Sampling (see the sketch after this list)
    • Illustrative examples: detecting signals in noise using Bayesian inference
    • Illustrative examples: performing parameter estimation to learn about a signal’s properties
    • Hierarchical inference: using “hyper-parameter” estimation to learn about populations of signals
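
    To make the sampling-methods item concrete, here is a minimal Metropolis MCMC sketch in Python (an illustration only; the course will use more capable samplers such as emcee and nested sampling) that draws from the posterior of a constant signal amplitude buried in Gaussian noise:

        import numpy as np

        rng = np.random.default_rng(1)

        # Synthetic data: a constant signal A_true observed through Gaussian noise.
        A_true, sigma = 2.0, 1.0
        data = A_true + rng.normal(0.0, sigma, size=50)

        def log_posterior(A):
            """Flat prior on A, Gaussian likelihood (up to an additive constant)."""
            return -0.5 * np.sum((data - A) ** 2) / sigma**2

        chain, A = [], 0.0
        lp = log_posterior(A)
        for _ in range(5000):
            proposal = A + rng.normal(0.0, 0.3)     # symmetric random-walk proposal
            lp_prop = log_posterior(proposal)
            if np.log(rng.uniform()) < lp_prop - lp:
                A, lp = proposal, lp_prop           # accept with the Metropolis ratio
            chain.append(A)

        samples = np.array(chain[1000:])            # discard burn-in
        print(f"posterior: A = {samples.mean():.2f} +/- {samples.std():.2f}")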

    References

    • Probability Theory: The Logic of Science, E. T. Jaynes, Cambridge University Press
    • Nested sampling for general Bayesian computation, J. Skilling, Bayesian Analysis (2006)
    • For an overview of Markov Chain Monte Carlo (MCMC), see e.g. Bayesian Data Analysis, Andrew Gelman, Chapman & Hall
    • For a practical implementation of MCMC, see e.g. emcee, D. Foreman-Mackey, https://dfm.io/emcee/current/

    Pre-requisites

    Basic probability theory, sampling, Python, and JupyterHub.

    Short Bio

    Dr. Rory Smith is a lecturer in physics at Monash University in Melbourne, Australia. From 2013-2017, he was a senior postdoctoral fellow at the California Institute of Technology where he worked on searches for gravitational waves. Dr. Smith participated in the landmark first detection of gravitational waves for which the 2017 Nobel Prize in physics was awarded. Dr. Smith’s research focuses on detecting astrophysical gravitational-wave signals from black holes and neutron stars, and extracting the rich astrophysical information encoded within to study the fundamental nature of spacetime.



    Jaideep Srivastava
    (University of Minnesota) [introductory/intermediate]
    Social Computing

    Summary

    Social Computing is an emerging discipline, and like any discipline at a nascent stage, it can mean different things to different people. However, three distinct threads are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and multiplayer online games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The goal of this course is to discuss, in a tutorial manner, through case studies, and through discussion, what Social Computing is, where it is headed, and where it is taking us.

    Syllabus: Lecture Outline (3 modules of 2 hrs each)

    • Module 1: Introduction

      • Introduction to Social Computing
      • Changing paradigms in social science research
      • Brief overview of CSS and Social Analytics
      • Small groups and their evolution
    • Module 2: Science

      • The nature of online trust
      • Influence and social capital in online networks
      • Health assessment of social networks
    • Module 3: Applications

      • Social computing for citizen services
      • Social Analytics for business applications
      • Some thoughts on privacy

    Pre-requisites

    This course is intended primarily for graduate students. The following are the potential audiences:

    • ►Computer Science graduate students: All that is needed for this audience is an interest in one of the themes of social computing
    • ►Social Science graduate students: Some exposure to building models from data, or at least to what these techniques are and what they can do
    • ►Management graduate students: Those with an MIS focus

    References

    References will be provided later.

    Short Bio

    Jaideep Srivastava (https://www.linkedin.com/in/jaideep-srivastava-50230/) is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina's Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of PAKDD for his lifetime contributions to the field of machine learning and data mining. He has supervised 39 PhD dissertations and over 65 MS theses, and has mentored a number of post-doctoral fellows and junior scientists in industry and academia. He has authored or co-authored over 420 papers in journals and conferences, and filed 8 patents. Seven of his papers have won best paper awards, and he has a Google Scholar citation count of over 25,000 and an h-index of 59 (https://scholar.google.com/citations?user=Y4J5SOwAAAAJ&hl=en&oi=ao).

    Dr. Srivastava’s research has been supported by a broad range of government agencies, including NSF, NASA, ARDA, DARPA, IARPA, NIH, CDC, US Army, US Air Force, and MNDoT; and industries, including IBM, United Technologies, Eaton, Honeywell, Cargill, Allina and Huawei. He is a regular participant in the evaluation committees of various US and international funding agencies, on the organizing and steering committees of various international scientific forums, and on the editorial boards of a number of journals.

    Dr. Srivastava has significant experience in the industry, in both consulting and executive roles. Most recently he was the Chief Scientist for Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he was the data mining architect for Amazon.com (www.amazon.com), built a data analytics department at Yodlee (www.yodlee.com), and served as the Chief Technology Officer for Persistent Systems (www.persistentsys.com). He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics (www.ninjametrics.com), based on his research in behavioral analytics. He was advisor and Chief Scientist for CogCubed (www.cogcubed.com), an innovative company with the goal to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc (https://www.teladoc.com/), and for Jornaya (https://www.jornaya.com/). He is presently a technology advisor to a number of startups at various stages, including Kipsu (http://kipsu.com/), which provides an innovative solution to improving service quality in the hospitality industry, and G2lytics (https://g2lytics.com/), which uses machine learning to identify tax compliance problems.

    Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is an advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.25+ billion citizens of India.

    Dr. Srivastava has delivered over 170 invited talks in over 35 countries, including more than a dozen keynote addresses at major international conferences. He has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and an MS and PhD from the University of California, Berkeley.



    Mayte Suárez-Fariñas
    (Icahn School of Medicine at Mount Sinai) [intermediate/advanced]
    Meta-analysis Methods for High-dimensional Data


    Jeffrey Ullman
    (Stanford University) [introductory]
    Big-data Algorithms That Aren't Machine Learning

    Summary

    We shall study algorithms that have been found useful in querying large datasets. The emphasis is on algorithms that cannot be considered "machine learning."

    Syllabus

    • ►Locality-sensitive hashing: shingling, minhashing, applications (a minhashing sketch follows this list);
    • ►Stream-processing algorithms: counting occurrences, counting unique values, sampling;
    • ►Graph-processing algorithms: social networks, disjoint and overlapping communities, counting neighborhoods, counting triangles, transitive closure.
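
    As a taste of the first topic, below is a minimal minhashing sketch. It is written for this outline rather than taken from the textbook; the fraction of random hash functions on which two signatures agree estimates the Jaccard similarity of the underlying shingle sets.

        import random

        def shingles(text, k=3):
            # The set of k-shingles: contiguous substrings of length k.
            return {text[i:i + k] for i in range(len(text) - k + 1)}

        def minhash_signature(shingle_set, num_hashes=100, prime=2**61 - 1, seed=0):
            rng = random.Random(seed)
            # Random affine hash functions h(x) = (a*x + b) mod prime.
            params = [(rng.randrange(1, prime), rng.randrange(prime))
                      for _ in range(num_hashes)]
            return [min((a * hash(s) + b) % prime for s in shingle_set)
                    for a, b in params]

        def estimated_jaccard(sig1, sig2):
            # Fraction of positions where the two signatures agree.
            return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

        s1, s2 = shingles("the quick brown fox"), shingles("the quick brown dog")
        est = estimated_jaccard(minhash_signature(s1), minhash_signature(s2))
        print(f"estimated Jaccard: {est:.2f}, exact: {len(s1 & s2) / len(s1 | s2):.2f}")

    In a full locality-sensitive hashing pipeline, signatures would additionally be banded into buckets so that only likely-similar pairs are ever compared.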

    Prerequisites

    A course in algorithms at the advanced-undergraduate level is important.

    References

    We will be covering (parts of) Chapters 3, 4, and 10 of the free textbook Mining of Massive Datasets (third edition) by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org.

    Short Bio

    A brief online bio is available at i.stanford.edu/~ullman/pub/opb.txt.



    Wil van der Aalst
    (RWTH Aachen University) [introductory/intermediate]
    Process Mining: A Very Different Kind of Machine Learning That Can Be Applied in Any Organization

    Summary

    Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. The use of process mining is rapidly increasing and there are over 30 commercial vendors of process mining software. Through concrete data sets and easy-to-use software, the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.

    The course explains the key analysis techniques in process mining. Participants will learn various process discovery algorithms. These can be used to automatically learn process models from raw event data. Various other process analysis techniques that use event data will be presented. Moreover, the course will provide easy-to-use software, real-life data sets, and practical skills to directly apply the theory in a variety of application domains.

    Process mining provides not only a bridge between data mining and business process management; it also helps to address the classical divide between "business" and "IT". Evidence-based business process management based on process mining helps to create a common ground for business process improvement and information systems development.

    Note that Gartner identified process-mining software as a new and important class of software. One can witness the rapid uptake by looking at the successful vendors (e.g., Celonis, Disco, ProcessGold, myInvenio, PAFnow, Minit, QPR, Mehrwerk, Puzzledata, LanaLabs, StereoLogic, Everflow, TimelinePI, Signavio, and Logpickr) and the organizations applying process mining at a large scale with thousands of users (e.g., Siemens and BMW). Yet many traditional mainstream-oriented data scientists (machine learners and data miners) are not aware of this. This explains the relevance of the course for BigDat 2020 participants.

    Syllabus

    The course focuses on process mining as the bridge between data science and process science. The course will introduce the three main types of process mining.

      1. The first type of process mining is discovery. A discovery technique takes an event log and produces a process model without using any a-priori information. An example is the Alpha algorithm, which takes an event log and produces a process model (a Petri net) explaining the behavior recorded in the log.
      2. The second type of process mining is conformance. Here, an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa.
      3. The third type of process mining is enhancement. Here, the idea is to extend or improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. An example is the extension of a process model with performance information, e.g., showing bottlenecks.

    Process mining techniques can be used in an offline setting, but also online. The latter is known as operational support. An example is the detection of non-conformance at the moment the deviation actually takes place. Another example is time prediction for running cases, i.e., given a partially executed case, the remaining processing time is estimated based on historic information of similar cases. (A minimal sketch of the first two types follows below.)
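
    To make discovery and conformance concrete, here is a short sketch using the open-source pm4py library. pm4py is not part of the course's own tooling, and the file name example.xes is a placeholder; both are assumptions made for this illustration.

        import pm4py

        # Load an event log stored in the standard XES format
        # ("example.xes" is a placeholder path).
        log = pm4py.read_xes("example.xes")

        # Discovery: the Alpha algorithm learns a Petri net from the event log.
        net, initial_marking, final_marking = pm4py.discover_petri_net_alpha(log)

        # Conformance: token-based replay measures how well the log fits the model.
        fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
        print(fitness)

    Enhancement would then project information such as timestamps back onto the discovered model, e.g., to reveal bottlenecks.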

    The course uses many examples based on real-life event logs to illustrate the concepts and algorithms. After taking this course, one is able to run process mining projects and has a good understanding of the Business Process Intelligence field.

    References

    W.M.P. van der Aalst. Process Mining: Data Science in Action. Springer-Verlag, Berlin, 2016. (The course will also provide access to slides, several articles, software tools, and data sets.)

    Pre-requisites

    This course is aimed at both students (Master or PhD level) and professionals. A basic understanding of logic, sets, and statistics (at the undergraduate level) is assumed. Basic computer skills are required to use the software provided with the course (but no programming experience is needed). Participants are also expected to have an interest in process modeling and data mining but no specific prior knowledge is assumed as these concepts are introduced in the course.

    Short Bio

    Prof.dr.ir. Wil van der Aalst is a full professor at RWTH Aachen University, leading the Process and Data Science (PADS) group. He is also part-time affiliated with the Fraunhofer-Institut für Angewandte Informationstechnik (FIT), where he leads FIT's Process Mining group, and with the Technische Universiteit Eindhoven (TU/e). Until December 2017, he was the scientific director of the Data Science Center Eindhoven (DSC/e) and led the Architecture of Information Systems group at TU/e. Since 2003, he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a distinguished fellow of Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 220 journal papers, 20 books (as author or editor), 500 refereed conference/workshop publications, and 75 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an h-index of 144 and has been cited over 96,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over ten scientific journals, he also plays an advisory role for several companies, including Fluxicon, Celonis, ProcessGold, and Bright Cape. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. In 2018, he was awarded an Alexander von Humboldt Professorship.