This lesson is in the early stages of development (Alpha version)

Introduction to Machine Learning For Engineering Research (ML4ER)

Basics of Machine Learning

Overview

Teaching: 25 min
Exercises: 120 min
Questions
  • What type of research problems can machine learning solve, and what approach do models use to accomplish this?

  • How does a machine learning model make predictions from its inputs?

  • What are the key steps (together called a workflow) for solving a machine learning research problem?

Objectives
  • Students can describe the structure of several basic machine learning model types, the types of predictions that can be made, and metrics for assessing regression and classification performance

  • Students will practice using Jupyter notebooks to execute machine learning workflows on cloud computing platforms, skills they can transfer to other projects.

  • Students will assess model performance on two potential applications and justify which task the model is appropriate for

Background

In this introductory lecture we give a summary of basic machine learning model types, descriptions of input features and data types, and an overview of several overarching ideas in machine learning.
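
To make the regression and classification metrics mentioned in the objectives above concrete, here is a minimal sketch of how they are typically computed with scikit-learn. The numbers are made up purely for illustration.

    # Minimal sketch of common regression and classification metrics (scikit-learn).
    # The values below are made up purely for illustration.
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score

    # Regression: compare measured vs. predicted property values
    y_true = [1.2, 0.8, 3.5, 2.1]
    y_pred = [1.0, 1.1, 3.2, 2.4]
    print("MAE: ", mean_absolute_error(y_true, y_pred))
    print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)
    print("R^2: ", r2_score(y_true, y_pred))

    # Classification: compare true vs. predicted class labels
    labels_true = [0, 1, 1, 0, 1]
    labels_pred = [0, 1, 0, 0, 1]
    print("Accuracy:", accuracy_score(labels_true, labels_pred))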

Recorded Lecture

You may also view the Lecture Slides and Lecture Notes in addition to the recorded lecture.

Readings

Here are a few nice introductory articles discussing machine learning basics. They cover the topic from slightly different perspectives, and go into some concepts which will be covered in more detail in other modules.

Towards Data Science Article

Machine Learning Mastery Article

Activity: Introduction to Machine Learning for Materials Science

This activity gives a hands-on example of executing a machine learning workflow to train a model that predicts materials properties. The activity runs through a Jupyter Notebook hosted on the Nanohub platform. This notebook contains instructions, code, and exercises to read, execute, and answer within the notebook itself. You can begin the activity by launching the tool on Nanohub here. To complete this activity you will need to either create an account or log in with an existing Google account to be able to run the tool.

If you’d like you may also follow along with a video walkthrough which discusses the main sections of the notebook activity.

Important!
Saving your work in Nanohub is a bit tricky. Follow these instructions to create a copy of the notebook once you’ve launched the tool via the link.

  1. With your notebook open (by clicking the launch tool button) make a copy if you haven’t already using the File drop down menu in the top left.
  2. The copy will open in a new tab in your browser. Save the copy by clicking “save and checkpoint” under the File drop down menu in the new tab.
  3. At this point you may close the original tab as we won’t use it anymore.
  4. To find the newly copied file (to download or open it up later) we will use the Jupyter tool on Nanohub. Launch this tool which will give us access to the virtual computer that Nanohub is hosting for us.
  5. Navigate to the folder “data/results/####/intromllab/bin/”. Note: the number will be a unique number associated with your session. Pick the most recent one if there are multiple (or an older one if you are looking for an older saved file).
  6. Double click the notebook file “intromllab-Copy1.ipynb” to launch your saved notebook. This is the newly copied notebook file that you just created.
  7. You can now save this notebook freely. To return to it later, launch the Jupyter tool again (step 4) and follow the instructions from there.

Note: do not move the notebook or save it to another location on Nanohub’s virtual computer, or some aspects of the code will break.

Follow the Link to get started if you did not click above.

Extra Help

If you are starting with little python background, you may find it useful to review the introductory module on python programming via Software Carpentry.

Key Points

  • Machine learning (specifically supervised learning) can be used to model complex materials properties that are hard to obtain experimentally

  • Machine learning models make predictions by learning relationships and patterns from existing data and applying those learned patterns to predict the properties of new materials

  • A workflow for building machine learning models can be broken down into key steps: Data Cleaning, Feature Generation, Feature Engineering, Model Assessment, Model Optimization, and Model Predictions
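
As a preview of these workflow steps, below is a minimal, hypothetical sketch of a supervised regression workflow written with scikit-learn. The synthetic data and column names are placeholders, not the dataset used in the Nanohub activity; the Nanohub notebook walks through analogous steps with a real materials dataset.

    # Hypothetical sketch of a supervised regression workflow for a materials property.
    # Synthetic data stands in for a real dataset (which you would load from a CSV file).
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["feat_a", "feat_b", "feat_c", "feat_d"])
    df["property"] = 2.0 * df["feat_a"] - df["feat_b"] + rng.normal(scale=0.1, size=200)

    df = df.dropna()                                   # data cleaning: drop missing values
    X, y = df.drop(columns=["property"]), df["property"]

    # model assessment: hold out a test set, fit, and check error on unseen data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

    # model predictions: apply the fitted model to features of new candidate materials
    new_materials = pd.DataFrame(rng.normal(size=(3, 4)), columns=X.columns)
    print("Predicted properties:", model.predict(new_materials))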


Establishing Research Workflows

Overview

Teaching: 15 min
Exercises: 120 min
Questions
  • What is MAST-ML, and how can it be set up to run on Google Colab?

  • What is the process of using MAST-ML to execute machine learning workflows?

  • What advantages does MAST-ML have in developing machine learning models for informatics research?

Objectives
  • Using the MAST-ML software, students will execute common machine learning workflow steps: data cleaning, feature generation, feature engineering, model assessment through cross validation, model optimization, and model predictions.

  • Students will compare error metrics from K-fold cross validation using different numbers of folds and assess the effects on reported error metrics.
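
To preview the second objective, the sketch below compares cross-validated error for different numbers of folds using scikit-learn; the synthetic data is only a stand-in for the dataset used in the activity.

    # Sketch: compare cross-validated error for different numbers of folds.
    # Synthetic data is used here as a stand-in for the activity dataset.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

    for k in (2, 5, 10):
        scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                                 cv=k, scoring="neg_mean_absolute_error")
        print(f"{k}-fold CV mean MAE: {-scores.mean():.3f}")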

Background

The MAterial Science Toolkit for Machine Learning (MAST-ML) is a machine learning workflow software package built in python that leverages several other underlying python libraries to enable users to build and test models with little to no python background. The toolkit generates common outputs such as parity plots, learning curves, and standard error metrics, which enable new users to quickly find the information needed to make informed decisions about how their models are performing.

Google Colab is a cloud computing resource that allows us to run software on the cloud instead of our local computer, giving access to more computing power and a standardized computing environment. We’re going to use Google Colab to run the MAST-ML software in a series of tutorials that demonstrate how to perform a basic machine learning workflow.

MAST-ML is divided into a number of sections which allow users to perform ML workflows. These include: Data Importation, Data Cleaning, Feature Generation, Feature Engineering, Data Splitting (Train/Test/Validation), Model Assessment, and Model Optimization. By giving the user freedom to choose methods at each of these workflow steps, MAST-ML is extremely customizable compared to some other software tools, which are much more structured in the methods they use.
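
To give a feel for what these workflow sections do, here is a rough sketch of the same steps written directly against scikit-learn. This is not the MAST-ML API itself, just an illustration (with synthetic data) of the kind of pipeline MAST-ML configures and runs for you.

    # Rough illustration of the workflow steps that MAST-ML organizes, written directly
    # in scikit-learn (this is NOT the MAST-ML API; synthetic data stands in for a real set).
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))                       # imported + cleaned feature matrix
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

    workflow = Pipeline([
        ("scale", StandardScaler()),                     # feature engineering: normalization
        ("select", SelectKBest(f_regression, k=5)),      # feature engineering: selection
        ("model", RandomForestRegressor(random_state=0)),
    ])

    # data splitting + model assessment via 5-fold cross validation
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(workflow, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    print("Mean CV RMSE:", -scores.mean())

MAST-ML’s value is that it wires these pieces together and writes out the plots and error summaries for you, so you only choose the method at each step.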

For those with more interest in the python programming itself

If you have a little more programming background you may be interested in checking out the GitHub page for the code below to see a bit about how it’s structured. There are usually ongoing efforts in the Informatics Skunkworks group at UW-Madison to improve and add features to MAST-ML, so if this is something you might be interested in, let us know!

For more detailed information on the structure of MAST-ML please see the MAST-ML Documentation.
There are also several other Jupyter Notebooks included there that demonstrate additional functionality in MAST-ML that is not covered in this module.

Activity: Your First MAST ML Run

It’s time to conduct your very first MAST-ML run! For this first run, we will set up a notebook in Google Colab and demonstrate how to call the MAST-ML package to perform machine learning and data science workflow steps.

  1. Login to Google Colab in your browser. NOTE: Each time you return to Google Colab you will need to rerun the setup instructions as Colab will delete your previous work after signing off.
  2. Download the MASTML_Colab.ipynb notebook to your local computer. This notebook contains python code to install MAST-ML in the Colab environment and introduces you to the basic ways we can interact with MAST-ML to perform machine learning workflows.
  3. Upload the MASTML_Colab.ipynb file to Colab using File -> Upload Notebook.
  4. With the notebook uploaded you should see something like this; from here you should be able to follow along with the instructions in the notebook:

Introduction to MAST-ML Notebook

Key Points

  • MAST-ML is a machine learning workflow package that enables rapid iteration of model building and analysis due to the ability to customize each of the key steps in a machine learning workflow.

  • With the notebook we’ve provided, MAST-ML can be installed on Google Colab

  • In this activity users have run one example workflow with one set of choices at each step. Users can then change each step to explore other workflow versions and choices.


Comparing Model Types

Overview

Teaching: 15 min
Exercises: 120 min
Questions
  • How does MASTML enable users to investigate different options within a machine learning workflow?

  • Specifically, how do users change the model type, hyperparameters, and define a grid search of hyperparameters?

Objectives
  • Students will recreate workflows in MAST-ML from other machine learning software (like Citrination or base ML software packages like scikit-learn).

  • Students will practice modifying the model type and hyperparameters within a machine learning workflow in MAST-ML by editing code in a Jupyter Notebook.

  • Students will compare a random forest model that’s coded into the Jupyter Notebook to a model of their choice and assess the relative performance of the model types.

Background

This module serves as a more detailed exploration of MAST-ML that was introduced in Module 3: Introduction to MAST-ML. In that module students learn the basics of running MAST-ML through a Jupyter Notebook. In this module we will expand upon that initial understanding and practice making the kinds of edits and modifications to a workflow that one would make when using MAST-ML in a research project.

The MAterial Science Toolkit for Machine Learning (MAST-ML) is a machine learning workflow software package built in python that leverages several other underlying python libraries to enable users to build and test models with little to no python background. The toolkit generates common outputs such as parity plots, learning curves, and standard error metrics, which enable new users to quickly find the information needed to make informed decisions about how their models are performing.

Google Colab is a cloud computing resource that allows us to run software on the cloud instead of our local computer, giving access to more computing power and a standardized computing environment. We’re going to use Google Colab to run the MAST-ML software in a series of tutorials that demonstrate how to perform a basic machine learning workflow.

MAST-ML is divided into a number of sections which allow users to perform ML workflows. These include: Data Importation, Data Cleaning, Feature Generation, Feature Engineering, Data Splitting (Train/Test/Validation), Model Assessment, and Model Optimization. By giving the user freedom to choose methods at each of these workflow steps, MAST-ML is extremely customizable compared to some other software tools, which are much more structured in the methods they use. For example, the Citrination web tool focuses on building and training random forest models but doesn’t give the flexibility to explore other model types.
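
To make “changing the model type” concrete, the sketch below swaps a random forest for a kernel ridge model on the same data and compares cross-validated errors. It uses plain scikit-learn with synthetic data, not the MAST-ML call from the notebook.

    # Sketch: compare two model types on the same data via cross validation.
    # X and y are placeholders; in the activity they come from the bandgap dataset.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

    models = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "kernel_ridge": KernelRidge(alpha=1.0, kernel="rbf", gamma=0.1),
    }
    for name, model in models.items():
        mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
        print(f"{name}: mean CV MAE = {mae:.3f}")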

For those with more interest in the python programming itself

If you have a little more programming background you may be interested in checking out the GitHub page for the code below to see a bit about how it’s structured. There are usually ongoing efforts in the Informatics Skunkworks group at UW-Madison to improve and add features to MAST-ML, so if this is something you might be interested in, let us know!

Readings

For more detailed information on the structure of MAST-ML please see the MAST-ML Documentation. There are also several other Jupyter Notebooks included there that demonstrate additional functionality in MAST-ML that is not covered in this module.

Activity: Modifying Machine Learning Workflows with MASTML

For this activity we will set up a notebook in Google Colab and demonstrate how to call the MAST-ML package to perform machine learning and data science workflow steps. We will then go through how to make edits to this notebook that enable users to modify each step in the workflow.

  1. We will be using Google Colab to perform all of our computing, so to begin, download the .ipynb notebook file, the bandgap_data_v2.csv file, and the generated_features.xlsx file to your local computer.

  2. Login to Google Colab in your browser. NOTE: Each time you return to Google Colab you will need to rerun the setup instructions as Colab will delete your previous work after signing off.

  3. Upload the notebook file using the File dropdown menu option “upload notebook”, or the popup you get when first opening Colab.

  4. With the notebook running, upload the data files “bandgap_data_v2.csv” and “generated_features.xlsx” using the upload button highlighted in red below.

    Uploading Data to Colab

  5. All further instructions are contained within the notebook file directly. You can begin running code in the notebook by pressing “shift+enter” when selecting any of the code cells.

Note

For this activity there are references to a previous workflow that was performed in the ML lab activity hosted on Nanohub.
If you haven’t gone through that activity, it may be useful to go through both side by side. One goal of this activity is to show how MAST-ML can be used to mirror previous machine learning workflows, so working through them in parallel could be a useful way to engage with both.

Key Points

  • MAST-ML is divided into separate sections which execute key steps in a machine learning workflow. By changing individual steps with a few lines of code, we can adjust settings and configurations at each step.

  • In the exercises we demonstrate changes to the model type and hyperparameters. Additionally, changes can be made to data cleaning, feature generation/engineering, and model assessment by making similar edits in the notebook.


Optimizing Model Hyperparameters

Overview

Teaching: 15 min
Exercises: 120 min
Questions
  • How does MAST-ML enable users to modify the hyperparameters of a model?

  • Building off of the idea of using a grid search of hyperparameters, how can users converge on an optimized model?

Objectives
  • Students learn about how model hyperparameters can affect performance and are introduced to some basic ideas on how these hyperparameters can be optimized, namely the grid search method.

  • Students explore various methods of searching for a global minimum in model errors and avoiding getting trapped in a local minimum.

  • Students run a neural network within MAST-ML. They can describe what neural networks are, how to use them, and how to optimize them.

Background

In other modules in this curriculum we’ve introduced the basic idea of optimizing models by modifying their hyperparameters. Hyperparameters are parameters that affect how the model learns and put constraints on the learning process. In other activities this optimization has usually been limited to a single step in which a grid of values for one example hyperparameter is generated and the best-performing model from that grid is identified as the optimized model.

In this module we’ll look in more detail at an optimization procedure that goes beyond that basic introduction. The main activity of this module uses the MAST-ML software, and we’ll primarily work with the multi-layer perceptron neural network model from scikit-learn.
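
Since the activity centers on scikit-learn’s multi-layer perceptron, here is a minimal sketch of a single grid-search step over two of its hyperparameters. The value grids and synthetic data are illustrative, not the ones used in the notebook.

    # Sketch: one grid-search step over MLP hyperparameters (illustrative grids only).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 6))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)

    param_grid = {
        "hidden_layer_sizes": [(10,), (50,), (50, 50)],
        "alpha": [1e-4, 1e-3, 1e-2],                 # L2 regularization strength
    }
    search = GridSearchCV(MLPRegressor(max_iter=2000, random_state=0),
                          param_grid, cv=5, scoring="neg_mean_absolute_error")
    search.fit(X, y)
    print("Best hyperparameters:", search.best_params_)
    print("Best CV MAE:", -search.best_score_)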

Readings

If you haven’t worked with neural networks before, I’d recommend reviewing this very nice introductory series from the YouTube channel 3blue1brown, which covers not only the basic structure of neural networks but also how they learn, all while walking through a nice example problem.

Activity

This activity is broken down into two parts. The first is an introduction to Neural Networks which we will then use in the second part to explore optimizing a neural network.

  1. For an interactive way to get to know neural networks better, visit this neural network demo by TensorFlow
    1.1. Complete the activities related to the playground included in the ML crash course by google
    1.2. (Optional) If you are looking to practice some programming you can also continue on to the related “programming exercise” that they include which will link you to google colab and provide its own instructions.
  2. Colab activity: We will be using Google Colab to perform all of our computing, so to begin, download the
    .ipynb notebook file and
    the bandgap_data_v2.csv file to your local computer.
    2.1. First start by uploading the .ipynb file to Colab
    2.2. Then upload the data files to a folder in your google drive called “MASTML_colab”
    2.3. The rest of the instructions for this activity are included within the notebook itself.

Note if you have completed previous modules

If you just completed the modifying workflows activity, you may have this set up already and do not need to download or upload the data files, as they already exist in your MASTML_colab folder.

Key Points

  • A sequential grid search of candidate hyperparameter values can progressively search for the best combination of model hyperparameters.

  • Often, multiple grid searches are required to explore a hyperparameter space in sufficient detail; a coarse-then-fine example is sketched below.

  • When exploring larger numbers of hyperparameters (more than 4 or so), it may be better to use more sophisticated search techniques due to computational constraints.
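
One way to picture the “multiple grid searches” idea from the key points above is a coarse search followed by a finer search around the best value. The sketch below shows this for a single hyperparameter; the grids and data are made up for illustration.

    # Sketch: coarse-then-fine grid search over one hyperparameter (alpha).
    # Data and grid values are made up for illustration.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(150, 6))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)

    def best_alpha(alphas):
        search = GridSearchCV(MLPRegressor(max_iter=2000, random_state=0),
                              {"alpha": list(alphas)}, cv=5,
                              scoring="neg_mean_absolute_error")
        search.fit(X, y)
        return search.best_params_["alpha"]

    coarse = best_alpha(10.0 ** np.arange(-6, 1))                # coarse grid: 1e-6 ... 1
    fine = best_alpha(coarse * np.array([0.25, 0.5, 1, 2, 4]))   # refine around the coarse winner
    print("Refined alpha:", fine)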


Ethical Data Cleaning

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • When is it appropriate to modify data used in machine learning modeling?

  • When / What types of modifications are not OK?

Objectives
  • Students learn about responsible conduct of research practices when determining what data can be excluded in fitting and assessment with machine learning.

Background

We may not often think about it, but there are some important ethical issues when it comes to handling data, figures, and images. Here we will focus on ethical issues surrounding the practice of data cleaning in the context of machine learning where you are using computational algorithms to make predictions based on their experience with training datasets.

Obviously, fabricating data is wrong, but sometimes it may seem fuzzier when considering whether an action is data falsification. For instance, you may have what you believe to be outlier data points in a dataset. If you are an experimentalist and there is a clear, documented reason (for instance a comment in your lab notebook about high room temperature due to broken HVAC in the building, a sample contamination, or a power fluctuation during data collection), then it is reasonable to exclude the data points corresponding to the documented problem. However, it can be a bit more challenging to know whether it is appropriate to exclude certain data points when you are using someone else’s data in a machine learning context. Nonetheless it is an important consideration, and one that must be taken just as seriously as it would be by an experimentalist, so that you are handling data in an ethical manner.

Imagine that you had concerns about some data points in a dataset you are using for training. Or you find that your model is identifying unanticipated problems with a few of the data points used during training. Ideally you would contact the individual(s) who provided the data to ask them about the details. If this is published data then you would identify the corresponding author of the publication. This individual can be contacted with questions about the paper if you are delving deeply into its content. Often on the first page, but sometimes at the end of the article, you will find contact information for the corresponding author. This contact information should not be used frivolously, but if you have a serious question that you have not been able to find the answer to in any other way, it is reasonable to contact the corresponding author. A conversation with a person more deeply knowledgeable about the data may provide you with the added insights you need in order to know how to handle it properly.

In other cases, you may be pooling data from a variety of sources and find a discrepancy. In this case you might find that there are multiple, widely varying values given for the same material conditions. It may be difficult or impossible to judge whether one (or more) of them is in error. If you have many values for the condition and one value appears to be an outlier, then you may be able to show statistically that the outlier does not fall within the data population thus giving you justification to exclude it (although you should report any statistical exclusions that you made when writing about your methods). If you only have a few values for the condition and they vary widely from each other, then you should exclude all of them rather than selecting the one that seems to “fit” or excluding the one that seems “suspicious” without any justification.
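
If it helps to see what a statistical exclusion might look like, here is one simple, hypothetical check based on a z-score rule; in real work you would choose a test appropriate to your data and report both the rule and the threshold you used.

    # Hypothetical sketch: flag a value as a statistical outlier using a z-score rule.
    # The repeated measurements and the 2-sigma threshold are illustrative only;
    # whatever rule you actually use must be justified and reported.
    import numpy as np

    values = np.array([410.0, 415.0, 408.0, 412.0, 409.0, 530.0])  # same nominal condition
    z = (values - values.mean()) / values.std(ddof=1)
    print("z-scores:", np.round(z, 2))
    print("Flagged as outliers:", values[np.abs(z) > 2])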

A common situation where data cleaning comes up is when a model has trouble fitting certain data points. For example, a model might give large errors when predicting certain data points, whether they are left out as validation data during a cross-validation test or even included in the training data. Let us call this set of points P, for “problematic”. It might be tempting to identify P as somehow incorrect and remove them. However, it is not usually clear if the large prediction errors in P are a sign of problems in the data or in the model. It is therefore not ethical to simply remove P. What is ethical is to look again at P to see if there might be good reasons to exclude or correct them based on factors that might cause them to be wrong, e.g., errors in data entry, experimental methods, etc. It is also ethical to hypothesize that these data points might have errors, fit a model to the data without P, and present these results, with a clear statement that you have excluded P and that you are doing so based only on a speculative hypothesis, alongside the fits to the data including P. By supplying this information, the user of your model can assess whether they think your exclusion is reasonable and make an informed decision about whether to choose your model.
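
A sketch of the ethical version of this comparison: fit and report the model both with and without the hypothesized problem set P rather than silently dropping it. The data and the indices in P are placeholders for illustration.

    # Sketch: report model fits both with and without a hypothesized problem set P.
    # X, y, and the indices in P are placeholders for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    P = [3, 17, 42]                       # hypothesized problematic points (placeholder indices)
    keep = np.setdiff1d(np.arange(len(y)), P)

    for label, rows in [("including P", np.arange(len(y))), ("excluding P", keep)]:
        model = LinearRegression().fit(X[rows], y[rows])
        mae = mean_absolute_error(y[rows], model.predict(X[rows]))
        print(f"Fit {label}: training MAE = {mae:.3f}")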

These are just a few examples that you may run into, but there are others. Your primary goal is to treat the data you are working with ethically and not do anything that others might construe as data manipulation. If in doubt, consult your mentor about how data should be handled when you are unsure.

Activity: Reflect on actions in “Introduction to Citrination” module

In the Introduction to Citrination activity you were instructed to remove a handful of points from the fatigue strength dataset to practice this skill.

  • Identify a possible justification for removing these data points and then describe your cleaning process as if you were explaining it to another researcher or colleague.
  • Describe two examples of when the removal of these data points would be considered unethical data manipulation.

Reference: W.C. Crone, Introduction to Engineering Research, Morgan & Claypool Publishers (2020).
(Note: This book is available as a free PDF download from the university library.)

Key Points

  • Removing or ignoring data simply because a model predicts it poorly is not appropriate


Navigating Roadblocks and Obstacles

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • What do I do when I get stuck on an activity / assignment?

Objectives
  • Students identify a range of strategies for getting “unstuck” when tackling challenging problems and place these strategies in a hierarchy of actions that can be taken in a range of situations.

Background

Inevitably you will experience roadblocks and setbacks, both in developing an understanding of machine learning and in applying it to research problems.

Research is inherently challenging because you are trying to do something that has not been done before. You may need to alter assumptions, reframe hypotheses, and develop new ideas to get yourself around roadblocks. These modified/new assumptions/hypotheses/ideas may be ones that you generate yourself, or they may be ones that you generate with your peer colleagues, research mentors, collaborators, and/or your research group. This is fundamentally a part of the process of research and sometimes these obstacles can be the very thing that lead you to an unexpected and fruitful outcome.

There are two common mistakes to avoid: First, trying to be so independent and self-sufficient that you never ask for help and end up spinning your wheels endlessly. Second, immediately jumping to ask for assistance without taking some time to do some independent thinking so that you can troubleshoot the problem yourself. It can be a tough balance to find the middle ground, but recognizing the appropriate time to ask for help is part of the learning process.

Because you will inevitably run into roadblocks and obstacles and you will have to think of creative ways to get around, over, or through them, it is important to develop a collection of strategies that might be useful to you in different situations and identify which ones would be best to try first and which to save to last.

Activity: Build a hierarchy of strategies for getting unstuck

Consider the list of strategies below as a starting point and work with your group to add additional items and more detail (the who and how for your Skunkworks group).

  • Reread/review what is already available to you
  • Look for public online resources that might help (i.e. Google it)
  • Think about the problem sideways (i.e. look at it from another angle)
  • Take a short break and come back to it with fresh eyes
  • Break a big problem into small pieces and tackle each one separately
  • Interact with peer colleagues who may have dealt with something similar
  • Use your broader network
  • Interact with your group mentor
  • Seek out other subject experts
  • Identify ways you can move forward even in the face of uncertainty

After developing a more complete list, place them into three categories:

  1. Things I can try to do independently:
  2. If that doesn’t work, things I could try to do next:
  3. If my other attempts fail, then try:

Reference: W.C. Crone, Introduction to Engineering Research, Morgan & Claypool Publishers (2020).
(Note: This book is available as a free PDF download from the university library.)

Key Points


Creating a Group Compact

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • How can we set up a new group research activity to have a positive experience for all members?

Objectives
  • Students learn about how to establish group expectations, dynamics, and communication

Background

The Group Compact is meant to help define expectations of the group relationship. It helps us think about goals, responsibilities, best practices, and will hopefully allow everyone to have a fulfilling and enjoyable time during their participation.

Activity: Develop a Group Compact

Work together to develop content in the suggested categories below. Sample questions to consider are provided to start the conversation. Consider this as simply a guide, because you will want to adapt the Group Compact framework to meet your needs.

Consider what you develop to be a living document, meaning we want input from everyone as we continue to work together to further define and refine how the education group is run.

Group Goals:

  • What goals are we trying to achieve with our participation in Skunkworks?

Group Communication:

  • How will we communicate with each other?
  • When/where will we meet as a group?
  • How can I ask for help? What is the best way to ask questions?

Research Training and Professional Development:

  • What types of training materials will be provided?
  • How are the students expected to interact with the training materials?

Group Member Responsibilities:

  • How will we work together?
  • What are the student’s responsibilities in group meetings and outside of meetings?
  • What roles will the group mentor fulfill? If there is both a group mentor and faculty mentor, how are their roles defined?

Work Hours/Attendance:

  • How many hours a week are students expected to devote to Skunkworks activities?
  • What are my responsibilities for participating in the group?

Conflict Resolution:

  • How can we minimize conflict or deal with it early before it becomes a big problem?
  • If our attempts to resolve conflict don’t work, what do I do next?

For an example of a group compact document see the Fall 2021 Education Group Compact.

Reference: W.C. Crone, Introduction to Engineering Research, Morgan & Claypool Publishers (2020).
(Note: This book is available as a free PDF download from the university library.)

Key Points

  • Establishing a group compact can enable group participants to engage with each other productively and positively.


Addressing Model Limitations

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • How do I properly assess my model’s performance and limitations when making new predictions?

Objectives
  • Students learn about the ethical limits of model applicability

Background

Responsible conduct of research requires that scientists and engineers uphold ethical standards in every aspect of their work. Although you are at the beginning of a research career at this point, the work you do will likely go on to have an impact on the field and the world around you. The long-term outcomes may have significant ramifications for products, structures, and people’s lives. As a young professional in your field, you must attend to ethical standards for the societal good and for the sake of yourself and your coworkers. Not only does a negligent individual place their own reputation at risk, but they can also damage the reputations of their colleagues and the broader research community.

In machine learning you will need to both understand and be forthright about the limits of your model. For instance, if a company developing machine learning for its self-driving cars only trained its model on highway driving datasets, you would not be very comfortable getting in that car and letting it drive you through city streets. Driving is occurring in both circumstances, but we know from our personal experience that these two types of driving are quite different from each other. So you would not want the company to claim that its car is capable of driving itself everywhere if its training set was limited to highway driving.

Another example to think about is the pricing of houses. If you were to develop a model predicting the sale price of houses in a medium sized city in the Midwestern U.S. (like Madison, WI), you would expect it would be quite an accurate model in that locale. If you used that same model on another Midwestern city with a large urban population (like Chicago, IL) then you would expect that model to be less predictive. You would expect even worse performance if you used this model on a city on the East Coast with an even larger urban population (like New York, NY). So even though your model is good in the vicinity of your training dataset, once you go outside that vicinity your model is no longer reliable.

Machine learning cannot be ethically used to make predictions outside of its training scope. Consider mammography databases that are used diagnostically for breast cancer in women. These databases usually contain mammograms from disproportionately more white women than black women. If a machine learning model were developed to predict breast cancer by training on such a database, it would not predict as equitably for all patients (Stewart, 2019). For this reason, recent research using machine learning to detect breast cancer is training on data from both white women and black women. This is particularly important given that black women are 42% more likely to die from breast cancer (Conner-Simons and Gordon, 2019).

As you consider the data sets that you are employing, take the time to think about the applicability of the model you are developing and its limitations.
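
One simple, admittedly crude, way to operationalize this check is to ask whether a new input falls inside the range of the training features. The sketch below is a hypothetical illustration, not a complete applicability-domain analysis.

    # Crude sketch: warn when a new input falls outside the training feature ranges.
    # Training data and the query point are made up for illustration.
    import numpy as np

    X_train = np.random.default_rng(0).normal(size=(100, 4))   # training features
    x_new = np.array([0.1, -0.3, 8.0, 0.2])                    # candidate input to predict on

    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    outside = (x_new < lo) | (x_new > hi)
    if outside.any():
        print("Warning: features outside the training range at columns:", np.where(outside)[0])
    else:
        print("Input lies within the per-feature training ranges.")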

Activity: Identify situations where models are used outside their training scope

Think about other situations in which models may be used outside of their training scope. Add an example below of a situation in which a model might be outside the scope of its training data. This may come from a personal experience, previous activities you’ve worked on, or an area of interest you’d like to work on in the future.

References:

Jeffrey Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, October 10, 2018. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

Adam Conner-Simons and Rachel Gordon, “Using AI to predict breast cancer and personalize care,” MIT News, May 7, 2019. https://news.mit.edu/2019/using-ai-predict-breast-cancer-and-personalize-care-0507

Matthew Stewart, “The Limitations of Machine Learning,” Medium, July 29, 2019. https://towardsdatascience.com/the-limitations-of-machine-learning-a00e0c3040c6

W.C. Crone, Introduction to Engineering Research, Morgan & Claypool Publishers (2020). (Note: This book is available as a free PDF download from the university library.)

Key Points

  • The data used to train an ML model can impart implicit bias into the model’s predictions

  • Considering how representative the training data is of the predictions we want to make can help us understand where it is appropriate to apply the model