Distributed machine learning using Apache Spark for large-scale classification of cancer tumor gene mutations

In Part I, I discussed Exploratory Data Analysis and applying Pointwise Mutual Information to mutation pairs to find out whether there was any correlation between PMI scores and mutation class similarities. In this part, I will discuss training a distributed multinomial logistic regression (MLR) model and applying it to the test dataset to determine the classes of mutations.

Again, I am sharing only certain parts of the code in this article for the sake of brevity. Please check out the GitHub link here for the full code. I have also prepared a 5-minute video that you can find here.

Before…


Distributed machine learning using Apache Spark for large-scale classification of cancer tumor gene mutations

In this two-part post, I will share what I learned from a term research project in a graduate course on Distributed Computing. I implemented a machine learning algorithm for classifying cancer tumor gene mutations using PySpark, the Python API for Apache Spark. In this part, I will describe Exploratory Data Analysis (EDA) along with a distributed implementation of Pointwise Mutual Information (PMI), a concept from Natural Language Processing (NLP).

I am sharing only certain parts of the code in this article as including the whole code…


Leveraging the EfficientNet architecture to achieve 99%+ prediction accuracy on a Medical Imaging Dataset pertaining to Covid-19

In this post, I will share my experience of developing a Convolutional Neural Network to predict Covid-19 from chest X-ray images with high accuracy. I developed this model while participating in an in-class Kaggle competition for a Ph.D.-level course on Deep Learning. I was happy to learn from the final competition leaderboard that I placed 1st among 40 participating Ph.D. and Master's students from the Statistics and Computer Science departments! Instead of moving straight to results, I will mention the…


Finding the optimal input parameters when the response value is unknown

In this post, I will describe my experience of finding the maximum value in a multivariate optimization problem where the response value was unknown. I worked on this problem for the final project of a graduate course in Experimental Design that I took at the University of Waterloo.

Context

Often, in scientific and engineering settings, and sometimes even in business settings, we come across quantitative problems where we know the input values and their ranges and are provided the corresponding output (response) values, but have no idea about the relationship between them. When it can be safely assumed…


Algorithmic Risk Prediction for Life Insurance Applications through supervised learning algorithms, by Bharat, Dylan, Leonie and Mingdao (Jack)

In Part 1, we described data pre-processing and dimensionality reduction for the Prudential Life Insurance Dataset. In this part, we will describe the learning algorithms that we applied to the transformed dataset and the results that we obtained.

The link to the project GitHub repository is here.

Algorithms

We used four supervised learning algorithms on the dataset: Logistic Regression, Neural Networks, Random Tree, and REPTree. There are, of course, many other algorithms that can be used, including XGB…


Algorithmic Risk Prediction for Life Insurance Applications through supervised learning algorithms, by Bharat, Dylan, Leonie and Mingdao (Jack)

In this two-part series, we will describe our experience of working on the Prudential Life Insurance Dataset to predict the risk of life insurance applications using supervised learning algorithms. We worked on this dataset as part of our final group project in a graduate course on Statistical Learning at the University of Waterloo. In the project, we reproduced the results of a paper¹ and improved upon the work of its authors.

The link to the project GitHub repository is here, and you can find the link to our YouTube video here.

Business Context

Companies that underwrite life…


In Part 1 (you can read it here), I discussed the Business Case for Predicting Visitor-to-Customer Conversion for an Online Store and covered Exploratory Data Analysis of the training dataset.

In this part, I will cover Data Preprocessing and the Application of Supervised Learning Algorithms, namely Random Forest and XGBoost, to the prepared training dataset.

So without further ado, let’s go to Data Preprocessing!

  1. Data Preprocessing

“As you sow, so shall you reap.” This proverb, so true of life in general, is also very much true of Data Science! …


In this two-part series, I will write about my experience working on a Kaggle Data Challenge (here’s the link) as part of a graduate course on Statistical Learning that I took at the University of Waterloo as a Master's student in Statistics.

In the first part, I will talk about the following two topics:

  1. Business Case behind this Data Science problem
  2. Exploratory Data Analysis

So, let’s dive straight into the problem:

  1. Business Case:

The owner of an online jewellery shop wants to figure out a way to increase her revenues. …

Bharat Sethuraman Sharman

Data Scientist, passionate about getting the story out of data
