This was the second project for IE332. The goal was to build the best machine learning model for a binary classification task: given training data, classify examples as positive or negative. The training data had missing values and far more negatives than positives, we were only allowed to select a limited number of training samples, and false positives carried a heavier penalty than false negatives. Overall, this project was harder for me to understand at first, but by the end I understood everything.
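The asymmetric penalty can be captured with a simple cost function. This is a hedged sketch: the specific costs (`fp_cost=5.0`, `fn_cost=1.0`) and the function name are assumptions for illustration, not the project's actual values.

```python
def misclassification_cost(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    """Total cost of predictions when false positives are penalized
    more heavily than false negatives (costs are illustrative)."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 0:      # false positive
            cost += fp_cost
        elif p == 0 and t == 1:    # false negative
            cost += fn_cost
    return cost

# Two false positives and two false negatives: 2*5.0 + 2*1.0 = 12.0
print(misclassification_cost([0, 0, 1, 1], [1, 1, 0, 0]))  # 12.0
```

Under a scheme like this, a model that predicts positive too eagerly scores worse than one that misses some positives, which matches the project's penalty structure.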
The first step was data selection. Since it was unclear which selection method would work best, we developed two. Mine worked as follows: first select the rows with no NAs, train a logistic regression on them, retrieve the weights of the columns, and then rank the remaining rows by their number of NAs, weighted by the importance of the columns those NAs fall in. I implemented this in code, though ultimately my method wasn't the one chosen for data selection.
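The ranking idea can be sketched as follows. This is an illustrative reconstruction, not the actual project code: `None` stands in for NA, and the column weights are made-up numbers standing in for the absolute logistic regression coefficients.

```python
def rank_rows(rows, col_weights, budget):
    """Score each row by the summed importance of its missing columns,
    then return the indices of the `budget` lowest-scoring rows
    (i.e. the rows whose NAs matter least)."""
    def score(row):
        return sum(w for v, w in zip(row, col_weights) if v is None)
    ranked = sorted(range(len(rows)), key=lambda i: score(rows[i]))
    return ranked[:budget]

rows = [
    [1.0, None, 3.0],   # missing one low-weight column  -> score 0.1
    [None, 2.0, None],  # missing two columns            -> score 1.2
    [1.0, 2.0, 3.0],    # complete                       -> score 0.0
]
col_weights = [0.9, 0.1, 0.3]  # assumed |coefficients| from logistic regression
print(rank_rows(rows, col_weights, budget=2))  # [2, 0]
```

The complete row is selected first, and a row missing only an unimportant column beats one missing a high-weight column, which is the intent of weighting the NA count.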
Our team tested various ML algorithms; I was in charge of logistic regression. Once we had all coded our algorithms, I wrote a tester to measure the overall accuracy of each one. Because I got to work with everyone's algorithms, I gained a good understanding of how each of them worked.
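A tester in that spirit might look like this. The model functions below are toy stand-ins, not the team's actual algorithms, and the interface (each model as a `predict(X)` function) is an assumption for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_models(models, X_test, y_test):
    """Run each candidate model over a held-out set and report accuracy.
    `models` maps a name to a predict(X) -> list-of-labels function."""
    return {name: accuracy(y_test, predict(X_test))
            for name, predict in models.items()}

# Toy stand-ins for the team's algorithms.
always_negative = lambda X: [0] * len(X)
threshold = lambda X: [1 if x[0] > 0.5 else 0 for x in X]

X_test = [[0.9], [0.2], [0.7], [0.1]]
y_test = [1, 0, 1, 0]
print(compare_models({"baseline": always_negative, "threshold": threshold},
                     X_test, y_test))  # {'baseline': 0.5, 'threshold': 1.0}
```

Collecting every algorithm behind one common interface like this is what makes a single tester able to score them all uniformly.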
Since I had a solid understanding of the project after all the testing, I combined all of the code into one file. The final version contained everyone's data selection methods, the various algorithms we tested across different data partitions, an analysis of each trial, and the predicted results.