⌨ Project Details

Clinical RAG Chatbot | Python, FastAPI | Code

➢ Demo: https://www.youtube.com/watch?v=SwiBSho1-KY

Uganda Clinical RAG Chatbot Demo for Harvard Health Data Science Capstone 2024 Fall


Dashers - Daily Meal Assistant App | MLOps, Docker, k8s | Blog

➢ Architecture:

Mortality Rate in the US - Streamlit App | Python | Code

➢ Play this app: https://cirrhosisappapppy-nxgcuhjwvfpte6haw7gzby.streamlit.app/

Pneumonia Chest X-ray Classification | TensorFlow | Code & Report (Kaggle Competition)

➢ Data augmentation uses the ImageDataGenerator class from the Keras library, configured with 9 parameters covering five techniques: vertical shift, zoom, horizontal flip, brightness adjustment, and shear. These techniques were chosen for the chest X-ray classification problem (Pneumonia vs. Normal) because they introduce variation into the training data, which helps the model generalize and reduces overfitting. Specifically, vertical shift and zoom account for variations in the positioning and scaling of the chest X-ray images. Horizontal flipping helps the model become invariant to left-right orientation. Brightness adjustment simulates varying exposure conditions, and shear accounts for minor rotations or distortions in the images.

➢ Two model architectures are implemented. The first is a simple CNN built from scratch with randomly initialized weights, with an architecture inspired by VGG16. The second is a transfer learning model based on the EfficientNetV2S architecture with weights pre-trained on the ImageNet dataset. Transfer learning is used because pre-trained models have already learned rich feature representations from a large dataset, which is especially useful when working with a relatively small dataset. By using the pre-trained model as a feature extractor and fine-tuning the final layers, the knowledge gained from the larger dataset can be leveraged to potentially achieve better performance with fewer training samples.

➢ An accuracy of at least 80% was expected on the hidden test set; the final leaderboard showed about 90%. The high test accuracy can likely be attributed to the data augmentation, the use of transfer learning with a powerful pre-trained model, and a model architecture suited to the task. The augmentation methods introduced more variation into the training data, while transfer learning allowed the model to benefit from features learned on a larger, more general dataset. Additionally, the EfficientNetV2S architecture, which was designed for strong performance on computer vision tasks with fewer computational resources, likely contributed to the results on the test set.
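To make the augmentation step described above concrete, here is a minimal sketch of three of the transforms (horizontal flip, vertical shift, brightness adjustment) on a tiny grayscale image. The project itself used Keras' ImageDataGenerator; the plain-Python helpers below (`horizontal_flip`, `vertical_shift`, `adjust_brightness`, `random_augment`) and their parameter ranges are hypothetical stand-ins that only illustrate what each transform does to the pixels.

```python
import random

def horizontal_flip(img):
    """Mirror every row left to right."""
    return [row[::-1] for row in img]

def vertical_shift(img, dy, fill=0.0):
    """Shift rows down by dy (up if dy is negative), padding with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        src = r - dy
        if 0 <= src < h:
            out[r] = list(img[src])
    return out

def adjust_brightness(img, factor):
    """Scale pixel intensities and clip them to the [0, 1] range."""
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in img]

def random_augment(img, rng, max_shift=1, brightness=(0.8, 1.2)):
    """Apply a random combination of the three transforms (illustrative ranges)."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    img = vertical_shift(img, rng.randint(-max_shift, max_shift))
    return adjust_brightness(img, rng.uniform(*brightness))

# Toy 3x3 "image" with intensities in [0, 1]; not real X-ray data.
image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6],
         [0.9, 0.1, 0.3]]
augmented = random_augment(image, random.Random(0))
```

Each call to `random_augment` yields a slightly different version of the same image, which is the source of the extra training variation.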

Python Package Development | Python | Code (Group project)

➢ Package demo (Jupyter notebook): https://drive.google.com/file/d/11KaG1ls_La_XVNucvdqjrJ9Y2v5nJ-Gm/view?usp=sharing

Mobile Health Activity Classifier | Python | Code (Group project)

➢ Developed a robust classification model using the MHEALTH dataset. The model uses data from motion sensors and vital signs collected at three body locations: the chest, right wrist, and left ankle. Acceleration patterns, rotation rates, and magnetic field orientations are also used to help determine each activity (120 instances, 14 variables, and 12 physical activities in total).

Pediatric Sleep Patterns Detection from Wrist Activity Using Random Forests | Python+R | Code + Web (Self-guided)

Affiliation: Department of Biostatistics, Harvard University - T.H. Chan School of Public Health

➢ Trained a Random Forest sleep pattern detection model on wrist-worn accelerometer data, with accuracy, recall, and specificity all above 98% and an out-of-bag error rate of 0.0173

➢ Identified the top three most influential features (hour of the day, enmo, and anglez), indicating the significance of movement intensity, orientation, and time in determining sleep states
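For readers unfamiliar with the three metrics reported above, the sketch below shows how accuracy, recall, and specificity follow from a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and are not the project's results; the out-of-bag error is not covered here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all epochs classified correctly
        "recall": tp / (tp + fn),        # share of true "asleep" epochs detected
        "specificity": tn / (tn + fp),   # share of true "awake" epochs detected
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```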

Multi-source Transfer Learning Models: NST and Conditional CycleGAN | Python | Paper (co-authored with my UofT collaborators)

Affiliation: Department of Computer Science, University of Toronto

Paper quick look (not yet uploaded to arXiv)

Semantic Segmentation in Autonomous Driving | PyTorch | Code (Self-guided)

Advisor: Zheng Wu, Department of Mechanical Engineering, University of California, Berkeley

➢ Labeled the road pixels in cityscape images using DeepLab and Fully Convolutional Network (FCN) models

➢ Used VGG as the pre-trained model and achieved an mIoU of 50%
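The mIoU figure above is the mean, over classes, of intersection over union between predicted and ground-truth pixel labels. A minimal sketch of that calculation on flattened label lists follows; the function name `mean_iou` and the toy labels are illustrative assumptions, not the project's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat lists of class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: 0 = background, 1 = road (six "pixels").
target = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has 2 overlapping pixels out of a union of 4, so both per-class IoUs and their mean are 0.5.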

Gratitude to Strangers | Figma | Designer of interactive computational media | Demo

Advisor: Fanny Chevalier, Department of Computer Science, University of Toronto

➢ Designed the user interface for our app Samaritan, whose mission is to enable real-time anonymous feedback on acts of kindness

➢ Determined the effective sample size and conducted user research with k-means clustering for target segmentation

➢ Drew paper sketches for the user interface and designed interactive prototypes using Figma

Charity Online System Platform Construction | SQL/Python+Django | Demo Code

Advisor: Changjiang Zhang, Department of Data Science, BNU-Hong Kong Baptist University UIC

➢ Used Python to crawl and collect charity information from official charity websites

➢ Cleaned unstructured data, conducted exploratory data analysis, and stored the results in MySQL

➢ Used Django as the framework connecting the front-end interface to the back-end database

➢ Supplemented the platform with visualizations such as customized donation maps

Productivity Calendar App | Java, Shell | Code

➢ An application for scheduling events and tasks and reminding users of them, for either personal or company use. Tasks and events can span a timeframe or occur at a single time, be auto-generated every day, week, or month, be given descriptions, be displayed as a checklist or on a calendar, and be given thematic labels for filtering. Tasks can be divided into subtasks, given progress labels, and have comments added. The history of a task can be tracked, and reminders about upcoming tasks can be auto-generated. Users can also share their calendars with other users, who access a calendar through its calendarID and can be given different permissions for it.

Computer Graphics Practices | C++/OpenGL | Code

Advisor: Karan Sher Singh, Department of Computer Science, University of Toronto

➢ Raster Images, Ray Casting, Ray Tracing, Bounding Volume Hierarchy, Meshes, Transformation, Shader Pipeline, Kinematics, Mass-Spring Systems

Education agents — who are my customers? | Python | Demo (Self-guided)

➢ Took the role of an education agent who helps undergraduates apply for overseas postgraduate studies, with the aim of better understanding the customers

➢ Collected data from undergraduate students, including demographic data and preferences concerning overseas postgraduate studies, and used the data to identify different segments of undergraduates

➢ Implemented a k-means clustering tool for market segmentation and the study of consumer behaviour

➢ Chose the Davies-Bouldin index (DBI) to select the number of clusters k and Mixed Euclidean Distance as the nearest-neighbour measure, and identified the top three most influential factors
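As a rough illustration of how the Davies-Bouldin index mentioned above scores a clustering (lower is better, so the k with the lowest DBI is preferred), here is a minimal sketch. It uses plain Euclidean distance rather than the Mixed Euclidean Distance used in the project, and the function names and toy points are assumptions made for the example.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

# Toy data: the same four points grouped well and grouped badly.
tight = [[[0, 0], [0, 1]], [[10, 0], [10, 1]]]
loose = [[[0, 0], [10, 0]], [[0, 1], [10, 1]]]
dbi_tight = davies_bouldin(tight)
dbi_loose = davies_bouldin(loose)
```

Compact, well-separated clusters give a small index, while clusters that overlap give a large one, so comparing the index across candidate values of k is one way to pick k.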

Construction and Analysis of Music Genre Maps | Python/MATLAB/R | Sample (Self-guided)

➢ Used clustering models to determine whether artists within a genre are more similar to each other than to artists in other genres, and processed the genres’ high-dimensional feature data with the PCA-CS/PCA-model. Visualized similarities between pairs of genres on a heat map, and found that among all music features, danceability and energy are the most "contagious" and correlate more strongly with popularity

➢ Used a Box-Jenkins multivariate time series model and found that average music feature values fluctuated greatly between the 1920s and 1960s, particularly danceability and valence. Analyzed the fluctuation alongside the emergence of blues and the rise of popular genres such as rock and jazz, deduced the key impact factors for followers, and constructed a graph showing the intertwined relationships of various musicians

Improved Multiple Regression Analysis and Prediction | R | Code

➢ Examined the house price dataset and selected the best-performing multiple regression model, which explains 86% of the variation in the data after appropriate transformations and outlier handling; eliminated seven candidate models using two different methods. Concluded that buyers prefer houses near rivers, far from factories, with lower nitrogen oxide emissions and more rooms, and that houses meeting these criteria sell at higher prices

Monte Carlo Simulation - Maximum Likelihood Method | R | PreSlides

➢ Contribution: background knowledge, data analysis, and simulation. Analyzed admission status by applying maximum likelihood estimation and the sample method, used the Monte Carlo method to generate the distribution assumption from the analytical results, and performed a distribution test and comparison. Found that the maximum likelihood method gave more reliable estimators than the sample method, and that estimators based on a large sample were more reliable than those based on a small sample

➢ Concluded that universities show signs of preferring applicants with an undergraduate CGPA around 86%, and that admission officers value a student's comprehensive ability rather than focusing solely on CGPA
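The large-sample versus small-sample finding above can be illustrated with a short Monte Carlo experiment. The sketch below assumes admission status is modelled as a Bernoulli variable with a hypothetical admission rate of 0.3, for which the maximum likelihood estimate is the observed proportion; the project's actual model and its "sample method" are not reproduced here.

```python
import random
import statistics

def bernoulli_mle(sample):
    """MLE of the success probability: the observed proportion of successes."""
    return sum(sample) / len(sample)

def simulate_estimates(p, n, trials, rng):
    """Draw `trials` samples of size n and return the MLE from each one."""
    return [bernoulli_mle([1 if rng.random() < p else 0 for _ in range(n)])
            for _ in range(trials)]

rng = random.Random(42)
TRUE_P = 0.3  # hypothetical admission rate, chosen only for illustration

small = simulate_estimates(TRUE_P, n=20, trials=2000, rng=rng)
large = simulate_estimates(TRUE_P, n=500, trials=2000, rng=rng)

# Spread of the estimator across repeated samples: smaller means more reliable.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
```

Comparing `spread_small` with `spread_large` shows how much more tightly the estimates cluster around the true value when each sample is larger.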

Factors Impacting Student Performance on Course Assessment During the Pandemic | R | Code

➢ Identified the best-fitting multiple linear regression (MLR) model with interaction terms; concluded that COVID is not a key factor, whereas weekly study time and its interaction with office-hour attendance are

Causal Relationship Between Countries' GDP per Capita and Education Index Value | R/Python | Draft

Acknowledgement to the UofT Political Science department

➢ Collected data from the Human Development Index (HDI) reports published by the United Nations Development Programme (UNDP) for the years 1990 to 2017

➢ Applied causal theory to study the relationship between countries’ GDP per capita and its determinant, the education index value, with four control variables: median age, inequality-adjusted education index, unemployment rate, and gender development index

Toronto Movie Release Dates Vary with the Holiday Schedule | R/Python | Visualization Code

Toronto Bicycle Theft | R/Python | Data Storytelling & Visualization Code

Differences in Education Systems Worldwide | Tableau/HTML5/CSS/JavaScript | Weblog