Knowledge discovery from data

“Data is the next oil”, I heard someone say once.

People have used data in many fields for years to understand human behavior, market trends, weather and climate, and much more. However, only recently did the term ‘data’ become a trending topic. For months I kept seeing news about tech giants repeatedly assuring the public and governments that user data was being handled ethically. I assume that was when most of us started to realize just how powerful data can be.

During my final year at university, I took an elective on Data Mining that showed me how to use data as a resource for discovering new knowledge. Data by itself has no meaning; the context determines how useful information can be derived from it. The exciting part about unsupervised learning, where the data carries no predefined labels, is that we usually do not know what we will discover. That is why, for our course, we were tasked with exploring this process of finding meaningful information in a dataset using agglomerative hierarchical clustering.
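For readers who have not met the technique, here is a minimal sketch of what agglomerative hierarchical clustering does: every record starts as its own cluster, and the two closest clusters are merged repeatedly until the desired number remains. The function names, the choice of average linkage, and the toy points are all assumptions made for illustration; they are not taken from the report or from the absenteeism dataset.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Start with one cluster per point, then merge the closest pair
    of clusters until only n_clusters are left."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        # combinations() guarantees a < b, so deleting b keeps a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up 2-D points forming two obvious groups (not real dataset rows).
toy_points = [
    (1.0, 2.0), (1.2, 1.9), (0.9, 2.1),
    (8.0, 9.0), (8.2, 9.1), (7.9, 8.8),
]
toy_clusters = agglomerative(toy_points, 2)
print(toy_clusters)
```

This naive version recomputes every distance on each pass, so it only suits small examples. On real data the attributes would normally be scaled first so that no single column dominates the distance, and a library implementation would be the more practical choice.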

We were provided with a dataset on absenteeism at work from the UCI Machine Learning Repository, which can be found at the following link: UCI archive, Absenteeism at work dataset.

The following document contains my report, which describes how I analyzed the absenteeism data and the major findings that provide better insight into the recorded examples of employees.

Finally, there was one observation that our professor made us aware of during the course: bias can sometimes interfere with the process of knowledge discovery. Because humans end up determining what is useful and what is not, varying opinions and previously acquired knowledge can skew the interpretation of a data analysis. This became apparent as I tried to find combinations of attributes to visualize the data against, choosing scenarios that seemed rational to me. I only hope that as technology advances, intelligent algorithms learn from unbiased filters and processes that do not end up excluding important human factors.