People learning data analysis techniques often struggle to apply their knowledge or gain experience outside their job roles.
With this step-by-step guide, you can start your own personal data project that will keep you motivated to apply the results of your learning journey.
Hi, I'm Nazly, connect with me on LinkedIn to share ideas and keep the conversation going!
Projects are a great way to make sense of the complex steps and challenges involved in data science; the difference between the processes of data engineering, data analysis, machine learning and data visualisation is sometimes blurry. Executing some of these tasks on a concrete case can help you master the fundamentals.
However, even with this wealth of resources, I have seen some common problems:
Not knowing where to start or what to focus on, getting the feeling that there is still too much to learn, and not feeling ready to apply it.
Struggling to find motivation and keep working on the project, because the topic or the case is not relatable or relevant enough.
Not finding a good balance between challenge and scope: the project can be way too difficult, or the goal may not match your needs.
As I explain in the video, you can overcome these barriers by starting your own project and applying a set of steps that will help you practise and at the same time will make you feel comfortable with data analysis in a professional context.
As a data scientist, I have performed many different tasks across my professional roles, and at some point I have had to carry out each of these steps. Doing so helped me recognise my skills, better estimate the time I would spend on a project, and train and choose the roles in my teams.
I know that not every role or company allows for this type of hands-on experience. I've trained and coached hundreds of professionals looking to learn and apply data analytics techniques, and that's how I realised that you can still get a good understanding when you apply them to your own case, preparing you for the challenges of your next role.
Now let's dive in.
How to apply these steps in a personal project.
My project context:
I like to work out, but I am not the most disciplined about it. As a data-driven person, data motivates me. So it was natural for me to get a nice smartwatch to collect data from my workouts.
Image generated with Bing Create
All the data collected is nicely gathered and displayed in my Polar Flow app and webpage account. The interface is very organised and has graphs, reports, daily overviews and many other things in one place.
Concerned about my data rights, I checked whether I owned that data. As it turns out, I do, and I was able to download it 😍.
Despite Polar Flow being very useful, I sometimes find it too crowded, and I was limited by the filters and options of the interface. That's when I decided to try something out.
Stop for a moment and think about something you would like to work on.
Consider your day-to-day activities, hobbies, sports, work or your current learning journey: anything that generates data you care about.
Other projects I considered:
Understanding my grocery shopping behaviour: scanning my receipts and creating a structured dataset from the text.
Tracking the currency exchange rate, Colombian peso vs Euro: using public data from the web for the past years.
Analysing my energy consumption: using the Eneco app to see where I can save money on heating.
Getting the data
We generate more data than we think. And I am not only talking about digital information. Imagine we count and register every time we open the fridge in a day.
Getting the data to see my fridge-opening behaviour would cost me a little effort: manually noting in a notebook the time, the reason, and which item I put in or took out of the fridge.
To use this data in the project, we need to store it in a digital medium; collecting it in an Excel or Google Sheets file is a better idea.
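Digitising such a log can be as simple as appending rows to a CSV file. A minimal Python sketch, where the file name and column names are my own invented schema:

```python
import csv
import os
from datetime import datetime

LOG_FILE = "fridge_log.csv"                         # hypothetical file name
FIELDS = ["timestamp", "action", "item", "reason"]  # invented schema

def log_fridge_event(action, item="", reason=""):
    """Append one fridge-opening event to the CSV log."""
    # Write the header only when the file is new or empty.
    new_file = not os.path.exists(LOG_FILE) or os.path.getsize(LOG_FILE) == 0
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().strftime("%Y/%m/%d %H:%M"),
            "action": action,
            "item": item,
            "reason": reason,
        })

log_fridge_event("take", item="milk", reason="breakfast")
log_fridge_event("look", reason="bored")
```

A spreadsheet works just as well; the point is that every event ends up as one consistent row.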
In this step, we can generally think of:
Collecting, registering and digitising data
Downloading data
Moving data from one place to another
In a professional setup, this step translates to storing information in a database. Normally, different sources of information are involved, like semi-structured spreadsheet files, information scraped from the web, or other manually or automatically extracted data.
Let's go back to my personal project. After downloading the data, I get a zip folder containing different files in JSON format, a common data format used in web applications. This means that I can open each file with a text editor like Notepad or Sublime, each file corresponding to the data from one training session. However, it also means that I won't be able to see all this data together unless I perform other steps.
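As a sketch of reading such an export with Python: the snippet below builds a tiny stand-in zip in memory so it is runnable end to end. The file names and fields are invented for illustration; a real Polar export will differ.

```python
import io
import json
import zipfile

# Build a tiny stand-in for the export so the sketch is runnable.
# With a real export you would open the downloaded zip file instead,
# e.g. zipfile.ZipFile("polar-user-data.zip").
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("training-session-1.json",
                json.dumps({"sport": "RUNNING", "duration_min": 42}))
    zf.writestr("training-session-2.json",
                json.dumps({"sport": "STRENGTH", "duration_min": 30}))

# One JSON file per training session: load them all into a single list.
sessions = []
with zipfile.ZipFile(buffer) as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            sessions.append(json.load(f))

print(len(sessions), "sessions loaded")
```

Once the sessions live in one list, the "seeing it all together" problem from above disappears.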
Making data usable:
In my previous 'opening the fridge' example, we register the data in the format that we want. We understand what is being registered, it is in tabular format (rows and columns), and we can keep it consistent: we store the date as YYYY/MM/DD and we have preselected actions.
However, this is not normally the case.
In this step, we deal with:
Separating and selecting specific data
Changing the structure and storage format
Manipulating and standardising fields
In data teams, this data manipulation is done by data engineers. The tasks can be diverse depending on the product and goals the data will serve: structuring data coming from, for instance, web applications, contact forms or other databases.
The JSON files I got from Polar come with an attribute-value structure. Reading a file, I can get an understanding of what I see; however, it is not very useful if I want to do something with the data.
Some data manipulation is needed. This project is also a good opportunity for me to deal with files that are a little more complex than CSV or Excel files.
For this task, I also decided to refresh my Python skills and use Google Colab.
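One typical manipulation here is flattening a nested attribute-value structure into flat rows and columns. A minimal sketch; the nested record below is invented, not the real Polar schema:

```python
import json

# Invented example of a nested attribute-value record, similar in
# spirit to what a training-session JSON file might contain.
raw = json.loads("""
{
  "sport": "RUNNING",
  "duration": {"minutes": 42},
  "heart_rate": {"avg": 148, "max": 171}
}
""")

def flatten(record, prefix=""):
    """Turn nested dicts into a single flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

row = flatten(raw)
print(row)
# {'sport': 'RUNNING', 'duration.minutes': 42,
#  'heart_rate.avg': 148, 'heart_rate.max': 171}
```

Each session becomes one flat row, so the whole export can then be stacked into a single table.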
Explore the data:
Here is where the data analysis starts. The idea is to get an understanding of the data: group fields, count the values, compare and visualise, giving context and extracting information.
In the fridge behaviour data, we can create a graph counting the values of the different actions. After some days of data, we can find that we are opening the fridge to look inside as often as we do to put things in.
We can start to see patterns and understand the data we have.
In general, we want to:
Aggregate columns to get counts, averages, maximum and minimum values
Create quick graphs to see how the data behaves
Identify the type of values in each column
In an ideal world, data is clean and ready to be used. The data analyst explores it using visualisation tools like Tableau or Looker, aggregations and queries in SQL, quick observations in no-code tools like KNIME, and spreadsheets for easy handling of data when possible. This step is for the analyst to understand the data.
In my Polar project, I already had a good understanding of the data available thanks to the visuals Polar Flow was giving me. I used this step to get a better grasp of the information: the count of workouts per week, the average duration of a workout, the maximum calories burnt in a day when I did both cardio and strength training.
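These aggregates need only a few lines of Python. A small sketch with made-up workout records; the field names are my own, not Polar's:

```python
from collections import Counter
from datetime import date
from statistics import mean

# Made-up workout records standing in for the real Polar data.
workouts = [
    {"date": date(2023, 5, 1), "duration_min": 45, "calories": 380},
    {"date": date(2023, 5, 3), "duration_min": 60, "calories": 520},
    {"date": date(2023, 5, 8), "duration_min": 30, "calories": 250},
]

# Count of workouts per ISO week.
per_week = Counter(w["date"].isocalendar()[1] for w in workouts)

# Average duration and maximum calories burnt in a session.
avg_duration = mean(w["duration_min"] for w in workouts)
max_calories = max(w["calories"] for w in workouts)

print(dict(per_week))   # {18: 2, 19: 1}
print(avg_duration)     # 45
print(max_calories)     # 520
```

In practice I would plot these numbers too; a quick bar chart of `per_week` already shows the workout rhythm.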
Ask questions to the data:
This is the task I love the most about data science. It is where technical skills and rigour meet creativity. We want to ask questions out of curiosity. Nothing is obvious, and we are guided by the knowledge we gained from the data in the exploration step.
Let's think about that fridge. How many times do I open the fridge in a day? Easy, I can just count the records I have and group them per day.
But do I open the fridge every time I go to the supermarket? Do I open it when I cook? When do I open the fridge just out of boredom, to look inside? How often does this happen? Is this normal behaviour or something atypical? To answer these questions, I need to be more inquisitive and manipulate the data to get answers.
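The first question really is a simple grouped count. A minimal sketch with invented timestamps, using the YYYY/MM/DD format from the fridge log:

```python
from collections import Counter

# Invented fridge-log events: (timestamp, action).
events = [
    ("2023/05/01 07:30", "take"),
    ("2023/05/01 12:10", "look"),
    ("2023/05/01 19:45", "put"),
    ("2023/05/02 08:00", "take"),
]

# How many times do I open the fridge per day?
# Group by the date part of the timestamp and count.
opens_per_day = Counter(ts.split(" ")[0] for ts, _ in events)
print(dict(opens_per_day))  # {'2023/05/01': 3, '2023/05/02': 1}

# When do I open the fridge just to look inside?
look_events = [ts for ts, action in events if action == "look"]
print(look_events)  # ['2023/05/01 12:10']
```

The harder questions need the same pattern plus extra context, such as joining in the shopping list or the cooking schedule.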
Normally in this step we:
Prioritise what to ask vs how difficult it is to compute the answers
Integrate other sources of information, e.g. my shopping list
Contrast the data with different references, e.g. day, month, season
In a professional context, this step is where business understanding is crucial. There will be questions from different stakeholders, and communication and exchange of expertise are key to answering them and keeping the analysis relevant. A project manager is also very important to keep efforts focused and the scope limited.
In my project, even though the Polar Flow app is very interesting, I wanted to link that information to my own questions. The answers are probably already there in those big dashboards, but I won't look at them until I have a concrete interest. I have simple questions:
How often do I work out? What percentage of my week is spent on physical activity? How is my average heart rate changing each week? Ultimately, I want to link these questions to improving my habits and health.
Use the answers:
This is an overlooked step but a crucial one; it is where many data science projects fall short. You go through the effort of all the previous steps in order to take action. The purpose of the project is to make data-informed decisions.
My fridge-opening project is very simple, but it could evolve into something that adds value. I could use it to see if I am buying the same items too often and optimise my shopping list. Maybe I am also overstocking and opening the fridge to throw away expired food. I can link my questions to a strategy that could save me money, optimise my time or stop me from going to the fridge when I am bored.
This step can be used for:
Test hypotheses and run experiments
Perform an action and see the impact on the data
Get inspired to ask further questions
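As a sketch of "perform an action and see the impact on the data": compare a metric before and after a change. The numbers below are invented:

```python
from statistics import mean

# Invented weekly fridge-open counts, before and after
# changing the shopping routine.
before = [14, 16, 15, 17]
after = [11, 10, 12, 11]

# A simple before/after comparison of the average.
change = mean(after) - mean(before)
print(f"average weekly opens changed by {change:+.1f}")  # -4.5
```

A real experiment would need more care (enough weeks, a stable routine otherwise), but even this crude comparison tells you whether the action moved the metric.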
In business, the first iterations of the data analytics effort give us a baseline that makes sense of the metrics and data we obtain. This is the step where stakeholders can share their interpretation of and input about certain behaviours; this knowledge feeds further iterations. By using the results of the analysis, other projects are fuelled and a data-driven culture is reinforced.
The first iteration of my Polar data analysis resulted in a very simple board with metrics that helped me stay consistent with my goal. I wanted to keep working out at least 3 days per week and burn at least 600 calories. If the colour of the numbers turned red, I knew I didn't do well that week.
Keeping this data updated also challenges my skills and encourages me to learn new tools and methods; in data science, constant learning and updating is part of the job. At first, my always-useful Google Sheet was enough. When I wanted to automate the process, I had to switch to Python.
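The weekly red/green check can be sketched in a few lines, assuming the calorie goal is weekly; the session data below is made up:

```python
from collections import defaultdict
from datetime import date

GOAL_WORKOUTS = 3    # per week
GOAL_CALORIES = 600  # per week (my reading of the goal)

# Made-up sessions standing in for the Polar export.
sessions = [
    {"date": date(2023, 5, 1), "calories": 380},
    {"date": date(2023, 5, 3), "calories": 250},
    {"date": date(2023, 5, 5), "calories": 300},
    {"date": date(2023, 5, 9), "calories": 520},
]

# Sum workouts and calories per ISO week.
weeks = defaultdict(lambda: {"workouts": 0, "calories": 0})
for s in sessions:
    week = s["date"].isocalendar()[1]
    weeks[week]["workouts"] += 1
    weeks[week]["calories"] += s["calories"]

# Flag each week green or red against the goals.
for week, totals in sorted(weeks.items()):
    ok = (totals["workouts"] >= GOAL_WORKOUTS
          and totals["calories"] >= GOAL_CALORIES)
    status = "green" if ok else "red"
    print(f"week {week}: {totals['workouts']} workouts, "
          f"{totals['calories']} kcal -> {status}")
```

Rerunning this on each new export is the automation step: the script replaces the manual colouring of cells in the sheet.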
Getting hands-on with your own project is an excellent way to apply the data techniques you learn: you get a challenge while seeing the value of data analysis at the same time.
The practice you obtain translates to professional contexts. Each business project will have its particularities, but you will be equipped to see in which step you are involved and which tasks are expected.
The personal project doesn't need to be fixed. You can enrich it or change direction, just like in a business setup. The idea is to keep applying further knowledge, for instance, advanced analytics and machine learning techniques.
Did you get an idea of a project you want to start but still have questions about how to apply the different steps?
If you have a Growth Tribe membership, use the Ask an Expert feature to contact me.