4 Mistakes You Can Avoid for Exploratory Data Analysis

Shenghong Zhong
10 min read · Jun 18, 2021

“The teacher learns more than the student. The author learns more than the reader. The speaker learns more than the attendee. The way to learn is by doing.”

This week I completed project002, an exploratory data analysis of the UK housing market over 12 years. In case you’re interested, you can read my post: 10 Insights You Should Know About the UK Housing Market. Wait, you don’t have time? You can watch my presentation on YouTube.

What happened behind the scenes of my project? I picked up the term “behind the scenes” at a workshop I attended 2 years ago, called “Nail Your Personal Brand”. I booked it without thinking twice.

On the day, the instructor looked shocked when I entered the room. I didn’t think anything of it until I noticed I was surrounded only by women. More tellingly, the content and discussion were tailored for women. I pretended to be casually concentrating and quickly glanced at the workshop description on my phone.

Fantastic. The workshop was for female entrepreneurs, specifically.

However, I learned a lot of principles in the workshop. One of them is that summarizing your mistakes captures invaluable experience for your future self. So I’d love to talk about 4 mistakes I made while doing my EDA project.

Mistake 1: The Thought — “I spent 105 hours”

You might wonder why this is a mistake. Before my explanation, I’d encourage you to ask yourself:

  • Have I always been reading books/articles/blog posts about how to save time, organize time, and squeeze more out of time?
  • Have I always looked for time-management techniques to maximize productivity?
  • Have I always bought courses/workshops/videos by new gurus appearing on the Internet promising “the solutions”?

I’m not sure if that’s you, but it was me years ago. I read somewhere that “time is a finite resource; time is money.” Hence, I felt I must maximize every spare moment of productive capacity. I even hired a productivity guru to help me out. With tons of money spent, I did get better at it. I have my own system to organize time and files. I can find any file in 10 minutes. But I was missing one thing that could possibly have destroyed my career.

I often had no idea how to estimate time. Seconds slipping away made me nervous. I kept searching for good methods to estimate the time I spent. I constantly asked myself:

  • Is this really the best way to spend my time?
  • Isn’t there something more productive that I could be doing?
  • Shouldn’t I check my e-mail, review task lists, or watch online courses?

It felt good when I ticked off tasks because I’d spent my time in the right place. It felt bad when I wrongly predicted the time a task would take, because losing the game wasn’t acceptable. Whenever I failed to do what was needed, I accumulated debts that I’d never pay off. More and more, I unwittingly became a slave to tasks.

The lightbulb moment came when I realized that what drove this insecurity was a hidden premise: time is a finite resource. It’s tricky to recognize that this assumption is only a statement, and one that sounds factual while mixing fact with opinion.

Time is a perpetual motion machine

No matter what happens, time is like a machine, a perpetual motion machine. However well you manage it, squeeze it, or save it, time keeps slipping out of your hands. But the false belief led me into another mode: the slot machine mode.

The Slot Machine Addiction

We put money in a slot machine hoping that the next pull of the lever will pay off. The more coins we gamble, the closer we think we’re getting to a big win, but in fact the probabilities of the game remain the same. Whether it’s 100 pulls, 1,000 pulls, or infinitely many, the odds on each pull don’t change, and we will nearly always lose our money. Of course, some payoffs may occur, just like the small jackpots that keep you pumping coins into the machine, but this kind of game only drains my energy bank account over time.

On the other hand, I started to understand that I should shift to an investor’s mindset. An investment mindset is focused on the long term. I know that I won’t see an immediate return on my investment, but I believe I will see significant returns over time if I work the plan. Take this article you’re reading: it’s my investment for the future.

Your superhuman productivity and time management system aren’t your priorities.

Your actions are your real priorities.

Mistake 2: Vague Goals Lead to Linear Work

Linear work has another, fancier name: the Waterfall Model. Some may know the difference between a waterfall workflow and an agile workflow. Linear work refers to a project divided into sequential phases and activities, on the premise that everything in one phase needs to be done properly before the next one begins.

Why is this approach a poor fit for my case? In Python, the tools for data visualization include Matplotlib, Seaborn, Plotly, ggplot, Bokeh, pygal, Folium and more. The list goes on and on. Some would think, “Okay, I have to learn these packages properly so that I can complete projects.” It’s exciting at the beginning, and they’re super motivated by all the cool stuff they’ll be able to do after learning properly.

Our Changing Motivation

Next, many go down the rabbit hole of Matplotlib: colours, the best way to write code, figure sizes, etc. In the middle, worse outcomes may happen, such as giving up or burning out. As the deadline approaches, they start working on the project again under pressure, cramming until things get done.

What a relief, right? At least the project is done. That’s okay if this is your first time. That’s fine if it’s for self-learning, as no one evaluates your work. But the habit you unconsciously created may affect your wellbeing in your future work.

Let’s look at the goal, “I want to learn X properly.”

I’d ask a question: what does “properly” mean? You may suddenly find it difficult to spell out specific activities and details. That feeling is right, because many fall into the trap of being perfectionist and too vague.

A friend of mine always stayed up late and worked overtime on projects. He thought that whatever the project was, he could finish the tasks eventually, even if he needed to work on the weekend. The only catch was that he wanted everything to be perfect.

Okay, that’s fair. I asked him to explain how he planned. “I’d finish this first, and then complete that second.” What he said sounded right, but in fact he always looked through every website to gather all the information before the first draft, often unclear about what he wanted except for it to be perfect. In his mind, perfection could be achieved with enough time spent.

It’s not that being perfect is bad. However, how long it would take remains, I’m afraid, an open question. Furthermore, what if perfect is a moving target? What if you’re never able to reach perfection within a short time, considering no one pushes you but you?

It’s human nature to announce vague goals. How have you been doing with your New Year’s resolutions (if you made any)?

A limit, in the mathematical sense, is like an unattainable goal. You get closer and closer to it, but you can never get all the way there. Perfect is the limit. Getting a project perfectly done without a time bound is a vague goal.

Vague goals are where linear work starts.

  • Linear work tricks your mind into thinking that you can spend as much time as you like without considering the risk of burning out.
  • Linear work tricks your mind into feeling comfortable with the thought, “If I learn it properly, then I’m unstoppable.”
  • Linear work tricks your mind into thinking you’re safe as long as you have fancy features, and that the cool stuff is going to blow people’s minds.

All in all, the value of our work to companies is not measured by how much time we spend but by what we create. We can do better by working iteratively.

Mistake 3: Being Unclear About Must-Have and Good-to-Have

“Too often, we fall into an all-or-nothing cycle with our habits. The problem is not slipping up; the problem is thinking that if you can’t do something perfectly, then you shouldn’t do it at all.”

On the other hand, our brains like to be “lazy” because, as science explains, the brain wants to save energy. However, this doesn’t mean we can’t think strategically. There is a better way to approach a project: plan for the bare requirements for completion.

“Planning and preparation are useful until they become a form of procrastination. Is this task enhancing my actions or substituting for them?”

How do we thrive? We use tools. Humans have used tools to survive for thousands of years. Learning to use fire is perhaps the best example.

The tool here is a pair of columns, Must-Have and Good-to-Have, used to divide up the project. What are the bare requirements that I could still complete eventually, even if a natural disaster like an earthquake happened?

However, I made a mistake here. When analyzing the UK housing market, it’s inevitable to encounter location data. Initially, I thought it’d be very cool to see a UK map with different colours representing the volume of house sales. I could use the Folium package in Python. I didn’t realize how challenging it would be, and I put it into the Must-Have basket.

Firstly, I was relatively new to the Folium library. I knew some basics and had gone through some examples in the documentation. But the examples use a US map rather than a UK map, so I had to figure out how to hook up a UK map. I thought it would be as easy as swapping a few parameters. It turned out I was naive.
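For readers facing the same hurdle, here is a minimal sketch of how a UK choropleth might be wired up in Folium. It is not the project’s actual code: the GeoJSON structure, the "name" property, the "region" and "sales" column names, and both helper functions are assumptions made for illustration. The first helper uses only plain Python and lists the region names in the data that have no matching GeoJSON feature, which is typically the part that breaks when swapping in a different map.

```python
def unmatched_regions(geojson, region_names, name_property="name"):
    """Return region names in the data that have no feature in the GeoJSON."""
    known = {feature["properties"][name_property] for feature in geojson["features"]}
    return sorted(set(region_names) - known)


def build_sales_map(geojson, sales_df, name_property="name"):
    """Draw a choropleth of sales volume per region.

    Assumes sales_df is a pandas DataFrame with hypothetical columns
    "region" and "sales", and that folium is installed.
    """
    import folium  # imported lazily so unmatched_regions works without it

    # Roughly centred on Great Britain; location and zoom are guesses.
    sales_map = folium.Map(location=[54.5, -3.0], zoom_start=6)
    folium.Choropleth(
        geo_data=geojson,
        data=sales_df,
        columns=["region", "sales"],
        # key_on must point at the GeoJSON property holding the region name.
        key_on=f"feature.properties.{name_property}",
        fill_color="YlGn",
        legend_name="Number of sales",
    ).add_to(sales_map)
    return sales_map


if __name__ == "__main__":
    # Toy GeoJSON with two made-up features (geometry omitted for brevity).
    toy_geojson = {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {"name": "Bristol, City of"}},
            {"type": "Feature", "properties": {"name": "Leeds"}},
        ],
    }
    print(unmatched_regions(toy_geojson, ["City of Bristol", "Leeds"]))
```

Running the name check before drawing anything shows up front which regions would end up without a value on the map.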

This led me to another problem: my shallow understanding of UK geography. I’m from China, and the geographical definitions and systems of the UK and China differ a great deal. I spent hours reading Wikipedia pages and articles to understand the terminology the UK uses. To date, I still can’t really say I’ve mastered it, but I’m getting there.

What would I do to make it better if restarting the project?

I’d learn to make micro-decisions to move items from must-have to good-to-have.

Mistake 4: You Have to Make Micro-decisions

What do I mean by micro-decisions? I put visualizing information geographically into the Must-Have basket because I overestimated my ability to use an unfamiliar library. That was the moment when I should have said, “Let’s put it aside and come back to it later.”

I didn’t. I chose to defeat this monster at that moment. I did.

“Well done,” I’d say to myself, but it wasn’t smart.

Why?

My reflection was that visualizing information geographically wasn’t necessary for the questions in my project. What’s more, the devil is in the details.

I encountered a problem many data analysts absolutely hate: text data cleaning. When merging in data from another dataset, you’ll most likely need to clean it. What I encountered was a few rows that didn’t match even though they referred to the same thing. For example, the dataset for the project contains a row, “City of Bristol”, whereas another dataset has a row, “Bristol, City of”. I spotted this situation and cried inside. I tweeted my frustration.

Then I chose to check manually. I converted all the text to lower case first, so I had to check 45% of the data rather than 100%. But it was still time-consuming and a pain in the neck.
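The lower-casing step can be sketched roughly like this. It is a minimal illustration rather than the code used in the project: the function names and sample area names are made up, and it only handles differences in case and whitespace, so anything it cannot match still needs checking by hand.

```python
def normalise(name):
    """Case-fold and collapse whitespace so trivial differences disappear."""
    return " ".join(name.casefold().split())


def split_by_match(left_names, right_names):
    """Split left_names into (matched, unmatched) against right_names."""
    right_keys = {normalise(name) for name in right_names}
    matched, unmatched = [], []
    for name in left_names:
        if normalise(name) in right_keys:
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical area names from two datasets.
    project_names = [" City of Bristol", "LEEDS", "York"]
    other_names = ["Bristol, City of", "Leeds", "york "]
    matched, unmatched = split_by_match(project_names, other_names)
    print("matched:", matched)
    print("still to check by hand:", unmatched)
```

Only the names in the second list need a manual look, which is how normalising first shrinks the pile.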

The good thing is that I’m now aware of this problem. When I read posts from this blog, I found the solution, and learning it is part of my plan. In Python you can use the Python Record Linkage Toolkit, fuzzymatcher or fuzzywuzzy. At the time, I thought exploring them would take more time than I had planned.
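To give a feel for what fuzzy matching does, here is a small sketch that uses only Python’s standard library. It does not use the API of any of the packages named above: difflib is swapped in as a stand-in, and the function names and the 0.8 threshold are assumptions for illustration. The idea is to strip punctuation, sort the words in each name, and then compare the results, so that word order stops mattering.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(name):
    """Lower-case, drop punctuation, and sort the words alphabetically."""
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(sorted(words))


def similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring word order."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: similarity(name, candidate))
    return best if similarity(name, best) >= threshold else None


if __name__ == "__main__":
    other_names = ["Leeds", "Bristol, City of"]
    print(best_match("City of Bristol", other_names))
    print(best_match("Manchester", other_names))
```

A dedicated library is likely to be faster and more robust on millions of rows; the sketch only shows why “City of Bristol” and “Bristol, City of” can be paired automatically.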

Conclusion

More time doesn’t mean better quality

I did spend 105 hours on this project, but that included selecting data, asking questions, finding the best chart to answer each one, making slides, practising my presentation, and proofreading my writing again.

Mistakes are learning points

Next time I will know whether I’ve improved or not. The next project will be better.
I challenged myself with 22 million records, and the experience of dealing with a massive dataset is valuable.

Data science is a field for lifelong learning

In data science, every day is a school day. In the future, I’ll ask for feedback, read others’ analyses, and seek inspiration.

Keep learning, keep coding!

About the author

Shenghong (David) Zhong is an aspiring data scientist, amateur comedian, and Indian food enthusiast based in London, the United Kingdom. Artificial intelligence is changing humanity, so he decided to see if he could be part of the big movement.

Twitter:@ShenghongZhong

LinkedIn:@David Zhong

Instagram:@davidzhongg

YouTube: Thedatum

Jovian:@Shenghongzhong
