Portfolio Post: Taking the Groceries to Python

I admit that I was hoping to have a little more to talk about here for this post, but most of it is about cleaning data. That doesn’t seem too unusual, considering so much of the data analyst and scientist’s work tends to be cleaning.

That said, I do want to focus on how I recorded and entered this data, how I figured out what needed to change, and how I’m going to go about changing it.

One of the biggest parts of collecting this data originally was figuring out how to store it. I started in Excel, with one row per date and a column for each piece of data I wanted to collect from the receipt. This isn’t too interesting unless you understand how Excel and its formulas translate to a CSV read in Python.

Which is to say: between the Excel formulas I used to calculate how much was spent in each section of the grocery store every week and the way I formatted the currency, I essentially locked many of the columns into being read as object rather than as a numeric type like int or float.
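For anyone following along, here is a minimal sketch of how currency strings like these could be converted in pandas. The column names and values are made up for illustration and are not the real dataset; the assumption is that the amounts arrive as text such as "$12.50".

```python
import io

import pandas as pd

# Hypothetical CSV export: the column names and amounts are invented for this sketch.
csv_text = io.StringIO(
    "date,produce,dairy,total\n"
    "2019-06-01,$12.50,$8.25,$20.75\n"
    '2019-06-08,$9.00,$4.10,"$1,013.10"\n'
)
df = pd.read_csv(csv_text)

# Strip the dollar sign and thousands separator, then convert to numbers.
# errors="coerce" turns anything that still isn't numeric into NaN instead of raising.
money_cols = ["produce", "dairy", "total"]
for col in money_cols:
    cleaned = df[col].astype(str).str.replace(r"[$,]", "", regex=True)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

print(df.dtypes)
```

After this, the money columns should show up as float64, so sums and charts treat them as numbers rather than text.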

It’s not hard to correct, but I realize I also have to work on the hour format for when we checked out our groceries. Hah.
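The checkout time could be handled in a similar way. This is only a sketch: it assumes the times were typed as text like "5:42 PM", which may not match the real spreadsheet, and the column name checkout_time is hypothetical.

```python
import pandas as pd

# Hypothetical data: the column name and the "H:MM AM/PM" format are assumptions.
df = pd.DataFrame({"checkout_time": ["5:42 PM", "9:05 AM"]})

# Parse the text into datetimes. With only a time given, pandas fills in a
# placeholder date, so only the time-of-day parts are meaningful here.
parsed = pd.to_datetime(df["checkout_time"], format="%I:%M %p", errors="coerce")

# Keep the hour (0-23) as its own column for grouping or plotting.
df["checkout_hour"] = parsed.dt.hour

print(df)
```

If the real column uses a different layout, only the format string would need to change.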

I am almost tempted to go back and fix it all in Excel, and that reflex is one I have to fight constantly. I know Excel far better than I know Python, just by virtue of having used it for most of my life, so it’s hard to lean into the cleaning power of pandas instead when I know how to take care of the problem in Excel.

This was my face before I imported plotly and got my total cost versus date time series line chart, and after, when I realized the total cost wasn’t behaving the way I wanted it to:

So that’s pretty unfortunate. I ran .describe() and .info() beforehand, and I just didn’t notice the entire batch of “object” columns.
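One reason this is easy to miss: on a DataFrame with mixed column types, .describe() summarizes only the numeric columns by default, so text columns are left out of the summary entirely. A small sketch of a more direct check, using a made-up frame rather than the real data:

```python
import pandas as pd

# Hypothetical frame: "total" is stored as text, "items" as a number.
df = pd.DataFrame(
    {
        "date": ["2019-06-01", "2019-06-08"],
        "total": ["$20.75", "$31.10"],
        "items": [12, 18],
    }
)

# describe() on a mixed frame only covers numeric columns by default.
summary_cols = list(df.describe().columns)

# List every column that is NOT numeric, so nothing slips by unnoticed.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print("described:", summary_cols)
print("still text:", non_numeric)
```

Anything in the second list that should be a number or a date is a candidate for cleaning before plotting.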

I’m considering sharing this dataset far and wide once I clean it up a bit, so please feel free to download it from GitHub as well and play with it if you like. As I mentioned before, the timeline is essentially summer 2019-2022 with some outliers and some bits missing.

You can also watch my GitHub to see any updates on my current projects, so feel free to click through on this link.

My work on those projects is slowing down due to vacation from last week, personal reasons, and because the residency program is wrapping up! Next week I’ll write a deeper blog post on how that’s gone over the course of the past 8 weeks, some of my work, and how it turned out. It’s a bittersweet ending, but I’m excited to be on the other end of it.

As always, I wish the swiftest data cleaning to you and I’ll see you all next week!
