It may be because I’m a writer, or it may be because I know I’ll forget anything I don’t immediately write down, but documentation comes easily to me. As soon as I learn something, I feel the need to document it: how I learned it, and all the surrounding circumstances.
Especially when it comes to datasets, anything you glean tends to be important in the grand scheme of things. One bit of information can decide whether it becomes part of your presentation or you discard it and find a different angle. It could be part of a larger trend, or it could be small enough to ignore.
One of the things I love about data and analysis in general, whether it ends up as a presentation, a report, or whatever it may be: no information goes unmentioned. Everything is considered. Is there bias? How old is the data? Does that affect the way the visualizations look? What recommendations do we have for next steps? What does it look like to put this analysis and its conclusions to use?
At the moment, I’ve got a couple of things going at once. I’m wrapping up the Curveballs v Fastballs analysis with a presentation that I’m hoping to record, and I’m working on the three-year anniversary grocery project update. Because I wanted to get back into playing with SQL, I picked up the Video Game Sales dataset that I’ve had for about a year now. I poked around on it in Python, which you can see here.
There’s always something so exciting to me about getting into a dataset, whether it’s for the first time or it’s been a while, like in this case. Especially since the last time I worked on this was before I had some of the tools I have now, from different ways of thinking down to different ways of actually seeing what the data has for me (like BigQuery).
This time around I had a completely different perspective on how to get into the data, how to clean it, what to include and what to leave out, and ways to get answers I hadn’t considered before.
One of the places I stalled out before with this dataset was the hard flat line at the bottom of the plot, where a good chunk of the entries, easily about a quarter, sat. I puzzled over it for a while, until I went back to the entry on Kaggle and found that the sales figures are recorded in millions of units.
That can be very exclusionary depending on the entries in the dataset. Not every game reaches a million copies, so a bunch of entries never really get plotted at all. I’m tempted to set the data aside for that reason alone and find a similar dataset with more variety in the sales figures, or just do a different analysis.
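The flat line makes more sense once the units are clear: anything selling under about ten thousand copies rounds down to roughly 0.00 million and plots as a flat line at the bottom. Here’s a minimal Python sketch of that effect, using made-up numbers rather than the real dataset:

```python
# Toy sales figures in millions of units (illustrative, not real data).
toy_global_sales = [82.74, 1.25, 0.30, 0.01, 0.0, 0.0, 0.0, 0.0]

# Entries under 0.01 million (~10,000 copies) effectively plot as zero.
flat = [s for s in toy_global_sales if s < 0.01]
share = len(flat) / len(toy_global_sales)
print(f"{len(flat)} of {len(toy_global_sales)} entries sit on the flat line ({share:.0%})")
```

With the toy numbers above, half the entries collapse onto that bottom line, which is the kind of loss of detail worth flagging in a presentation.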
At present my plan is to move forward with the initial idea, and to include a slide in the final presentation pointing out this flaw and how much less detail some of the entries really provide. I did the same with the Ice Cream Code report, noting throughout that the data was not only a few years old but also narrow in scope, with limited information.
No matter what ends up happening with the dataset, whether I find there’s not enough there for my original idea or not, it’ll be worth the effort just to see what all is there.
I’ve had this dataset loaded into BigQuery for a few weeks in anticipation of working on it. There are 11 columns in total: Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales.
The dataset is from 2017, so it’s about 6 years old. From the basic queries I’ve done so far, I’ve found that there are 32 different platforms, 12 different genres, and 580 publishers in total.
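The distinct counts come from simple COUNT(DISTINCT …) queries. Here’s a hedged sketch of the same idea, run against a tiny in-memory SQLite table standing in for BigQuery; the column names follow the Kaggle dataset, but the rows are made up for illustration:

```python
import sqlite3

# In-memory stand-in for the BigQuery table (toy rows, not real data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vgsales (Name TEXT, Platform TEXT, Genre TEXT, Publisher TEXT)")
con.executemany(
    "INSERT INTO vgsales VALUES (?, ?, ?, ?)",
    [
        ("Wii Sports", "Wii", "Sports", "Nintendo"),
        ("Mario Kart Wii", "Wii", "Racing", "Nintendo"),
        ("Grand Theft Auto V", "PS3", "Action", "Take-Two Interactive"),
    ],
)

# Same shape of query that yields the 32 / 12 / 580 counts on the full dataset.
platforms, genres, publishers = con.execute(
    "SELECT COUNT(DISTINCT Platform), COUNT(DISTINCT Genre), COUNT(DISTINCT Publisher)"
    " FROM vgsales"
).fetchone()
print(platforms, genres, publishers)  # 2 3 2
```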
The platforms, most abbreviated, are as follows: DC, DS, GB, GC, PC, PS, XB, 3DS, GBA, GEN, N64, NES, PS2, PS3, PS4, PSP, PSV, SAT, SCD, Wii, 2600, SNES, WiiU, X360, XOne, GG, NG, WS, 3DO, PCFX, TG16.
The genres are as follows: Platform, Racing, Sports, Adventure, Action, Simulation, Misc., Role-Playing, Puzzle, Strategy, Shooter, and Fighting.
I wasn’t sure what to expect with publishers, but I can safely say I wasn’t expecting nearly 600 distinct ones.
One of the inconsistencies I noticed: despite the dataset last being updated about six years ago, there are entries dated 2017 and 2020. I will probably remove them, because out of 16,000+ entries, those years account for only about four. Why are they there in the first place? Most likely vgchartz.com, where the data was scraped from, listed those games with expected release dates in those years.
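Dropping those stray years is a one-line filter. A minimal sketch of that cleaning step, again using a toy SQLite table in place of BigQuery (the real query would just be the DELETE, or a WHERE clause on the SELECT side):

```python
import sqlite3

# Toy table standing in for the real one (illustrative rows only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vgsales (Name TEXT, Year INTEGER)")
con.executemany(
    "INSERT INTO vgsales VALUES (?, ?)",
    [("Game A", 2006), ("Game B", 2017), ("Game C", 2020), ("Game D", 2010)],
)

# Remove the handful of suspect 2017/2020 entries.
con.execute("DELETE FROM vgsales WHERE Year IN (2017, 2020)")
remaining = con.execute("SELECT COUNT(*) FROM vgsales").fetchone()[0]
print(remaining)  # 2
```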
Okay, one last basic query and then I’ll wrap this up.
I essentially wanted to do min and max of sales on each sales column. I still need to work on my more complex query to get more information than just the number, but I found the following for each of the sales categories:
NA_Sales max was 41.49
JP_Sales max was 10.22
EU_Sales max was 29.02
Other_Sales max was 10.57
And finally, Global_Sales max was 82.74
The min for every column was just 0.0. That didn’t feel particularly noteworthy, and I suspected it would happen anyway, considering how the sales are recorded.
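That "more information than just the number" query doesn’t have to be complex: sorting by the sales column and taking the top row returns the game behind the max, not just the value. A hedged sketch with a toy SQLite table (the 41.49 figure is from my query above; the name pairings here are illustrative):

```python
import sqlite3

# Toy table standing in for the real one (illustrative rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vgsales (Name TEXT, NA_Sales REAL)")
con.executemany(
    "INSERT INTO vgsales VALUES (?, ?)",
    [("Wii Sports", 41.49), ("Super Mario Bros.", 29.08), ("Duck Hunt", 26.93)],
)

# Top seller by NA_Sales: the row behind the max, not just the number.
name, na_max = con.execute(
    "SELECT Name, NA_Sales FROM vgsales ORDER BY NA_Sales DESC LIMIT 1"
).fetchone()
print(name, na_max)  # Wii Sports 41.49
```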
That’s all my cursory queries and analysis for now. The next post I make will include the rest of the more complex analysis, and with any luck it’ll either be the report or presentation, whatever I decide to make the end result for this one.
As for my next post, I’m hoping it’ll be another portfolio update. Stay tuned regardless, I’m hoping to keep up the weekly pace! Until then, I hope you have swift and easy data cleaning, no matter what form it takes.