Written by Dennis Jiang
Published on November 2, 2024
My team and I started with a large dataset of Amazon reviews collected in 2023 by UCSD and UCLA students for AI model training.
We chose a 'smaller' subset of the dataset: 1.8 million reviews in the CDs and Vinyl category. (Yes, people still buy CDs and vinyl!)
In this piece, we won't document every detail. Instead, our goal is to paint in broad strokes and give you a sense of what Rosie can do.
Of course, if you have any questions, don't hesitate to reach out.
When we loaded the data into Excel, we first encountered parse errors, which are very common with raw data files.
We asked Rosie to identify and resolve these issues. Next, when we tried to load the dataset using Power Query, we hit size limitations, another common issue with large datasets.
To overcome this, Rosie wrote us a Python script to break the file into smaller, Excel-manageable chunks.
From there, we reloaded the data into Power Query, and Rosie helped us extract insights from the individual reviews.
This is where it got exciting, so stick around to learn more!
Dataset site: https://amazon-reviews-2023.github.io
Here’s a glimpse of two records from the dataset. Each review has multiple attributes, and we explored both the text fields and other metadata to generate meaningful insights. (Spoiler alert: the commas separating JSON objects are missing, and this will create issues soon.)
We first loaded the dataset via Power Query, which—unsurprisingly—produced parse errors given the formatting issues mentioned above.
We copied a few records into Rosie and asked her to identify the problem. She found two issues:
To fix this, we asked Rosie for help.
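For readers who want to try something similar, here is a minimal Python sketch of one way to repair this kind of file. It is an illustration, not Rosie's actual output. It assumes each review sits on its own line as a complete JSON object, and the helper name `jsonl_to_array` and the sample records are made up for the example.

```python
import json

def jsonl_to_array(lines):
    """Parse one JSON object per line and return them as a list."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        records.append(json.loads(line))
    return records

# Two invented records standing in for real reviews (hypothetical values).
sample = [
    '{"rating": 5.0, "title": "Great album", "text": "Loved every track."}',
    '{"rating": 2.0, "title": "Scratched", "text": "Arrived damaged."}',
]

records = jsonl_to_array(sample)

# json.dumps on a list emits the enclosing brackets and the commas
# between objects, so the result is a single valid JSON document.
array_text = json.dumps(records, indent=2)
print(array_text)
```

Parsing each line separately and re-serializing the whole list avoids patching commas in by hand, which can go wrong when the review text itself contains braces or commas.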
As anticipated, the full dataset was too large for Excel. To solve this, we asked Rosie for a Python script to split the data into smaller chunks.
We then ran the script locally to produce the smaller files.
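The script itself isn't reproduced in this post, but a splitter along these lines would do the job. This is a sketch under assumptions: the function names `chunked` and `split_file`, the part-file naming, and the default of 500,000 lines per chunk are all our own choices for illustration. The default is simply a round number below the 1,048,576 rows an Excel worksheet can hold.

```python
from itertools import islice
from pathlib import Path

def chunked(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def split_file(src, out_dir, lines_per_chunk=500_000):
    """Write `src` out as numbered part files and return their paths.

    The 500,000 default is an assumption, not a value from the article.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with src.open(encoding="utf-8") as f:
        # Iterating over the file reads one line at a time, so only
        # one chunk is held in memory at once.
        for i, chunk in enumerate(chunked(f, lines_per_chunk), start=1):
            part = out_dir / f"{src.stem}_part{i:03d}{src.suffix}"
            part.write_text("".join(chunk), encoding="utf-8")
            paths.append(part)
    return paths
```

Calling `split_file("reviews.jsonl", "chunks")` on a hypothetical `reviews.jsonl` would write `reviews_part001.jsonl`, `reviews_part002.jsonl`, and so on into a `chunks` folder. Splitting on line boundaries keeps every record intact, which matters when each line is one review.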
With the smaller file loaded into Excel, we asked Rosie:
She immediately suggested ten different analyses.
Here’s what she came up with:
It turns out that people tend to write more when they like a product, less when they don’t, and even less when they’re indifferent.
According to the data, positive reviews have an average of 65 words, negative reviews have 41, and neutral reviews have just 9 words.
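If you'd like to run a similar check on your own data, the calculation can be sketched in Python as follows. One big assumption: this post doesn't say how reviews were labeled positive, negative, or neutral, so the sketch simply maps star ratings to labels (4 to 5 stars positive, 3 neutral, 1 to 2 negative). The function names and the sample reviews are invented for the example.

```python
from collections import defaultdict

def sentiment(rating):
    """Map a star rating to a label (assumed mapping, not from the article)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"

def avg_words_by_sentiment(reviews):
    """Return the mean whitespace-separated word count per sentiment label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [word total, review count]
    for review in reviews:
        label = sentiment(review["rating"])
        totals[label][0] += len(review["text"].split())
        totals[label][1] += 1
    return {label: words / count for label, (words, count) in totals.items()}

# Invented sample reviews, for illustration only.
sample = [
    {"rating": 5.0, "text": "Loved every track on this record"},
    {"rating": 4.0, "text": "Solid pressing, great sound"},
    {"rating": 3.0, "text": "It is fine"},
    {"rating": 1.0, "text": "Arrived scratched and warped"},
]

averages = avg_words_by_sentiment(sample)
print(averages)
```

Counting words with `str.split()` is a rough measure, since it treats anything between spaces as a word, but it is usually close enough for comparing groups.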
This experience was a total game changer.
In just minutes, we knocked out data challenges that would've taken hours—maybe even days—by hand.
With Rosie.ai, you're not just fixing problems—you're leveling up your entire Excel game.
Whether it's cleaning up messy data or diving deep into analysis, Rosie is like your personal data expert—always on, always ready for the next challenge.
And honestly, this is just the start. The possibilities are endless.
Imagine what you could do if you weren’t stuck on the tedious stuff.
So, what’s your next data challenge? Rosie's ready. Are you?
Thanks for reading!
Dennis Jiang
Cofounder and CEO