Rosie in Real Life
7 min read

Rosie Tackles Big Data

Rosie.ai tackled a massive dataset of 1.8 million Amazon reviews by fixing errors and creating a Python script to slice the data into Excel-friendly chunks. She quickly delivered insights like sentiment analysis and review quality. What could’ve been a manual nightmare became a fast, painless analysis.
Written by Dennis Jiang
Published on November 2, 2024

=About the data

My team and I started with a large dataset of Amazon reviews collected in 2023 by UCSD and UCLA students for AI model training.

We chose a 'smaller' subset: 1.8 million reviews in the CDs and Vinyl category. (Yes, people still buy CDs and vinyl!)

In this piece, we decided not to document every detail. Rather, our goal is to paint in broad strokes and give you a sense of what Rosie can do.

Of course, if you have any questions, don't hesitate to reach out.

=Quick overview

When we loaded the data into Excel, we first encountered parse errors, which are very common.

We asked Rosie to identify and resolve these issues. Next, when we tried to load the dataset using Power Query, we hit size limitations, another common issue with large datasets.

To overcome this, Rosie wrote us a Python script to break the file into smaller, Excel-manageable chunks.

From there, we reloaded the data into Power Query, and Rosie helped us extract insights from the individual reviews.

This is where it got exciting, so stick around to learn more!

=Downloading the data

The dataset is available at: https://amazon-reviews-2023.github.io

=Laying the Foundation

Here’s a glimpse of two records from the dataset. Each review has multiple attributes, and we explored both the text fields and other metadata to generate meaningful insights. (Spoiler alert: the commas separating JSON objects are missing, and this will create issues soon.)

=Loading the Data

We first loaded the dataset via Power Query, which—unsurprisingly—produced parse errors given the formatting issues mentioned above.

=Solving the Parse Errors

We copied a few records into Rosie and asked her to identify the problem. She found two issues:

  1. Missing commas – JSON objects in an array should be separated by commas. Ours were missing.
  2. Missing enclosing structure – JSON objects need to be enclosed in square brackets if they’re part of an array.

To fix this, we asked Rosie for help.

"Rosie, help me write a Python script to fix the two errors you highlighted."

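For readers who want to see what such a fix could look like, here is a minimal sketch. It is our illustration, not the script Rosie actually produced. It assumes the raw file holds one complete JSON object per line; the function name `jsonl_to_json_array` and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def jsonl_to_json_array(src: Path, dst: Path) -> int:
    """Rewrite a one-object-per-line file as a single valid JSON array.

    Returns the number of records written. (Hypothetical helper.)
    """
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        fout.write("[\n")  # issue 2: add the opening square bracket
        for line in fin:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Assumption: each non-blank line is valid JSON on its own.
            # json.loads raises an error if it is not.
            record = json.loads(line)
            if count:
                fout.write(",\n")  # issue 1: add a comma between objects
            fout.write(json.dumps(record, ensure_ascii=False))
            count += 1
        fout.write("\n]\n")  # issue 2: add the closing square bracket
    return count
```

A call such as `jsonl_to_json_array(Path("reviews.jsonl"), Path("reviews_fixed.json"))` would process the file one line at a time, so memory use stays small even with millions of reviews.
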
=Dealing with Large Files

As anticipated, the full dataset was too large for Excel. To solve this, we asked Rosie for a Python script to split the data into smaller chunks.

=Splitting the Large File

We then ran that Python script locally to split the file into smaller pieces.
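
A split like this could be sketched as follows. Again, this is our illustration rather than Rosie's actual script. It assumes the file has one review per line, and the function name `split_lines`, the `_partNNN` file naming, and the default of 100,000 lines per chunk are all hypothetical choices. An Excel worksheet tops out at 1,048,576 rows, so any chunk size comfortably below that should fit.

```python
from itertools import islice
from pathlib import Path


def split_lines(src: Path, out_dir: Path, lines_per_chunk: int = 100_000) -> list[Path]:
    """Split a line-oriented text file into numbered chunk files.

    Returns the chunk paths in order. (Hypothetical helper.)
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with src.open(encoding="utf-8") as fin:
        while True:
            # Read at most lines_per_chunk lines without loading the whole file.
            batch = list(islice(fin, lines_per_chunk))
            if not batch:
                break  # reached the end of the source file
            name = f"{src.stem}_part{len(chunks) + 1:03d}{src.suffix}"
            chunk_path = out_dir / name
            chunk_path.write_text("".join(batch), encoding="utf-8")
            chunks.append(chunk_path)
    return chunks
```

Because the cut happens on line boundaries, every chunk keeps whole records. If the chunks come from the raw one-object-per-line file, each one would still need the comma and bracket fix described earlier before it parses as a JSON array.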

=Extracting Insights with Rosie

With the smaller file loaded into Excel, we asked Rosie:

“What insights can you help me uncover from these reviews?”

She immediately suggested ten different analyses.

Here’s what she came up with:

  1. Rating distribution – Analyze trends in 1- to 5-star ratings.
  2. Most frequent words – Identify common phrases across reviews.
  3. Helpfulness – Explore which reviews received the most helpful votes.
  4. Timestamp analysis – See if sentiments shift over time.
  5. Verified purchases – Compare the sentiments of verified vs. non-verified purchases.
  6. Top-reviewed products – Determine which products received the most reviews and analyze their sentiment.
  7. User analysis – Find out if certain users consistently post positive or negative reviews.
  8. Length of reviews – Check whether longer reviews tend to be more positive.
  9. Common phrases in positive/negative reviews – Spot phrases that often appear in highly positive or negative reviews.
  10. Correlation analysis – See if ratings correlate with helpful votes.

=Asking Rosie to help us analyze the data

"Are longer reviews more positive or negative?"

It turns out that people tend to write more when they like a product, less when they don’t, and even less when they’re indifferent.

According to the data, positive reviews average 65 words, negative reviews average 41, and neutral reviews average just 9.
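
For the curious, a comparison like this could be computed along the following lines. This is a sketch under our own assumptions, not Rosie's method: we bucket reviews by star rating (4 and up as positive, 2 and below as negative, anything in between as neutral), count words by splitting on whitespace, and assume each review has `rating` and `text` keys. The function names are hypothetical.

```python
from collections import defaultdict


def sentiment_label(rating: float) -> str:
    """Map a star rating to a coarse sentiment bucket (assumed thresholds)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def average_words_by_sentiment(reviews):
    """Return the mean word count of review text for each sentiment bucket."""
    # label -> [total words, number of reviews]
    totals = defaultdict(lambda: [0, 0])
    for r in reviews:
        label = sentiment_label(r["rating"])
        totals[label][0] += len(r["text"].split())  # whitespace-delimited words
        totals[label][1] += 1
    return {label: words / n for label, (words, n) in totals.items()}
```

A different definition of sentiment, for example one based on the review text instead of the star rating, could shift the averages, so treat the thresholds here as one reasonable choice among several.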

This experience was a total game changer.

In just minutes, we knocked out data challenges that would've taken hours—maybe even days—by hand.

With Rosie.ai, you're not just fixing problems—you're leveling up your entire Excel game.

Whether it's cleaning up messy data or diving deep into analysis, Rosie is like your personal data expert—always on, always ready for the next challenge.

And honestly, this is just the start. The possibilities are endless.

Imagine what you could do if you weren’t stuck on the tedious stuff.

So, what’s your next data challenge? Rosie's ready. Are you?

=Join Us on the Journey

Click here to try Rosie out yourself with our interactive demo

Thanks for reading!

Dennis Jiang

Cofounder and CEO
