Comparing CSV files with sets

Rodrigo Girão Serrão
November 01, 2023 • Estimated Reading Time: 5 minutes

Hey there, 👋

How is your Python going?

In this Mathspp Insider 🐍🚀 email we’ll talk about CSV files and sets.

I don’t know what I’m doing!

I moved my newsletter from ConvertKit to beehiiv…

But I don’t know if I did it right!

If you are reading this email, would you mind replying with something like “It worked!” so that I know I didn’t screw up the migration, please? 🙏

The trick of the 327 disappearing subscribers

Like I said, I moved my newsletter from ConvertKit to beehiiv.

To do that, I had to export the subscriber data from ConvertKit and import it into beehiiv.

When I exported the data from ConvertKit, I got a CSV file with 16,099 rows.

However, when I imported it to beehiiv, I only got 15,772 subscribers…

In this email I’ll tell you how I used Python to figure out why I lost 327 subscribers in this process and who those 327 subscribers were.

Reading CSV files with the module `csv`

The module csv is part of the Python standard library and lets you read CSV files easily.

Suppose you have a CSV file called data.csv with this data:

name,email
Rodrigo,[email protected]
John,[email protected]
Mary,[email protected]

You can use the module csv to read this data in four simple steps:

import the module
open the file
create a “CSV reader”
iterate over the reader to get the data

This is what those steps look like in code:

import csv

# newline="" is what the csv docs recommend when opening a file for the module csv.
with open("data.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

If you run the code above, this is what you get:

['name', 'email']
['Rodrigo', '[email protected]']
['John', '[email protected]']
['Mary', '[email protected]']

As you can see, the module csv parsed each line of the file and returned the data of the CSV row in the form of a list.
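
If you would rather refer to columns by name than by position, the module csv also provides csv.DictReader, which uses the first row as the keys of a dictionary. Here is a minimal sketch; the file contents and the example.com addresses are made-up placeholders so that the snippet runs on its own:

```python
import csv
import io

# Made-up data standing in for data.csv; io.StringIO lets a string act as a file.
fake_file = io.StringIO(
    "name,email\n"
    "Rodrigo,rodrigo@example.com\n"
    "John,john@example.com\n"
)

reader = csv.DictReader(fake_file)  # The first row becomes the dictionary keys.
rows = list(reader)

for row in rows:
    print(row["name"], row["email"])
```

With csv.DictReader the header row is consumed for you, so it never shows up among the data rows.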

Set comprehension with all emails

When I exported the data from ConvertKit I got a much bigger CSV file.

The file had more rows and more columns.

But the first thing I did was build a Python set of all the emails that I exported out of ConvertKit.

I repeated the steps above, but instead of printing each CSV row I used a set comprehension to extract the emails:

import csv

with open("convertkit_export.csv", "r", newline="") as file:
    reader = csv.reader(file)
    ck_emails = {row[1] for row in reader}

This produced a Python set with 16,099 email addresses.

(If you’re not familiar with these concepts, you can read this article about sets and this article about comprehensions.)

Next, I had 15,772 subscribers in beehiiv.

I asked beehiiv to export all my subscribers and did a similar thing to get a set with 15,772 unique email addresses:

with open("beehiiv_export.csv", "r", newline="") as file:
    reader = csv.reader(file)
    beehiiv_emails = {row[1] for row in reader}
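
One detail to watch for: csv.reader also yields the header row, so a comprehension like the ones above puts the column title in the set as well. Whether that matters depends on your data. A possible way around it, sketched here with made-up data, is to consume the header with next before building the set:

```python
import csv
import io

# Made-up export; the real files have more rows and more columns.
fake_export = io.StringIO(
    "name,email\n"
    "Rodrigo,rodrigo@example.com\n"
    "John,john@example.com\n"
    "John again,john@example.com\n"
)

reader = csv.reader(fake_export)
header = next(reader)  # Pull the header row out of the reader first.
emails = {row[1] for row in reader}

print(header)       # ['name', 'email']
print(len(emails))  # 2, because sets drop duplicates
```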

Comparing the sets

At this point, I had two different sets:

ck_emails
beehiiv_emails

What I wondered was, “What emails did I lose in the process?”

In other words, which emails are in the set ck_emails but not in the set beehiiv_emails?

How would you go about computing this?

Use the two sets below as a test:

ck_emails = {"[email protected]", "[email protected]", "[email protected]"}

beehiiv_emails = {"[email protected]", "[email protected]"}

# Write some code here to figure out what
# emails are in the ck set but not in the beehiiv one.
...


# This should show John's email, only.
print(ck_but_not_in_beehiiv_emails)

What we want to calculate is the set difference of the two sets, and Python sets support that operation.

You can even use the operator - (the same minus that computes the difference between two numbers) to compute the difference between two sets:

>>> ck_emails - beehiiv_emails
{'[email protected]'}
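
The operator - has a method counterpart, difference, and in both spellings the order of the operands matters. A small sketch with placeholder addresses (the example.com emails are made up):

```python
ck = {"rodrigo@example.com", "john@example.com", "mary@example.com"}
beehiiv = {"rodrigo@example.com", "mary@example.com"}

# Same result as ck - beehiiv.
only_in_ck = ck.difference(beehiiv)
print(only_in_ck)  # {'john@example.com'}

# Swapping the operands asks the opposite question.
only_in_beehiiv = beehiiv - ck
print(only_in_beehiiv)  # set()
```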

So, I used the operator - to compute the difference between the two sets and got a set with the 327 emails that didn’t make it into beehiiv:

diff = ck_emails - beehiiv_emails
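
A quick sanity check on numbers like these: if every beehiiv email also exists in the ConvertKit set, the size of the difference has to equal the gap between the two sizes, and 16,099 - 15,772 is indeed 327. Sketched with small made-up sets:

```python
ck = {"a@example.com", "b@example.com", "c@example.com", "d@example.com"}
beehiiv = {"a@example.com", "c@example.com"}

diff = ck - beehiiv

# On sets, <= tests "is a subset of".
assert beehiiv <= ck
# When beehiiv is a subset of ck, the sizes must add up.
assert len(diff) == len(ck) - len(beehiiv)

print(len(diff))  # 2
```

If the subset check fails, some addresses exist only on the second platform, and the two sizes alone no longer tell the whole story.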

I didn’t read every single entry in that set, but when I looked at the set, I saw something like this:

    # ...
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    # ...

In case you don’t know, those are emails from services that let you create disposable emails.
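
When a set is too big to read entry by entry, counting the domains can make a pattern like this jump out. This is only a sketch of one possible approach; the addresses and the domain tempmail.example are made up:

```python
from collections import Counter

# Made-up stand-in for the set of lost emails.
diff = {
    "x1@tempmail.example",
    "x2@tempmail.example",
    "x3@tempmail.example",
    "someone@example.com",
}

# Take whatever comes after the last "@" as the domain.
domains = Counter(email.rpartition("@")[2].lower() for email in diff)

# Most frequent domains first.
for domain, count in domains.most_common():
    print(domain, count)
```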

What probably happened is that people used disposable emails to get my Pydon’ts book, which is free, instead of using their own email.

The joke is on them, because now they won’t get the free updates that the book gets from time to time 😆

Sets are great, learn about them

There are two things I want you to take away from this email:

Knowing just a little bit of Python can really help you out with problems in your daily life!
Sets are great, learn more about them. If you don’t know the operations that sets support that well, check out the docs page about sets. You won’t regret it.