Comparing CSV files with sets
Hey there, 👋
How is your Python going?
In this Mathspp Insider 🐍🚀 email we’ll talk about CSV files and sets.

I don’t know what I’m doing!
I moved my newsletter from ConvertKit to beehiiv…
But I don’t know if I did it right!
If you are reading this email, would you mind replying with something like “It worked!” so that I know I didn’t screw up the migration, please? 🙏
The trick of the 327 disappearing subscribers
Like I said, I moved my newsletter from ConvertKit to beehiiv.
To do that, I had to export the subscriber data from ConvertKit and import it into beehiiv.
When I exported the data from ConvertKit, I got a CSV file with 16,099 rows.
However, when I imported it to beehiiv, I only got 15,772 subscribers…
In this email I’ll tell you how I used Python to figure out why I lost 327 subscribers in this process and who those 327 subscribers were.
Reading CSV files with the module csv
The module csv is a Python module from the standard library that lets you read CSV files easily.
Suppose you have a CSV file called data.csv with this data:
name,email
Rodrigo,[email protected]
John,[email protected]
Mary,[email protected]
You can use the module csv to read this data in four simple steps:
1. import the module
2. open the file
3. create a “CSV reader”
4. iterate over the reader to get the data
This is what those steps look like in code:
import csv

# newline="" is what the csv docs recommend when opening CSV files.
with open("data.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
If you run the code above, this is what you get:
['name', 'email']
['Rodrigo', '[email protected]']
['John', 'joh[email protected]']
['Mary', '[email protected]']
As you can see, the module csv parsed each line of the file and returned each CSV row as a list.
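If you prefer to refer to columns by name instead of by position, the module csv also has DictReader, which uses the first row as the column names. Here is a minimal sketch; it reads from an in-memory string instead of a file so it runs anywhere, and the example.com addresses are placeholders made up for illustration:

```python
import csv
import io

# Placeholder data: the names match the example above, but the
# example.com addresses are invented for this sketch.
data = io.StringIO(
    "name,email\n"
    "Rodrigo,rodrigo@example.com\n"
    "John,john@example.com\n"
    "Mary,mary@example.com\n"
)

# DictReader treats the first row as the header and yields one dict per row.
reader = csv.DictReader(data)
emails = [row["email"] for row in reader]
print(emails)
```

Note that the header row is consumed automatically here, so it never shows up as data.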
Set comprehension with all emails
When I exported the data from ConvertKit I got a much bigger CSV file.
The file had more rows and more columns.
But the first thing I did was build a Python set of all the emails that I exported out of ConvertKit.
I repeated the steps above, but instead of printing each CSV row I used a set comprehension to extract the emails:
import csv

with open("convertkit_export.csv", "r", newline="") as file:
    reader = csv.reader(file)
    # row[1] is the column that holds the email address.
    ck_emails = {row[1] for row in reader}
This produced a Python set with 16,099 email addresses.
(If you’re not familiar with these concepts, you can read this article about set and this article about comprehensions.)
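One detail worth checking with this approach: csv.reader also returns the header row, so a comprehension over every row picks up the column title as if it were an email. If both exports use the same title for that column, it ends up in both sets and cancels out in the comparison, but you can also skip it explicitly. The sketch below uses made-up in-memory data (the addresses are placeholders) and shows that a set keeps each repeated address only once:

```python
import csv
import io

# Made-up export with a header row and one repeated address.
export = io.StringIO(
    "name,email\n"
    "Rodrigo,rodrigo@example.com\n"
    "John,john@example.com\n"
    "Rodrigo,rodrigo@example.com\n"
)

reader = csv.reader(export)
header = next(reader)  # Consume the header so "email" is not added to the set.
emails = {row[1] for row in reader}

print(header)       # ['name', 'email']
print(len(emails))  # 2, because the repeated address is stored only once
```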
Next, I had 15,772 subscribers in beehiiv.
I asked beehiiv to export all my subscribers and did a similar thing to get a set with 15,772 unique email addresses:
with open("beehiiv_export.csv", "r", newline="") as file:
    reader = csv.reader(file)
    # Same idea as before: row[1] is the email column.
    beehiiv_emails = {row[1] for row in reader}
Comparing the sets
At this point, I have two different sets:
ck_emails
beehiiv_emails
What I wondered was: “Which emails did I lose in the process?”
In other words, which emails are in the set ck_emails but not in the set beehiiv_emails?
How would you go about computing this?
Use the two sets below as a test:
ck_emails = {"[email protected]", "[email protected]", "[email protected]"}
beehiiv_emails = {"[email protected]", "[email protected]"}
# Write some code here to figure out what
# emails are in the ck set but not in the beehiiv one.
...
# This should show John's email, only.
print(ck_but_not_in_beehiiv_emails)
What we want to calculate is the set difference of the two sets, and Python sets support that operation.
You can even use the operator - (the same minus operator that subtracts two numbers) to compute the difference between two sets:
>>> ck_emails - beehiiv_emails
{'[email protected]'}
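If you want to try this yourself, here is a small runnable sketch with placeholder example.com addresses (they are made up, not real subscriber data). It also shows the method difference, which does the same thing as the operator, and that the order of the operands matters:

```python
# Placeholder addresses, invented for this sketch.
ck_emails = {"rodrigo@example.com", "john@example.com", "mary@example.com"}
beehiiv_emails = {"rodrigo@example.com", "mary@example.com"}

lost = ck_emails - beehiiv_emails
print(lost)  # {'john@example.com'}

# The method form is equivalent, and it also accepts any iterable.
assert lost == ck_emails.difference(beehiiv_emails)
assert lost == ck_emails.difference(list(beehiiv_emails))

# The difference is not symmetric: here, nothing is in beehiiv but not in ck.
print(beehiiv_emails - ck_emails)  # set()
```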
So, what I did was use the operator - to compute the difference between the two sets, and I got a set with 327 emails that I couldn’t upload to beehiiv:
diff = ck_emails - beehiiv_emails
I didn’t read every single entry in that set, but when I looked at the set, I saw something like this:
# ...
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'17[email protected]',
'[email protected]',
# ...
In case you don’t know, those are emails from services that let you create disposable emails.
What probably happened is that people used disposable emails to get my Pydon’ts book, which is free, instead of using their own email.
The joke is on them, because now they won’t get the free updates that the book gets from time to time 😆
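If you don’t want to read every entry to spot a pattern like this, one option is to count how many addresses share each domain. This is only a sketch: the set diff below is a made-up stand-in, and the domains in it are hypothetical, not the ones from the real export.

```python
from collections import Counter

# Made-up stand-in for the set of lost emails.
diff = {
    "a@disposable.example",
    "b@disposable.example",
    "c@throwaway.example",
    "d@example.com",
}

# Treat whatever follows the last "@" as the domain.
domains = Counter(email.rpartition("@")[2].lower() for email in diff)

# Most frequent domains first.
for domain, count in domains.most_common():
    print(domain, count)
```

Domains that show up many times at the top of the list are good candidates for a closer look.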
Sets are great, learn about them
There are two things I want you to take away from this email:
1. Knowing just a little bit of Python can really help you out with problems in your daily life!
2. Sets are great, learn more about them. If you don’t know the operations that sets support that well, check out the docs page about sets. You won’t regret it.
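As a quick taste of those operations, here is a short sketch with two small sets of numbers (chosen arbitrarily for illustration):

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(a | b)        # union: {1, 2, 3, 4, 5}
print(a & b)        # intersection: {3, 4}
print(a - b)        # difference: {1, 2}
print(a ^ b)        # symmetric difference: {1, 2, 5}
print({3, 4} <= a)  # subset test: True
```

Each operator also has a method form (union, intersection, difference, symmetric_difference, issubset) if you find names easier to read than symbols.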