I’m one of those people who keep every text message they send or receive; I never delete them. Meet a girl at a bar, text her the next day and never hear back from her? I keep that. Weird wrong-number texts? I keep those too. Ex-girlfriend texts? Definitely keepers.
I had 65,378 messages on my phone at the time of writing this post.
I’m not a digital hoarder or anything; I mostly do this because I like the idea of being able to search through the past. But, digital hoarder or not, collecting anything takes up space, and when I found that my text messages were taking up 4 GB on my phone, I decided it was time to back them up. It was at that point that I realized I could probably analyze them too.
As it turns out, you can do this, and I’ll tell you how. For this project, I used Python/Pandas/NLTK for the analysis and an IPython Notebook to render the datasets. I’ve also uploaded the code to GitHub, which you can view here.
An overview of the steps to make this happen:
- Sync/back up your iPhone because the messages need to be stored on your computer.
- Load the SQLite file and retrieve all messages.
- You can follow the directions for retrieving the right file here.
- Analyze those messages (I used Pandas)!
Let’s get into some details.
You need to sync and back up your phone’s contents to your computer. There’s a great post on how to do this here. In case you want to skip that read, the short version is that you’re locating the backup file that contains your text messages, copying it, and moving the copy into your working directory.
You can find the file with this bash command:
$ find / -name 3d0d7e5fb2ce288813306e4d4636395e047a3d28
Now, on to loading the SQLite file. You can actually see what’s in this file via the command line:
$ sqlite3 3d0d7e5fb2ce288813306e4d4636395e047a3d28
Then you can check out the available tables:
sqlite> .tables
_SqliteDatabaseProperties  chat_message_join
attachment                 handle
chat                       message
chat_handle_join           message_attachment_join
From here, the main tables I found useful were “message” and “handle.” The former contains all of your text messages, and the latter contains all of the senders/recipients. I only wrote code around the message table, primarily because I could never figure out how to make a join between message and handle, but that was probably something trivial that I overlooked. Please tell me how you did it, if you did!
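For anyone who wants to try the join, here is a minimal sketch of one approach that may work. It assumes that handle_id in the message table points at the ROWID column of the handle table, and that the handle table keeps the phone number or email in a column named id; both are assumptions about the backup schema, so check them against your own file first. The function name and the tiny in-memory tables are made up for illustration.

```python
import sqlite3

def messages_with_handles(conn):
    """Pair each message with its contact, assuming message.handle_id -> handle.ROWID."""
    query = """
        SELECT m.text, m.is_sent, h.id AS contact
        FROM message AS m
        LEFT JOIN handle AS h ON m.handle_id = h.ROWID
        ORDER BY m.ROWID
    """
    return conn.execute(query).fetchall()

# Tiny in-memory stand-in for the real backup file (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE handle (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, id TEXT)")
conn.execute("CREATE TABLE message (text TEXT, is_sent INTEGER, handle_id INTEGER)")
conn.execute("INSERT INTO handle (id) VALUES ('+15550100')")
conn.execute("INSERT INTO message VALUES ('hello', 0, 1)")
conn.execute("INSERT INTO message VALUES ('no handle', 1, 0)")

rows = messages_with_handles(conn)
for text, is_sent, contact in rows:
    print(text, is_sent, contact)
```

The LEFT JOIN keeps messages whose handle_id matches nothing, so they show up with an empty contact instead of silently disappearing.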
Continuing on, the message table has lots of columns in it, and I chose to select from the following:
['guid', 'service', 'text', 'date', 'date_delivered',
'handle_id', 'type', 'is_read','is_sent', 'is_delivered',
'item_type', 'group_title']
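A minimal sketch of pulling just those columns out of the SQLite file with Python's built-in sqlite3 module. This is not the code from the repo; the function name load_messages is hypothetical, and the sketch assumes every listed column exists in your message table.

```python
import sqlite3

# Columns selected in the post.
COLUMNS = ['guid', 'service', 'text', 'date', 'date_delivered',
           'handle_id', 'type', 'is_read', 'is_sent', 'is_delivered',
           'item_type', 'group_title']

def load_messages(conn, columns=COLUMNS):
    """Return the selected message columns as a list of dicts, one per message."""
    # Quote the names so a column called "type" cannot be misread by the parser.
    column_sql = ", ".join('"{}"'.format(name) for name in columns)
    cursor = conn.execute("SELECT {} FROM message".format(column_sql))
    names = [description[0] for description in cursor.description]
    return [dict(zip(names, row)) for row in cursor]
```

With the backup copied into the working directory, usage would look like load_messages(sqlite3.connect("3d0d7e5fb2ce288813306e4d4636395e047a3d28")). The resulting list of dicts can be handed straight to pandas.DataFrame if you want to continue in Pandas.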
The key field is “text,” which stores the content of each message, emojis included! (A cool thing is that your emojis will show up if you try to plot them in something like an IPython notebook. You could run an entire analysis on emoji usage…)
My analysis, however, ultimately breaks down into two pieces:
- Analyzing the content of the “text” field (excluding emojis).
- Analyzing the messages themselves (for example, total text messages, or what I sent vs. what I received).
For #1, I wrote code that:
- Classifies all words and assigns a part of speech to them, then checks the counts of each part of speech.
- Counts the number of times each word appears in the dataset, and gives an overview of the dataset.
- Excludes boring words, like prepositions, and words that are shorter than 2 characters.
- Classifies all words as is_bad=1 or 0. I did this by using a .txt file full of bad words, found here.
- Plots usage of bad words
- I’d love to show you my plot, but let’s just assume I never swear…
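The counting and flagging steps in this list can be sketched with nothing but the standard library. This is an illustration, not the repo code: the two word sets are tiny hypothetical stand-ins for a real stop-word list and for the downloaded bad-word .txt file, and the part-of-speech step (which needs NLTK) is left out.

```python
import re
from collections import Counter

# Hypothetical stand-ins; swap in a real stop-word list and your bad-word file.
BORING_WORDS = {"the", "and", "for", "with", "from", "about", "into"}
BAD_WORDS = {"darn", "heck"}

def count_words(texts, min_length=2):
    """Count words across messages, skipping boring words and very short ones."""
    counts = Counter()
    for text in texts:
        if not text:  # some rows have no text at all
            continue
        # Keep only runs of letters and apostrophes, which also drops emojis.
        for word in re.findall(r"[a-z']+", text.lower()):
            if len(word) >= min_length and word not in BORING_WORDS:
                counts[word] += 1
    return counts

def flag_bad_words(counts):
    """Map each counted word to is_bad = 1 or 0."""
    return {word: int(word in BAD_WORDS) for word in counts}

counts = count_words(["Darn, I left the keys with you", None, "darn keys"])
flags = flag_bad_words(counts)
```

From here, counts.most_common(20) gives a quick overview of the most-used words.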
For #2, the code allows you to:
- Plot the number of text messages received each day (check out the spike on your birthday or during holidays). You can see my data below has a huge gap; that’s when my phone was replaced and not backed up for many months. My timestamp conversions are also apparently incorrect, but I haven’t looked into it.
- Count the number of sent versus received messages.
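A minimal sketch of both counts, including one possible fix for the timestamp problem. It rests on an assumption worth checking against your own data: that the date column counts seconds from 2001-01-01 UTC rather than from the Unix epoch, and that newer backups store the same value in nanoseconds. The function names and the size threshold are hypothetical, and each row is assumed to be a dict with "date" and "is_sent" keys.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed reference date for the "date" column (not the Unix epoch).
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def to_datetime(raw_date):
    """Convert a raw date value to an aware UTC datetime."""
    # Heuristic (assumption): values this large must be nanoseconds, not seconds.
    seconds = raw_date / 1e9 if raw_date > 1e11 else raw_date
    return APPLE_EPOCH + timedelta(seconds=seconds)

def messages_per_day(rows):
    """Count messages per calendar day (UTC)."""
    return Counter(to_datetime(row["date"]).date() for row in rows)

def sent_vs_received(rows):
    """Split the total into sent and received using the is_sent flag."""
    sent = sum(1 for row in rows if row["is_sent"])
    return {"sent": sent, "received": len(rows) - sent}

# Hypothetical rows for illustration.
rows = [{"date": 0, "is_sent": 1},
        {"date": 3600, "is_sent": 0},
        {"date": 86400, "is_sent": 0}]
per_day = messages_per_day(rows)
totals = sent_vs_received(rows)
```

If the converted dates land decades away from when you actually owned the phone, the epoch assumption is the first thing to revisit.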
Anyway, I hope you can get some use out of this, and instead of blabbing on about the code here, I’ll just let you read it and use it on your own. Please check out my git repo, and please reach out to me with questions, comments, etc.