Coding

How to Retrieve and Analyze Your iOS Messages with Python, Pandas and NLTK

I’m one of those people who keep every text message they send or receive; I never delete them. Meet a girl at a bar, text her the next day and never hear back from her? I keep that. Weird wrong-number texts? I keep those too. Ex-girlfriend texts? Definitely keepers.

I had 65,378 messages on my phone at the time of writing this post.

I’m not a digital hoarder or anything; I keep them mainly because I like being able to search through the past. Still, collecting anything takes up space, and when I found that my text messages were taking up 4 GB on my phone, I decided it was time to back them up. That’s when I realized I could probably analyze them, too.

As it turns out, you can do this, and I’ll tell you how. For this project, I used Python/Pandas/NLTK for the analysis and an iPython Notebook to render the datasets. I’ve also uploaded the code to GitHub, which you can view here.

An overview of the steps to make this happen:

  1. Sync/back up your iPhone because the messages need to be stored on your computer.
  2. Load the SQLite file and retrieve all messages
    • You can follow the directions for retrieving the right file here.
  3. Analyze those messages (I used Pandas)!

Let’s get into some details.

You need to sync and back up your phone’s contents to your computer. There’s a great post on how to do this here. In case you want to skip that read, the goal is to locate the backup file that contains your text messages, copy it, and move the copy into your working directory.

You can find the file with this bash command:

$ find / -name 3d0d7e5fb2ce288813306e4d4636395e047a3d28

Now, loading the SQLite file — you can actually see what’s in this file via the command line:

 $ sqlite3 3d0d7e5fb2ce288813306e4d4636395e047a3d28 

Then you can check out the available tables:

sqlite> .tables
_SqliteDatabaseProperties chat_message_join
attachment handle
chat message
chat_handle_join message_attachment_join

From here, the main tables I found useful were “message” and “handle.” The former contains all of your text messages, and the latter contains all of the senders/recipients. I only wrote code around the messages table, primarily because I could never figure out how to make a join between message and handle, but that was probably something trivial that I overlooked. Please tell me how you did it, if you did!
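On the join: one approach that appears to work, assuming handle_id in “message” points at the ROWID column in “handle” (worth confirming with .schema in sqlite3 on your own file), is a plain LEFT JOIN. The sketch below builds a tiny in-memory database with made-up rows so it runs without a backup; the table layout is a cut-down guess at the real one, not a copy of it.

```python
import sqlite3

# In-memory stand-in for the backup file; the real tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE handle (ROWID INTEGER PRIMARY KEY, id TEXT);
    CREATE TABLE message (ROWID INTEGER PRIMARY KEY, text TEXT,
                          handle_id INTEGER, is_sent INTEGER);
    INSERT INTO handle VALUES (1, '+15550100'), (2, 'friend@example.com');
    INSERT INTO message VALUES
        (1, 'hello', 1, 0),
        (2, 'hi back', 1, 1),
        (3, 'lunch?', 2, 0);
""")

# Assumption: message.handle_id references handle.ROWID.
JOIN_SQL = """
    SELECT message.text, message.is_sent, handle.id AS contact
    FROM message
    LEFT JOIN handle ON message.handle_id = handle.ROWID
    ORDER BY message.ROWID
"""
rows = conn.execute(JOIN_SQL).fetchall()
for text, is_sent, contact in rows:
    print(contact, "sent" if is_sent else "received", text)
```

To try it against a real backup, swap ":memory:" for the path to your copied file and drop the executescript call.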

Continuing on, the message table has lots of columns in it, and I chose to select from the following:

['guid', 'service', 'text', 'date', 'date_delivered', 
'handle_id', 'type', 'is_read','is_sent', 'is_delivered',
'item_type', 'group_title']

The key field is “text,” which stores the content of each message, emojis included! (A cool thing is that your emojis will show up if you try to plot them in something like an iPython notebook. You could run an entire analysis on emoji usage…)

My analysis, however, ultimately breaks down into two pieces:

  1. Analyzing the content of the “text” field (excluding emojis).
  2. Analyzing the messages themselves (for example, total text messages, or what I sent vs. what I received).

For #1, I wrote code that:

  • Classifies all words and assigns a part of speech to them, then checks the counts of each part of speech.
    • You should get a table looking like this.

  • Counts the number of times each word appears in the dataset, and gives an overview of the dataset:
    • total_words_filtered
  • Excludes boring words, like prepositions, and words that are shorter than 2 characters.
  • Classifies all words as is_bad=1 or 0. I did this by using a .txt file full of bad words, found here:
  • Plots usage of bad words
    • I’d love to show you my plot, but let’s just assume I never swear…
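The counting and flagging steps above can be sketched with the standard library alone. This is not the code from the repo: the sample messages, the boring_words set, and the bad_words set are hypothetical stand-ins for the “text” column, a stop-word list, and the bad-words .txt file. The part-of-speech step is left out here because it needs NLTK and its downloaded tagger data.

```python
import re
from collections import Counter

# Hypothetical stand-ins: real input would be the "text" column and the bad-words file.
messages = ["Running late, damn traffic", "See you at the bar", "damn it is cold"]
boring_words = {"at", "the", "it", "is", "you"}
bad_words = {"damn"}

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (drops emojis and punctuation)."""
    return re.findall(r"[a-z']+", text.lower())

words = [w for m in messages for w in tokenize(m)]
# Drop boring words and anything shorter than 2 characters.
filtered = [w for w in words if len(w) >= 2 and w not in boring_words]
word_counts = Counter(filtered)

# is_bad is 1 when the word is in the bad-word list, else 0.
is_bad = {w: int(w in bad_words) for w in word_counts}
total_bad = sum(count for w, count in word_counts.items() if is_bad[w])

print(word_counts.most_common(3))
print("bad word uses:", total_bad)
```

With a DataFrame, the same idea maps onto a column of words, a value_counts call, and an isin check for the is_bad flag.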

For #2, the code allows you to:

  • Plot the number of text messages received each day (check out the spike on your birthday or during holidays). You can see my data below has a huge gap; that’s when my phone was replaced and not backed up for many months. My timestamp conversions are also apparently incorrect, but I haven’t looked into it.
    • The timestamp conversion is off, so someone can fix that... we're not in 2016, yet... Are we??

  • Count the number of sent versus received messages.
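On the timestamp problem: a possible culprit, though I can’t confirm it from the plot alone, is the epoch. As far as I know, the “date” column counts from 2001-01-01 UTC rather than the Unix epoch of 1970, and newer iOS versions appear to store nanoseconds instead of seconds. The sketch below works under those assumptions; the sample rows are made up, and the nanosecond cutoff is a heuristic of mine, not something documented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: message.date counts from this epoch, not from 1970-01-01.
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def apple_to_datetime(raw):
    """Convert a raw message.date value to a UTC datetime.

    Values above 10**11 are treated as nanoseconds (heuristic), otherwise as seconds.
    """
    seconds = raw / 10**9 if raw > 10**11 else raw
    return APPLE_EPOCH + timedelta(seconds=seconds)

# Hypothetical (date, is_sent) pairs standing in for rows from the message table.
rows = [(0, 1), (86400, 0), (86400 + 3600, 0), (442800000, 1)]

# Messages per calendar day (UTC), ready to plot.
per_day = Counter(apple_to_datetime(d).date().isoformat() for d, _ in rows)

# Sent versus received, using the is_sent flag.
sent = sum(1 for _, is_sent in rows if is_sent)
received = len(rows) - sent

print(sorted(per_day.items()))
print("sent:", sent, "received:", received)
```

If your plot lands in the wrong year, checking which epoch and unit your conversion assumes is a cheap first step.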

Anyway, I hope you can get some use out of this, and instead of blabbing on about the code here, I’ll just let you read it and use it on your own. Please check out my git repo, and please reach out to me with questions, comments, etc.

Coding, How-To

How to Create Geo HeatMaps with Pandas Dataframes and Google Maps JavaScript API V3

Get excited because we’re going to make a heatmap with Python Pandas and Google Maps JavaScript API V3. I’m assuming the audience has plenty of previous knowledge in Python, Pandas, and some HTML/CSS/JavaScript. Let’s begin with the DataFrame.

The DataFrame

First, you’re going to need a dataframe of “addresses” (can be a physical address, or even just a country name, like USA) that you eventually want to plot. (For the sake of simplicity, I’ll try to refer to the “address” as the “geo” for the rest of this document.) Second, since you are planning on using a heatmap, you’re going to want some sort of number that represents the weighted value of that row in comparison to other rows.

Let’s say your DataFrame looked like this:

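# count distinct pink_kitten values per country, largest first.
# (DataFrame.sort was removed from pandas; sort_values is its replacement.)
grouped_country_df = main_df.groupby('country')\
                            .agg({'pink_kitten': lambda x: len(x.unique())})\
                            .sort_values('pink_kitten', ascending=False)
print(grouped_country_df)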
geo_name          count_of_pink_kittens
USA               3430
Spain             577
United Kingdom    352
Israel            292
Austria           196
Argentina         151
India             133
Singapore         66

Now you have a list of geos and some values to use as the weight when later creating the heatmap. But to plot these points, you’re going to need some lat and long coordinates.

Getting Lat Long Coordinates from Google Maps API

If you have a list of geos or “addresses,” you can use Geocoding to convert those geos into lat/long coordinates. From Google: “Geocoding is the process of converting addresses (like “1600 Amphitheatre Parkway, Mountain View, CA”) into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which you can use to place markers on a map, or position the map.”

To use this Google Maps service, you need to have a Google Maps API key. To get a key, you can follow the directions here. When you sign up for an API key, you should select “Server Side Key,” since we will be running a Python script server-side to access the Google Maps API.

Once you have your api_key, you can work on getting geocoded results for all of your geos. You can do this with the following code:

import requests

# set your google maps api key here.
google_maps_api_key = ''
geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json'

# get the list of countries from our DataFrame.
countries = grouped_country_df.index
for country in countries:
    # make request to google_maps api and store as json. pass in the geo name as the
    # "address" query string parameter; requests URL-encodes values such as
    # "United Kingdom" for us.
    params = {'address': country, 'key': google_maps_api_key}
    r = requests.get(geocode_url, params=params).json()

    # Get lat and long from response. "location" contains the geocoded lat/long value.
    # For normal address lookups, this field is typically the most important.
    # https://developers.google.com/maps/documentation/geocoding/#JSON

    lat = r['results'][0]['geometry']['location']['lat']
    lng = r['results'][0]['geometry']['location']['lng']
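One caveat with the loop above: indexing r['results'][0] directly raises an error whenever a geo can’t be geocoded. A more defensive version might check the response’s “status” field first. The helper name extract_lat_lng is my own, and the two sample responses are hand-written in the shape of the JSON format, not real API output.

```python
def extract_lat_lng(response):
    """Return (lat, lng) from a geocoding JSON response, or None if nothing matched."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written sample responses (hypothetical, not fetched from the API).
ok_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 40.463667, "lng": -3.74922}}}],
}
empty_response = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(ok_response))
print(extract_lat_lng(empty_response))
```

Inside the loop, you could call extract_lat_lng(r) and skip the country with continue when it returns None.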

This only gets you so far, since you still need to do something with those latitude and longitude coordinates. We have a few options here:

  1. If you are building a web application, you can pass those values into an HTML template as variables and they will end up getting plotted via JavaScript.
  2. We can print out the JavaScript in the format we need, and later paste it into our HTML file within script tags.
  3. Other approaches that I’m not going to talk about.

For the sake of time, I’m going to show #2, which lends itself to a one-off analysis. You’d probably want to go with some dynamic templating approach, like #1, if you are going to pull and plot the same data repeatedly.
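For a rough idea of what a templated version of #1 could look like without a web framework, here is a minimal sketch using Python’s built-in string.Template. The points list, the to_heatmap_js helper, and the $heatmap_data placeholder are all hypothetical names for illustration, and the template is cut down to a single JavaScript statement rather than a full page.

```python
from string import Template

# Hypothetical (lat, lng, weight) rows; in practice these come from the geocoding loop.
points = [(37.09024, -95.712891, 3430), (40.463667, -3.74922, 577)]

def to_heatmap_js(rows):
    """Render rows as the JavaScript object literals used in a heatmapData array."""
    return ",\n".join(
        "{location: new google.maps.LatLng(%s, %s), weight: %s}" % row
        for row in rows
    )

# A cut-down template; $heatmap_data marks where the generated lines go.
page = Template("var heatmapData = [\n$heatmap_data\n];")
rendered = page.substitute(heatmap_data=to_heatmap_js(points))
print(rendered)
```

The same substitute call works on a full HTML file read from disk, as long as the file contains no other dollar signs (or they are escaped as $$).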

Add the following code to your for-loop from above, right underneath

lng = r['results'][0]['geometry']['location']['lng']

# set the country weight for later by getting the value for each index in the dataframe
# as it loops through. (.ix was removed from pandas; .loc does the label lookup.)
country_weight = int(grouped_country_df.loc[country, 'pink_kitten'])

# print out the JavaScript that we will be copy-pasting into our HTML file
print('{location: new google.maps.LatLng(%s, %s), weight: %s},' % (lat, lng, country_weight))

After running your script, copy the output, which should look like this:

{location: new google.maps.LatLng(37.09024, -95.712891), weight: 3430},
{location: new google.maps.LatLng(40.463667, -3.74922), weight: 577},
{location: new google.maps.LatLng(55.378051, -3.435973), weight: 352},
{location: new google.maps.LatLng(31.046051, 34.851612), weight: 292},
{location: new google.maps.LatLng(47.516231, 14.550072), weight: 196},
{location: new google.maps.LatLng(-38.416097, -63.616672), weight: 151},
{location: new google.maps.LatLng(20.593684, 78.96288), weight: 133},
{location: new google.maps.LatLng(1.352083, 103.819836), weight: 66},

You’re going to use these values in the next step.

Creating an HTML File that Contains JavaScript for Plotting Your Lat Long Points

You need to create an HTML file that contains some script tags within it. I am simply going to paste my code below with annotations. If you copy the location strings from above, you will be able to paste them directly into this HTML file under the “heatmapData” array (defined below in the code).

<!DOCTYPE html>
<html>
  <head>
    <title>Simple Map</title>
    <meta name="viewport" content="initial-scale=1.0, user-scalable=no">
    <meta charset="utf-8">
    <style>
      html, body, #map-canvas {
        height: 100%;
        margin: 0px;
        padding: 0px
      }
    </style>
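    <!-- Load Google Maps API. The visualization library is needed for HeatmapLayer. -->
    <!-- YOUR_API_KEY is a placeholder; substitute your own key. -->
    <script src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&libraries=visualization"></script>
    <script>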
    function initialize() {
      var heatmapData = [
        {location: new google.maps.LatLng(37.09024, -95.712891), weight: 3430},
        {location: new google.maps.LatLng(40.463667, -3.74922), weight: 577},
        {location: new google.maps.LatLng(55.378051, -3.435973), weight: 352},
        {location: new google.maps.LatLng(31.046051, 34.851612), weight: 292},
        {location: new google.maps.LatLng(47.516231, 14.550072), weight: 196},
        {location: new google.maps.LatLng(-38.416097, -63.616672), weight: 151},
        {location: new google.maps.LatLng(20.593684, 78.96288), weight: 133},
        {location: new google.maps.LatLng(1.352083, 103.819836), weight: 66},
      ];
       
      // Add some custom styles to your google map. This can be a pain. 
        // http://gmaps-samples-v3.googlecode.com/svn/trunk/styledmaps/wizard/index.html
      var styles = [ 
        {
          "featureType": "administrative",
          "stylers": [
            { "visibility": "off" }
          ]
        },
        {
          "featureType": "road",
          "stylers": [
            { "visibility": "off"}
          ]
        },
        {
          "featureType": "landscape",
          "elementType": "geometry.fill",
          "stylers": [
            { "color": "#ffffff" },
            { "visibility": "on" }
          ]
        },
      ];
      // create a point on the map for the Atlantic Ocean, 
      // which will later be used for centering the map.
      var atlanticOcean = new google.maps.LatLng(24.7674044, -38.2680446);
      // Create the styled map object.
      var styledMap = new google.maps.StyledMapType(styles, {name:"Styled Map"});
      // create the base map object. put it in the map-canvas id, defined in HTML below.
      var map = new google.maps.Map(document.getElementById('map-canvas'), {
        center: atlanticOcean, // set the starting center point as the atlantic ocean
        zoom: 3, // set the starting zoom 
        mapTypeControlOptions: {
          mapTypeIds: [ google.maps.MapTypeId.ROADMAP, 'map_style'] // give the map a type.
        }, 
      });
       
      // Create the heatmap object.
      var heatmap = new google.maps.visualization.HeatmapLayer({
        data: heatmapData, // pass in your heatmap data to plot in this layer.
        opacity: 1, 
        dissipating: false, // on zoom, do you want dissipation?
      });
      heatmap.setMap(map); // apply the heatmap to the base map object.
      map.mapTypes.set('map_style', styledMap); // apply the styles to your base map.
      map.setMapTypeId('map_style'); 
       
      // Add a custom Legend to Your Map
        // https://developers.google.com/maps/tutorials/customizing/adding-a-legend
      var legend = document.getElementById('legend');
      map.controls[google.maps.ControlPosition.RIGHT_BOTTOM].push(legend);
       
      // This is hard-coded for the countries I knew existed in the set.
      var country_list = ['USA','Spain','United_Kingdom','Israel',
                          'Austria','Argentina','India','Singapore'];
       
      // for each country in the country list, append it to the Legend div.

      for (var i = 0; i < country_list.length; i++) {
          var div = document.createElement('div');
          div.innerHTML = '<p>' + country_list[i] + '</p>';
          legend.appendChild(div);
      }
    }

     // run initialize once the page has loaded (standard DOM API instead of addDomListener).
     window.addEventListener('load', initialize);

</script>
</head>

<body>
    <div id="legend" style="background-color:grey;padding:10px;">
    <strong>Countries Mapped</strong>
    </div>

    <div id="map-canvas"></div>
</body>
</html>

Open the HTML file in your browser, and you should see something like this.

[Image: Google Maps heatmap]

Et voilà!
