This tutorial gives you a step-by-step guide to scraping Instagram data around a hashtag that you define yourself. It uses a Python script and Jupyter Notebook to achieve this goal. At the end of the tutorial you will have a downloaded collection of Instagram pictures and videos, displayed in an HTML file, as well as a data frame containing all the hashtags used around the search query that defines your sample.
This guide is a summary of a workshop that I took at the 2017 DMRC Summer School. The workshop was designed and led by Patrik Wikström, who showed us (and now you) how to collect data from Instagram using web browser automation. This lets you do Instagram research without needing access to the increasingly restricted Instagram API.
All credit for the instagrab script, the knowledge and the hard work here goes to Patrik. All mistakes, however, are my own. If you encounter any problems or want to provide feedback, workarounds or additions, you can contact me.
For more information:
- Read more about web scraping here: https://en.wikipedia.org/wiki/Web_scraping
- You can read more about Jupyter here: http://jupyter.org/ and you can read more about Python here: https://www.python.org/
- Patrik's instagrab script is one of many digital methods tools developed at the Digital Media Research Centre at Brisbane's Queensland University of Technology. You can find more information here: https://github.com/qut-dmrc and here: https://www.qut.edu.au/research/our-research/institutes-centres-and-research-groups/digital-media-research-centre
- Read about GeckoDriver here: http://toolsqa.com/selenium-webdriver/how-to-use-geckodriver/
- And about Selenium for Python here: http://selenium-python.readthedocs.io/index.html
1. Setting up your environment
First you need to gather all the necessary files and install the software to run Patrik's script:
- Update/download Firefox
- Download Python 3.x and Jupyter Notebook. I recommend that you download Anaconda 4.3.0 (https://www.continuum.io/downloads), which includes both Python and Jupyter Notebook
- Install Anaconda by executing the installation routine
- Create a folder on your computer for your project. (I named mine "instagrab" and placed it on my desktop, but you can choose something else)
- Download the latest pack of Patrik Wikström's scripts here: https://github.com/qut-dmrc/build-a-bot
- Copy the instagrab script and the file called "default-config.json" into your project folder. (You can also move/copy all the files, but you will only need these two.)
- Download the geckodriver here: https://github.com/mozilla/geckodriver/releases/tag/v0.14.0. Pick the one appropriate for your system and extract it. Move the file "geckodriver" to your project folder
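At this point your project folder should contain at least the notebook, the config file and the geckodriver binary. If you like, you can check this from Python; here is a small sketch (the helper function and the folder path are just for illustration):

```python
import os

def missing_files(project_folder, expected):
    """Return the expected file names that are not present in the folder."""
    return [name for name in expected
            if not os.path.exists(os.path.join(project_folder, name))]

# File names as described in the steps above
expected = ["instagrab.ipynb", "default-config.json", "geckodriver"]
print(missing_files("instagrab", expected))  # an empty list means everything is in place
```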
2. Install Selenium
- Open terminal (Mac) or the command prompt (PC)
- Install the browser automation package selenium by running
pip install selenium
3. Starting up Jupyter Notebook and loading the script
Navigate to your project folder in the terminal/command prompt.
Launch Jupyter Notebook by typing
jupyter notebook
A new tab should open in your browser running Jupyter Notebook. It should look something like this:
Load the instagrab script (instagrab.ipynb) by clicking on the file. The content should look something like this:
The Python script is now loaded in Jupyter Notebook, where it can be executed to extract content from Instagram. However, you first need to edit the script a little in order to access and scrape the content that you want to research.
4. Inputting your Project Details
First you need to be sure about what content you would like to access and scrape. The instagrab script uses a hashtag to collect the content. Patrik seems to be interested in fishing culture, which is why he programmed the script to look for the hashtag "catchoftheday". You will most likely look for something else, so you need to tell the script to look for another hashtag. I am interested in brands and music, so I will tell my script to look for posts related to Absolut Vodka at the Lollapalooza Festival using the hashtag #absolutlolla. To set up the script for your purposes:
- Replace the content of line 2 behind the_item with the hashtag that you want to research. It goes between the quotation marks without the #
- Set the number of photos you want to collect: replace the content of line 2 behind "max_scrape" with the number of photos you want to collect. Remember that Instagram will not let you collect all the posts at once, or it will detect that you are using automation on its site, so you may have to go back several times to collect the desired amount. 150 posts is a number that seems to work well.
Your code should now look something like this:
the_item = "absolutlolla"
max_scrape = 150
insta_path = "explore/tags/"
prefix = "photos"
5. Execute the script
Now we need to execute the script step by step. You do this by selecting a code block by clicking on it (e.g. click on "In [1]") and then clicking the button that looks like the play button of a music player.
Clicking the 'play button' tells Jupyter Notebook to execute the selected code block in Python and jump on to the next one. So after executing the first block (which in this script actually does nothing, as it is 'just' a comment telling you who programmed the script), Jupyter will jump to block number two. You can execute this one as well. Now the variables you defined are set in the environment.
Execute all the blocks until Jupyter jumps to "Set up the web driver", just after line 8 (In [8]). Your script is now loaded and prepared, and it will have created some additional folders in your project folder where the collected content and data will be stored. (You can check on this by looking at your folder again.)
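The folder creation the script does at this point can be sketched with a few lines of standard-library Python. The folder names below are my assumptions for illustration; check your project folder to see what the script actually created:

```python
import os

# Create output folders if they do not exist yet (names are assumed examples)
for folder in ("photos", "data"):
    os.makedirs(folder, exist_ok=True)

# exist_ok=True means re-running this is harmless
print(sorted(f for f in ("photos", "data") if os.path.isdir(f)))
```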
Now it's time to start the browser and open the Instagram webpage. All of this is done by the script.
Execute all lines/code blocks under "Set up the web driver". You will notice that a new instance of the Firefox browser opens and automatically navigates to your Instagram hashtag.
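Under the hood, the browser is pointed at a URL assembled from the values you set in the config block earlier. The exact assembly in Patrik's script may differ, but the idea is roughly:

```python
# Sketch: build the hashtag page URL from the config values
# (the concatenation shown here is an assumption about how the script works)
the_item = "absolutlolla"
insta_path = "explore/tags/"

url = "https://www.instagram.com/" + insta_path + the_item + "/"
print(url)  # https://www.instagram.com/explore/tags/absolutlolla/
```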
Now it gets a little tricky: sometimes Instagram requires us to log in before the script can hit the "Load More" button, which it needs to do before it can scroll down the page to get all the pictures. If that happens, log in to Instagram in the window that just opened using your credentials. Then, in order to get back to the hashtag query, once more execute the line where it says:
The script will now automate all interaction with the Instagram page to scrape the defined number of posts. If you execute all the code blocks under "Send key strokes to scroll down the page", the script will scroll down the page until it shows the number of posts you told it to scrape. Give the script time after executing each block; wait until the process has finished and it looks something like this.
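To get a feel for how much scrolling is involved: Instagram loads posts in batches, so the number of scroll actions grows with max_scrape. A back-of-the-envelope sketch (the batch size of 12 is my assumption, not something the script guarantees):

```python
import math

max_scrape = 150       # number of posts you want, as set in the config
posts_per_batch = 12   # assumed number of posts Instagram loads per scroll
scrolls_needed = math.ceil(max_scrape / posts_per_batch)
print(scrolls_needed)  # 13
```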
If no error occurred, you are now ready to scrape all the posts by executing the next set of commands. Some of these commands take up to a few minutes, e.g. to download the pictures or videos. This is indicated by a dotted progress bar under the block, so take it slow and let each process finish.
6. Look at the data
Once all the processes have finished, you can go ahead and execute the rest of the command blocks, and you're done! The script scrapes the data and creates some files for you that you can use for further research. These files are:
- An HTML file that presents a grid of all your scraped posts, displaying the metadata you collected
- s_cap.csv – a comma-separated data frame with all the hashtags used in the caption of a picture
- s_comm.csv – a comma-separated data frame with all the hashtags used in the comment section of a post
- s_combined.csv – you've guessed it: this data frame contains both the comment and the caption hashtags.
- s_labels.csv – an empty file, due to a feature not supported in this version of the script.
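The hashtag data frames are built by pulling the #tags out of caption and comment text. A minimal sketch of that kind of extraction, assuming a simple regular expression (Patrik's script may do this differently):

```python
import re

caption = "Great night at #absolutlolla! #lollapalooza #Chicago"

# Find all hashtags (word characters after '#') and lowercase them
hashtags = [tag.lower() for tag in re.findall(r"#(\w+)", caption)]
print(hashtags)  # ['absolutlolla', 'lollapalooza', 'chicago']
```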
You can now import the CSV files into analysis software such as Excel, LibreOffice or Gephi to work with the data.
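If you prefer to stay in Python, you can also read the CSV files back with the standard library. A small sketch; the column layout below is an assumption, so check the header of your actual file first:

```python
import csv
from io import StringIO

# Simulated contents of s_cap.csv; the real columns may differ
sample = StringIO("shortcode,hashtag\nBVabc123,absolutlolla\nBVabc123,lollapalooza\n")

rows = list(csv.DictReader(sample))          # one dict per data row
tags = [row["hashtag"] for row in rows]
print(tags)  # ['absolutlolla', 'lollapalooza']
```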
Step by Step: Collecting Instagram Data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.