World Cup players' hype on Instagram

Javier is visiting me for a couple of days here in Barcelona. This silly guy is trying to measure the relationship between players' social media presence and the expectations of qualifying from the group stage of the World Cup. It seems he is collecting the information by hand, going from page to page… As a benevolent friend, I wrote a classic scraping function to make his life easier (maybe it can make yours easier as well).

Packages

library(rvest)      # read_html(), html_nodes(), html_node()
library(tidyverse)  # pipes, stringr, purrr, readr, ggplot2
library(ggExtra)
library(plotly)     # ggplotly()

Code

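getIG <- function(x){
    # Build the public profile URL for the username x
    s1 <- paste0("https://www.instagram.com/", x, "/?hl=en")
    # Take the first <script> inside <body> as text and strip the double
    # quotes so the patterns below stay simple
    s2 <- s1 %>%
        read_html() %>%
        html_nodes("body") %>%
        html_node("script") %>%
        as.character() %>%
        stringr::str_remove_all('[\"]')
    # Number of accounts that follow this profile
    followers <- s2 %>%
        stringr::str_extract_all("(?<=edge_followed_by:[{]count:)[0-9]+", simplify = TRUE) %>%
        as.numeric()
    # Everything between "username:" and ",connected_fb_page"
    username <- s2 %>%
        stringr::str_extract_all("(?<=username:)(.*)(?=,connected_fb_page)", simplify = TRUE)
    # Number of accounts this profile follows
    following <- s2 %>%
        stringr::str_extract_all("(?<=edge_follow:[{]count:)[0-9]+", simplify = TRUE) %>%
        as.numeric()
    # One row per profile
    s3 <- data.frame(username, followers, following, stringsAsFactors = FALSE) %>%
        as_tibble()
    return(s3)
}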
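
If you are curious how the lookaround patterns inside getIG pick out each value, here is a small offline sketch. The sample string and its numbers are made up: it is only my assumption of what the cleaned script text roughly looks like, not real Instagram output. I also use str_extract (first match only) instead of str_extract_all to keep it short.

```r
library(stringr)

# Hypothetical stand-in for the cleaned script text (double quotes already
# removed); the values are invented for illustration
sample_txt <- "edge_followed_by:{count:1234},edge_follow:{count:56},username:someplayer,connected_fb_page:null"

# (?<=...) is a lookbehind: the prefix must be there but is not returned
followers_demo <- as.numeric(str_extract(sample_txt, "(?<=edge_followed_by:[{]count:)[0-9]+"))
following_demo <- as.numeric(str_extract(sample_txt, "(?<=edge_follow:[{]count:)[0-9]+"))

# (?=...) is a lookahead: the match stops right before ",connected_fb_page"
username_demo <- str_extract(sample_txt, "(?<=username:)(.*)(?=,connected_fb_page)")

print(username_demo)
print(c(followers_demo, following_demo))
```

Note that the edge_follow pattern does not pick up the edge_followed_by count, because the lookbehind needs the literal text edge_follow:{count: right before the digits.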

Testing

# Let's try it on three of the most famous players

purrr::map(c("cristiano", "leomessi", "neymarjr"), getIG) %>%
   bind_rows()
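
When looping over many usernames, a single failed request would stop the whole map. One possible way around it is purrr::possibly, sketched below. To keep the example runnable offline I use fake_getIG, a hypothetical stand-in for getIG that fails on purpose for one made-up username.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for getIG() so the sketch runs without a connection:
# it raises an error for one username, as a real request might
fake_getIG <- function(x) {
    if (x == "missing_user") stop("profile could not be read")
    tibble(username = x, followers = NA_real_, following = NA_real_)
}

# possibly() wraps the function so an error returns NULL instead of stopping
safe_getIG <- possibly(fake_getIG, otherwise = NULL)

res <- map(c("cristiano", "missing_user", "leomessi"), safe_getIG) %>%
    bind_rows()   # NULL entries are dropped when the rows are bound

print(res)
```

With the real function you would wrap getIG the same way; the usernames that failed are simply absent from the result, so it is worth comparing against the input list afterwards.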

Graph

dbIG <- readr::read_csv2("data/IG.csv")

g1 <- dbIG %>%
   ggplot(aes(followers_20180620, following_20180620, label=username))+
   # Log scales on both axes
   scale_x_log10()+
   scale_y_log10()+
   theme_minimal()+
   labs(title="World Cup players: Followers vs following on Instagram",
        y="Following (log)", x="Followers (log)")+
   # Filled 2D density contours; after_stat(level) is the contour level computed by the stat
   stat_density_2d(aes(fill=after_stat(level),alpha=after_stat(level)),geom='polygon',colour='black')+
   geom_point(size=0.7)+
   scale_fill_continuous(low="green",high="red")+
   guides(alpha="none")

plotly::ggplotly(g1)

Obryan Poyser Calderón
Senior Data Scientist

My areas of expertise include Time Series Forecasting and Inference, Machine Learning, and Econometrics.