Why do we need to scrape data?

Data visualization is a powerful tool. Data is one of the most effective ways to make an argument, communicate an issue, and educate an audience. However, data is only useful when delivered in a way that makes it accessible to those unfamiliar with the material.

To prepare data for visualization, it needs to be organized and stored in a logical way. Data collection is often done through API calls. In that case, the data already arrives in a defined structure, ready to be siphoned into some form of storage, whether that's a SQL database or a JSON object. But the majority of data available to us isn't exposed through an API, and can't be collected so easily.
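To see why API data is the easy case, here is a minimal sketch in Ruby. The response body and its field names (`cities`, `name`) are hypothetical, standing in for whatever a real API would return; the point is that the structure is already defined, so extracting values is trivial:

```ruby
require 'json'

# A hypothetical API response body. Real APIs return structured
# payloads like this, so no pattern-finding is needed.
response_body = '{"cities": [{"name": "Boston", "population": 675647}]}'

# Parse the JSON into Ruby hashes and arrays, then pull out the
# field we care about.
data = JSON.parse(response_body)
city_names = data["cities"].map { |city| city["name"] }
```

With scraped data, by contrast, this structure has to be discovered and imposed by us before any of this becomes possible.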

Data scraping is the process of identifying a pattern in seemingly unorganized data and structuring it in a way that makes it collectable for storage.

From there, what you decide to do with the data is limitless.

For this workshop, we’ll be using Nokogiri to parse HTML from a website, and identify patterns of data storage within HTML elements.

Key takeaways:

  • All data that is accessible is consumable
  • Consuming data allows us to format it in a way that is digestible for broader audiences
  • Patterns can almost always be identified in unorganized data