What is Web Data Extraction | Web Scrub - Advanced Data Structuring Solutions

Introduction

Begin your journey into web scraping, web crawling, and data extraction, and explore popular tools to help you start building your own web scrapers.

You might have come across various terms like web scraping, crawling, data extraction, or mining. While these terms are often used interchangeably, let’s clarify the definitions we'll use throughout this beginner-friendly course on web scraping.

What is Web Data Extraction?

Web data extraction, also known as data collection, is the process of gathering specific information from web pages, for example a product’s name and price from an Amazon product page. Web pages typically present information in an unstructured or loosely structured format (like HTML documents or raw API responses) that is designed for display rather than analysis, so web data extraction converts that content into a structured format, such as records with named fields, that is easier to analyze or integrate into business systems. The extracted data can come from a variety of sources, including HTML files, APIs, images, PDFs, and more.
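To make the idea concrete, here is a minimal sketch of extraction using only Python's standard library. The sample markup and the class names product-name and product-price are assumptions made up for illustration; a real product page will use different markup, and most scrapers rely on a dedicated parsing library rather than html.parser.

```python
from html.parser import HTMLParser

# Hypothetical product page markup; the class names are assumptions for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="product">
    <h1 class="product-name">Wireless Mouse</h1>
    <span class="product-price">$24.99</span>
  </div>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text inside elements whose class maps to a known field."""

    # Maps an (assumed) CSS class to the field name in the structured record.
    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        # Remember which field, if any, the element we just entered belongs to.
        css_class = dict(attrs).get("class")
        self._current_field = self.FIELDS.get(css_class)

    def handle_data(self, data):
        # Store non-empty text only while inside a tracked element.
        if self._current_field and data.strip():
            self.record[self._current_field] = data.strip()

    def handle_endtag(self, tag):
        self._current_field = None


def extract_product(html):
    """Turn loosely structured HTML into a structured dict."""
    parser = ProductParser()
    parser.feed(html)
    return parser.record


product = extract_product(SAMPLE_HTML)
print(product)
```

The result is a plain dictionary with name and price keys, which is the kind of structured record that can be written to a CSV file or a database.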

What is Crawling?

Crawling (often referred to as "spidering" 🕷) is the process of navigating from one webpage to another in search of relevant data. While web data extraction targets a specific webpage, web crawling is all about exploring multiple pages across a website to gather information. This can happen simultaneously with extraction, where a tool navigates and collects data from each page it encounters, or separately, where one tool identifies relevant URLs while another handles the actual data extraction. Crawling’s main purpose is to compile a list of URLs or links that guide the extraction process.
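The following sketch shows one way crawling can work, again with only Python's standard library. To keep it runnable without network access, the three-page site in PAGES and the fetch helper are hypothetical stand-ins for real HTTP requests; the breadth-first strategy and the same-host rule are illustrative choices, not the only way to crawl.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical three-page site held in memory so the sketch runs offline.
PAGES = {
    "https://example.com/": '<a href="/products">Products</a> <a href="/about">About</a>',
    "https://example.com/products": '<a href="/">Home</a> <a href="https://other.example/x">Partner</a>',
    "https://example.com/about": '<a href="/products">Products</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    # Stand-in for an HTTP request; unknown URLs return an empty page.
    return PAGES.get(url, "")


def crawl(start_url):
    """Breadth-first walk that returns same-host URLs in visit order."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            # Stay on the starting host and skip URLs already queued.
            if urlparse(absolute).netloc == start_host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


urls = crawl("https://example.com/")
print(urls)
```

The returned list of URLs is what an extraction step would then work through, either inside the same loop or in a separate tool.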

What is Web Scraping?

Web scraping is a broader term encompassing both web data extraction and crawling, as well as any other techniques used to convert unstructured data from websites into structured formats that can be used for analysis or integration into systems. As you progress through more advanced lessons, you’ll discover that web scraping involves more than just working with HTML and URLs; various other techniques and complexities come into play.

What’s Next?

In the upcoming lesson, we’ll dive into the basic elements of every web page, including HTML, CSS, and JavaScript, laying the foundation for understanding how web pages are structured and how you can begin extracting data.

Go to Basic Data Extraction.
