In the realm of programming languages, Go (Golang) stands out for its simplicity, performance, and robust concurrency model. Whether you're developing microservices, cloud-native applications, or system-level software, Go provides the tools you need to build efficient and scalable solutions. In this blog post, we’ll dive deep into building a Concurrent Web Crawler using Go, exploring advanced features such as goroutines, channels, worker pools, rate limiting, and error handling. By the end of this post, you'll have a comprehensive understanding of how to leverage Go's capabilities to create a powerful web crawler suitable for real-world applications.
Introduction to Web Crawlers
A web crawler (also known as a spider or bot) is a program that systematically browses the World Wide Web, typically for the purpose of web indexing (e.g., search engines like Google). Crawlers traverse the web by following hyperlinks from one page to another, fetching and processing data as they go.
Building an efficient web crawler involves handling multiple tasks concurrently, managing resources effectively, and ensuring that the crawler behaves politely by respecting the target server's rate limits. Go's concurrency model makes it an excellent choice for such a task.
Project Overview
Our goal is to build a concurrent web crawler in Go with the following features:
Concurrency: Efficiently handle multiple page fetches simultaneously.
Worker Pool: Limit the number of concurrent workers to prevent resource exhaustion.
Rate Limiting: Respect target servers by limiting the rate of requests.
Error Handling: Gracefully handle errors during fetching and parsing.
Data Parsing: Extract specific information from fetched web pages.
Data Storage: Store the extracted data in a structured format.
Setting Up the Project
Before diving into the code, let's set up our Go project structure.
Initialize the Project:
mkdir go-web-crawler
cd go-web-crawler
go mod init go-web-crawler
Directory Structure:
go-web-crawler/
├── main.go
├── crawler/
│ ├── crawler.go
│ ├── worker.go
│ └── rate_limiter.go
├── parser/
│ └── parser.go
└── storage/
└── storage.go
Dependencies:
go get github.com/PuerkitoBio/goquery
go get golang.org/x/time/rate
Implementing Concurrency with Goroutines and Channels
Key Concepts:
Goroutines: Lightweight threads managed by the Go runtime; launching one is as simple as prefixing a function call with the go keyword.
Channels: Typed conduits through which goroutines send and receive values, providing both communication and synchronization.
Example: Basic Goroutine and Channel Usage
package main
import (
"fmt"
"time"
)
func worker(id int, jobs <-chan string, results chan<- string) {
for url := range jobs {
// Simulate work
time.Sleep(time.Second)
results <- fmt.Sprintf("Worker %d processed %s", id, url)
}
}
func main() {
jobs := make(chan string, 10)
results := make(chan string, 10)
// Start 3 workers
for w := 1; w <= 3; w++ {
go worker(w, jobs, results)
}
// Send 5 jobs
urls := []string{
"https://example.com",
"https://golang.org",
"https://github.com",
"https://stackoverflow.com",
"https://medium.com",
}
// Send the jobs, then close the channel to signal that no more work is coming
for _, url := range urls {
jobs <- url
}
close(jobs)
// Collect and print the results
for i := 0; i < len(urls); i++ {
fmt.Println(<-results)
}
}
In this example, we create two channels: jobs for sending URLs to workers and results for receiving processed results.
We launch three worker goroutines, each waiting for URLs to process.
We send five URLs into the jobs channel and close it to indicate no more jobs will be sent.
Finally, we collect and print the results.
This simple example demonstrates how goroutines and channels can be used to process tasks concurrently.
Creating a Worker Pool
Implementing the Worker Pool:
// crawler/crawler.go
package crawler
import (
"go-web-crawler/storage"
)
type Crawler struct {
Jobs chan string
Results chan storage.PageData
WorkerCount int
RateLimiter *RateLimiter
Visited map[string]bool
}
func NewCrawler(workerCount int, rateLimiter *RateLimiter) *Crawler {
return &Crawler{
Jobs: make(chan string, 100),
Results: make(chan storage.PageData, 100),
WorkerCount: workerCount,
RateLimiter: rateLimiter,
Visited: make(map[string]bool),
}
}
func (c *Crawler) Start() {
for i := 0; i < c.WorkerCount; i++ {
go worker(i, c.Jobs, c.Results, c.RateLimiter, c)
}
}
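Start launches the workers but gives the caller no way to know when they have all finished, which is also why main.go below has to guess when to stop collecting results. As a hedged sketch, a variant that tracks workers with a sync.WaitGroup might look like this (StartTracked is a name I'm introducing, not part of the original code, and it assumes "sync" is added to crawler.go's imports):

// StartTracked records each worker in a WaitGroup so the caller knows
// when every worker goroutine has exited.
func (c *Crawler) StartTracked(wg *sync.WaitGroup) {
	for i := 0; i < c.WorkerCount; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			worker(id, c.Jobs, c.Results, c.RateLimiter, c)
		}(i)
	}
}

With this in place, main could close(c.Jobs), call wg.Wait(), and only then close(c.Results), instead of relying on timers.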
Implement the Worker:
// crawler/worker.go
package crawler
import (
"log"
"net/http"

"go-web-crawler/parser"
"go-web-crawler/storage"
)
func worker(id int, jobs <-chan string, results chan<- storage.PageData, rateLimiter *RateLimiter, crawler *Crawler) {
for url := range jobs {
// Skip URLs that have already been visited; otherwise mark this one as visited
if !crawlerMarkVisited(crawler, url) {
continue
}
// Rate limit
rateLimiter.Wait()
// Fetch the URL
resp, err := http.Get(url)
if err != nil {
log.Printf("Worker %d: Error fetching %s: %v\n", id, url, err)
continue
}
// Parse the page
data, links, err := parser.Parse(resp)
if err != nil {
log.Printf("Worker %d: Error parsing %s: %v\n", id, url, err)
resp.Body.Close()
continue
}
resp.Body.Close()
// Record the source URL on the result and send it
data.URL = url
results <- data
// Enqueue new links
for _, link := range links {
crawlerEnqueue(crawler, link)
}
}
}
func crawlerMarkVisited(crawler *Crawler, url string) bool {
// Reports whether the URL was new and marks it as visited.
// Plain map access is not concurrency-safe; for production, guard it with a
// mutex or use sync.Map (see Enhancements and Best Practices below).
if crawler.Visited[url] {
return false
}
crawler.Visited[url] = true
return true
}
func crawlerEnqueue(crawler *Crawler, url string) {
// Enqueue only if not visited. Note: this send blocks if the Jobs buffer is full,
// and panics if Jobs has already been closed; a production crawler would use a
// separate frontier queue or a select with a default case.
if !crawler.Visited[url] {
crawler.Jobs <- url
}
}
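One caveat: http.Get uses http.DefaultClient, which has no timeout, so a single slow server can stall a worker indefinitely. A minimal sketch of a shared client with an explicit timeout (the 10-second value is an arbitrary choice of mine):

// crawler/client.go (hypothetical helper file)
package crawler

import (
	"net/http"
	"time"
)

// httpClient is shared by all workers; requests are abandoned after 10 seconds.
var httpClient = &http.Client{Timeout: 10 * time.Second}

The worker would then call httpClient.Get(url) instead of http.Get(url).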
Implementing Rate Limiting
Implementing Rate Limiter:
// crawler/rate_limiter.go
package crawler
import (
"context"

"golang.org/x/time/rate"
)
type RateLimiter struct {
limiter *rate.Limiter
}
func NewRateLimiter(r rate.Limit, b int) *RateLimiter {
return &RateLimiter{
limiter: rate.NewLimiter(r, b),
}
}
func (rl *RateLimiter) Wait() {
rl.limiter.Wait(context.Background())
}
RateLimiter Struct: Wraps Go's rate.Limiter, a token-bucket limiter that refills r tokens per second up to a burst of b; Wait blocks until a token is available.
// main.go
rateLimiter := crawler.NewRateLimiter(2, 5) // 2 requests per second with a burst of 5
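To build intuition for those two parameters, here is a small standalone program (illustrative, not part of the crawler) showing how rate.NewLimiter(2, 5) spaces out calls: the first five Waits return immediately thanks to the burst, and later ones are released roughly every 500ms.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	limiter := rate.NewLimiter(2, 5) // 2 tokens per second, burst of 5
	start := time.Now()
	for i := 1; i <= 8; i++ {
		limiter.Wait(context.Background()) // blocks until a token is available
		fmt.Printf("request %d allowed after %v\n", i, time.Since(start).Round(100*time.Millisecond))
	}
}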
Parsing and Storing Data
Parsing the Web Page:
// parser/parser.go
package parser
import (
"fmt"
"net/http"
"net/url"
"strings"

"github.com/PuerkitoBio/goquery"

"go-web-crawler/storage"
)
func Parse(resp *http.Response) (storage.PageData, []string, error) {
var data storage.PageData
var links []string
if resp.StatusCode != 200 {
return data, links, fmt.Errorf("received non-200 response code: %d", resp.StatusCode)
}
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
return data, links, err
}
// Extract title
data.Title = doc.Find("title").Text()
// Extract meta description
data.Description, _ = doc.Find("meta[name='description']").Attr("content")
// Extract all links
doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
href, exists := s.Attr("href")
if exists {
link := normalizeURL(resp.Request.URL, href)
if link != "" {
links = append(links, link)
}
}
})
return data, links, nil
}
func normalizeURL(base *url.URL, href string) string {
u, err := url.Parse(strings.TrimSpace(href))
if err != nil {
return ""
}
return base.ResolveReference(u).String()
}
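To exercise the parser on its own, a quick throwaway program (a sketch assuming the module layout above and network access) could fetch a page and print what Parse extracts:

package main

import (
	"fmt"
	"log"
	"net/http"

	"go-web-crawler/parser"
)

func main() {
	resp, err := http.Get("https://golang.org")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	data, links, err := parser.Parse(resp)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Title: %q\nDescription: %q\nLinks found: %d\n", data.Title, data.Description, len(links))
}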
Data Storage:
// storage/storage.go
package storage
type PageData struct {
URL string
Title string
Description string
}
Putting It All Together
// main.go
package main
import (
"log"
"sync"
"time"

"go-web-crawler/crawler"
"go-web-crawler/storage"
)
func main() {
// Initialize rate limiter: 2 requests per second with a burst of 5
rateLimiter := crawler.NewRateLimiter(2, 5)
// Initialize crawler with 5 workers
c := crawler.NewCrawler(5, rateLimiter)
c.Start()
// Seed URLs
seedURLs := []string{
"https://golang.org",
"https://www.google.com",
}
// Enqueue seed URLs
for _, url := range seedURLs {
c.Jobs <- url
}
// Close the channels when done
go func() {
// Wait for some time or implement a more robust termination condition.
// For simplicity, we'll crawl for 30 seconds and then stop accepting jobs.
time.Sleep(30 * time.Second)
close(c.Jobs)
// Give in-flight workers a moment to finish, then close Results so the
// collector loop below can exit. A more robust approach is to track the
// workers with a sync.WaitGroup and close Results once they all return.
time.Sleep(5 * time.Second)
close(c.Results)
}()
// Collect results
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
for data := range c.Results {
// Store the data
storage.Save(data)
log.Printf("Crawled: %s - %s\n", data.URL, data.Title)
}
}()
wg.Wait()
}
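With all the pieces in place, you can run the crawler from the project root (the exact output depends on the pages fetched):

go run .

It will crawl for roughly half a minute, logging each page as it is stored in pages.json.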
Saving Data:
// storage/storage.go (continued)
package storage
import (
"encoding/json"
"log"
"os"
"sync"
)
var (
file *os.File
mutex sync.Mutex
)
func init() {
var err error
file, err = os.OpenFile("pages.json", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
if err != nil {
log.Fatalf("Failed to open file: %v", err)
}
}
func Save(data PageData) {
mutex.Lock()
defer mutex.Unlock()
jsonData, err := json.Marshal(data)
if err != nil {
log.Printf("Error marshaling data: %v", err)
return
}
_, err = file.WriteString(string(jsonData) + "\n")
if err != nil {
log.Printf("Error writing to file: %v", err)
}
}
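Because each record is marshaled to JSON and written on its own line, pages.json ends up in JSON Lines format. An illustrative entry (the values here are made up) looks like:

{"URL":"https://golang.org","Title":"The Go Programming Language","Description":"Go is an open source programming language..."}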
Enhancements and Best Practices
Thread-Safe Visited Map:
type Crawler struct {
Jobs chan string
Results chan storage.PageData
WorkerCount int
RateLimiter *RateLimiter
Visited map[string]bool
Mutex sync.RWMutex
}
func (c *Crawler) MarkVisited(url string) {
c.Mutex.Lock()
defer c.Mutex.Unlock()
c.Visited[url] = true
}
func (c *Crawler) IsVisited(url string) bool {
c.Mutex.RLock()
defer c.Mutex.RUnlock()
return c.Visited[url]
}
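Note that calling IsVisited and then MarkVisited from a worker still leaves a small window between the two lock acquisitions in which another worker can claim the same URL. A check-and-set under a single write lock closes that gap (MarkVisitedIfNew is a name I'm introducing here, not from the original code):

// MarkVisitedIfNew records url and reports whether it had not been seen before.
func (c *Crawler) MarkVisitedIfNew(url string) bool {
	c.Mutex.Lock()
	defer c.Mutex.Unlock()
	if c.Visited[url] {
		return false
	}
	c.Visited[url] = true
	return true
}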
Depth Limiting:
Prevent the crawler from going too deep by tracking the depth of each URL.
type Crawler struct {
// ...
MaxDepth int
}
type Job struct {
URL string
Depth int
}
// Modify jobs channel to accept Job instead of string
Jobs chan Job
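A sketch of how the worker might honor the depth limit when enqueuing child links (enqueueChildren is a hypothetical helper; it assumes the Job type above plus the MaxDepth field and the thread-safe IsVisited method):

// enqueueChildren queues child links as new jobs unless the depth limit is reached.
func enqueueChildren(c *Crawler, parent Job, links []string) {
	if parent.Depth >= c.MaxDepth {
		return // do not follow links any deeper
	}
	for _, link := range links {
		if !c.IsVisited(link) {
			c.Jobs <- Job{URL: link, Depth: parent.Depth + 1}
		}
	}
}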
Politeness Policies:
Respect robots.txt and implement domain-specific rate limits.
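For domain-specific limits, one approach is to keep a separate rate.Limiter per host. Below is a minimal sketch; the per-host rate of 1 request per second is an assumption of mine, and robots.txt handling (fetching and parsing each site's /robots.txt, typically with a third-party parser) is left out of the sketch.

// crawler/domain_limiter.go (illustrative)
package crawler

import (
	"context"
	"net/url"
	"sync"

	"golang.org/x/time/rate"
)

// DomainLimiter keeps an independent rate limiter per host.
type DomainLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func NewDomainLimiter() *DomainLimiter {
	return &DomainLimiter{limiters: make(map[string]*rate.Limiter)}
}

// Wait blocks until a request to rawURL's host is allowed.
func (d *DomainLimiter) Wait(rawURL string) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return
	}
	d.mu.Lock()
	l, ok := d.limiters[u.Host]
	if !ok {
		l = rate.NewLimiter(1, 1) // 1 request per second per host (assumed)
		d.limiters[u.Host] = l
	}
	d.mu.Unlock()
	l.Wait(context.Background())
}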