Python Web Scraping: Crawling Baidu Images by Keyword

2023-12-22

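1. Overview

Baidu Images is a sub-module of the Baidu search engine that offers a large collection of images. With a Python crawler we can fetch images from it by keyword, which makes it easier to collect and reuse them.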

2. Implementation Steps

Import the required libraries

import requests
from bs4 import BeautifulSoup

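Build the request headers

To mimic a browser, we send a User-Agent header so that Baidu's server treats the crawler's requests as coming from a browser.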

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
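Some image hosts also check the Referer header before serving a file. As a minimal sketch, the helper below returns the User-Agent used above plus an optional Referer. The name build_headers is hypothetical, and the need for a Referer (and its value) is an assumption rather than something this article states.

```python
# User-Agent string copied from the headers shown above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')


def build_headers(referer=None):
    """Return browser-like request headers (hypothetical helper)."""
    headers = {'User-Agent': USER_AGENT}
    if referer:
        # Assumption: the server may reject image requests without a Referer.
        headers['Referer'] = referer
    return headers


print(build_headers('https://image.baidu.com/'))
```

The returned dictionary can be passed as the headers argument of requests.get in the same way as the headers variable above.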

Build the request URL

The search URL for Baidu Images can be constructed as follows:

url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30'

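Here, queryWord is the search keyword, and pn is the offset of the first result to return (0, 30, 60, ...) rather than a page number. rn is the number of results per request.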
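To make the two placeholders concrete, here is a small sketch that fills in the template for several consecutive requests. It assumes pn is a result offset that advances by rn=30 each time, which is how this endpoint is commonly observed to behave. build_search_urls is a hypothetical helper name.

```python
from urllib.parse import quote

# Same template as above; the two {} slots are the keyword and pn.
SEARCH_URL = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30'


def build_search_urls(keyword, pages, per_page=30):
    """Return one URL per request (hypothetical helper)."""
    encoded = quote(keyword)  # percent-encode non-ASCII keywords as UTF-8
    # Assumption: pn is an offset, so request k starts at k * per_page.
    return [SEARCH_URL.format(encoded, page * per_page) for page in range(pages)]


for u in build_search_urls('风景', 3):
    print(u[-12:])  # show the tail of each URL, which holds pn and rn
```

Encoding the keyword explicitly keeps the URL valid even if it is later passed to a client that does not encode it automatically.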

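Send the request and parse the response

We use the requests library to send the request. As tn=resultjson_com in the URL suggests, this endpoint returns JSON rather than an HTML page, so the body is parsed with response.json() and BeautifulSoup is not needed for it.

response = requests.get(url.format('风景', 0), headers=headers, timeout=10)
response.raise_for_status()
data = response.json()

Extract the image links

Each entry in the data list describes one image. The thumbURL field name comes from responses observed from this endpoint. It is not documented and may change.

image_links = []
for item in data.get('data', []):
    # Some entries (for example a trailing empty object) carry no link.
    if item.get('thumbURL'):
        image_links.append(item['thumbURL'])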
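Because the tn=resultjson_com endpoint appears to return JSON rather than HTML, link extraction can be tested offline against a small dictionary. The sketch below is illustrative only: extract_image_links is a hypothetical helper, the sample payload is made up, and the data and thumbURL keys are assumptions based on observed responses, not a documented API.

```python
def extract_image_links(payload):
    """Collect non-empty thumbnail URLs from a parsed JSON payload."""
    links = []
    for item in payload.get('data', []):
        # Assumption: each entry is a dict that may or may not have 'thumbURL'.
        url = item.get('thumbURL') if isinstance(item, dict) else None
        if url:
            links.append(url)
    return links


# Made-up payload shaped like the assumed response, including an empty entry.
sample = {'data': [{'thumbURL': 'https://example.com/a.jpg'},
                   {'thumbURL': ''},
                   {}]}
print(extract_image_links(sample))  # ['https://example.com/a.jpg']
```

Keeping the parsing in its own function means it can be adjusted in one place if the field names turn out to differ.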

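Download the images

We use requests again to download each image. The last path segment of a link can contain characters such as ? and & that make poor file names, so the files are numbered instead.

for i, image_link in enumerate(image_links):
    response = requests.get(image_link, headers=headers, timeout=10)
    # Assumes the images are JPEG; adjust the extension if they are not.
    with open(f'{i}.jpg', 'wb') as f:
        f.write(response.content)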
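File naming is the part of the download step most likely to go wrong, since image links often end in query-like text instead of a clean file name. One possible approach, sketched below, derives a numbered name and keeps the extension only when the URL path has a recognizable one. local_filename, the images folder, and the .jpg fallback are all assumptions made for illustration.

```python
import os
from urllib.parse import urlparse

KNOWN_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')


def local_filename(image_url, index, folder='images'):
    """Build a safe numbered path for a downloaded image (hypothetical helper)."""
    # Look only at the URL path, so query strings never reach the file name.
    ext = os.path.splitext(urlparse(image_url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = '.jpg'  # assumption: fall back to JPEG when the type is unclear
    return os.path.join(folder, f'{index:04d}{ext}')


print(local_filename('https://example.com/pics/cat.PNG?w=10', 3))
print(local_filename('https://example.com/it/u=1,2&fm=253', 0))
```

The target folder must exist before writing. os.makedirs(folder, exist_ok=True) creates it if needed.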

3. Complete Code

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

keyword = '风景'
url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30'

# Collect thumbnail links. pn is a result offset, so step by rn=30.
image_links = []
for pn in range(0, 300, 30):
    response = requests.get(url.format(keyword, pn), headers=headers, timeout=10)
    response.raise_for_status()
    # The endpoint returns JSON; 'data' and 'thumbURL' are observed, undocumented field names.
    for item in response.json().get('data', []):
        if item.get('thumbURL'):
            image_links.append(item['thumbURL'])

# Download each image, numbering the files because the links make poor file names.
for i, image_link in enumerate(image_links):
    response = requests.get(image_link, headers=headers, timeout=10)
    with open(f'{i}.jpg', 'wb') as f:
        f.write(response.content)
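Links collected over several requests can repeat, and it is often useful to cap how many images are downloaded. The sketch below shows one way to do both before the download loop. unique_links is a hypothetical helper and is not part of the script above.

```python
def unique_links(links, limit=None):
    """Drop duplicate links, keep first-seen order, optionally cap the count."""
    unique = list(dict.fromkeys(links))  # dict keys preserve insertion order
    return unique if limit is None else unique[:limit]


links = ['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg']
print(unique_links(links))           # ['a.jpg', 'b.jpg', 'c.jpg']
print(unique_links(links, limit=2))  # ['a.jpg', 'b.jpg']
```

For example, calling unique_links(image_links, limit=100) before the download loop would cap the run at 100 distinct images.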

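4. Results

After running the code above, the downloaded 风景 (scenery) images are saved in the current directory. The code does not create a separate folder, and the number of images depends on how many links the requests return.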

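Copyright notice:
Author: 小龙人
Link: https://www.xuexizoo.com/article/1759808063168069895.html
Copyright belongs to the author. Do not repost without permission. To report infringement, email the site administrator at 121671486@qq.com.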
