Data Extraction - CFspider API Documentation

Overview

CFspider extracts data from HTML and JSON responses using three selector types: CSS selectors, XPath, and JSONPath.

Quick Start

python

import cfspider

response = cfspider.get("https://example.com")

# Simplest usage
title = response.find("h1")  # selector type is detected automatically

# CSS selector
title = response.css("h1.title")

# XPath (xpath() returns the first match)
first_link = response.xpath("//a/@href")

# JSONPath (for JSON responses)
data = response.jpath("$.data.items[0].name")

# Batch extraction
result = response.pick(
    title="h1",
    links=("a", "href")
)
result.save("output.json")

response.find()

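Finds the first matching element. The selector type is detected automatically, which makes this the simplest entry point.

Function signature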

python

response.find(
    selector: str,
    attr: str = None,
    strip: bool = True,
    regex: str = None,
    parser: Callable = None
) -> Optional[str]

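Parameters

Parameter	Type	Default	Description
`selector`	`str`	required	Selector; its type is detected from the prefix. Starts with `$`: JSONPath (e.g. `"$.title"`). Starts with `//`: XPath (e.g. `"//h1/text()"`). Anything else: CSS selector (e.g. `"h1.title"`).
`attr`	`str`	`None`	Name of the attribute to extract (e.g. "href", "src"). `None` extracts the text content.
`strip`	`bool`	`True`	Whether to strip leading and trailing whitespace from the text.
`regex`	`str`	`None`	Regular expression applied to the extracted result to narrow it further.
`parser`	`Callable`	`None`	Custom function that converts the extracted result (e.g. `int`, `float`).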
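
The prefix rule in the `selector` row can be sketched in a few lines. This is only an illustration of the rule as described in the table; `detect_selector_type` is a hypothetical helper name, not part of CFspider's API, and the library's real detection logic may differ.

```python
def detect_selector_type(selector: str) -> str:
    """Classify a selector by its prefix (hypothetical helper, not CFspider code)."""
    if selector.startswith("$"):
        return "jsonpath"   # e.g. "$.title"
    if selector.startswith("//"):
        return "xpath"      # e.g. "//h1/text()"
    return "css"            # everything else, e.g. "h1.title"


for sample in ("$.title", "//h1/text()", "h1.title"):
    print(sample, "->", detect_selector_type(sample))
```

One consequence of a prefix rule like this is that an XPath expression that does not begin with `//` would be treated as a CSS selector, so calling `response.xpath()` directly is the safer choice for such expressions.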

Examples

python

# CSS selectors
title = response.find("h1")
title = response.find("h1.title")
title = response.find(".product-title")

# Extract an attribute
link = response.find("a", attr="href")
image = response.find("img", attr="src")

# XPath (find() returns only the first match)
title = response.find("//h1/text()")
link = response.find("//a/@href")

# JSONPath
name = response.find("$.data.name")
price = response.find("$.products[0].price")

# Narrow the result with a regular expression
price = response.find(".price", regex=r"\d+\.\d+")

# Convert the result with a parser function
price = response.find(".price", parser=float)
count = response.find(".count", parser=int)
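
The `strip`, `regex`, and `parser` options can be read as a small post-processing pipeline applied to the raw extracted text. The sketch below assumes the order strip, then regex, then parser, and assumes that a regex with no match yields `None`; neither is confirmed by this page. `postprocess` is a hypothetical name used only for illustration.

```python
import re
from typing import Any, Callable, Optional


def postprocess(
    raw: Optional[str],
    strip: bool = True,
    regex: Optional[str] = None,
    parser: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Apply strip -> regex -> parser to a raw value (assumed order, not CFspider code)."""
    if raw is None:
        return None  # nothing was extracted
    value = raw.strip() if strip else raw
    if regex is not None:
        match = re.search(regex, value)
        if match is None:
            return None  # assumption: no regex match means no result
        value = match.group(0)
    if parser is not None:
        value = parser(value)
    return value


print(postprocess("  Price: 19.99 USD\n", regex=r"\d+\.\d+", parser=float))  # 19.99
```

Combining `regex` with `parser` this way avoids a `ValueError` from `float()` when the element text contains a currency label or other surrounding characters.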

response.find_all()

Finds all matching elements and returns them as a list.

Function signature

python

response.find_all(
    selector: str,
    attr: str = None,
    strip: bool = True
) -> List[str]

Examples

python

# Extract all headings
titles = response.find_all("h2")

# Extract all links
links = response.find_all("a", attr="href")

# Extract all matches with XPath
items = response.find_all("//div[@class='item']")

# Extract all matches with JSONPath
names = response.find_all("$.products[*].name")

CSS Selectors

Extract HTML elements with CSS selectors.

response.css()

Extracts the first matching element.

python

# Extract text
title = response.css("h1")
title = response.css(".product-title")
title = response.css("#main-title")

# Extract an attribute
link = response.css("a", attr="href")
image = response.css("img", attr="src")

# Extract the HTML
html = response.css("div.content", html=True)

response.css_all()

Extracts all matching elements.

python

# Extract all headings
titles = response.css_all("h2")

# Extract all links
links = response.css_all("a", attr="href")

# Extract all images
images = response.css_all("img", attr="src")

response.css_one()

Returns the first match as an Element object, which supports chained lookups.

python

# Chained lookups
product = response.css_one(".product")
title = product.find("h1")
price = product.find(".price")
link = product.find("a", attr="href")

# Element object attributes
element = response.css_one("#main")
print(element.text)    # text content
print(element.html)    # HTML content
print(element["href"]) # a single attribute
print(element.attrs)   # all attributes

XPath

Extract data with XPath expressions. Requires lxml: pip install lxml

response.xpath()

Extracts the first match.

python

# Extract text
title = response.xpath("//h1/text()")
title = response.xpath("//div[@class='title']/text()")

# Extract an attribute
link = response.xpath("//a/@href")
image = response.xpath("//img/@src")

# A more complex expression
price = response.xpath("//div[@class='product']/span[@class='price']/text()")

response.xpath_all()

Extracts all matches.

python

# Extract all headings
titles = response.xpath_all("//h2/text()")

# Extract all links
links = response.xpath_all("//a/@href")

# Extract all product prices
prices = response.xpath_all("//div[@class='product']//span[@class='price']/text()")
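
To make the difference between the first-match and all-match variants concrete, here is a sketch that uses Python's standard-library `xml.etree.ElementTree` instead of lxml. ElementTree supports only a subset of XPath (no `text()` or `@attr` steps), so text and attributes are read from the element objects. The sample markup is invented for illustration, and none of this is CFspider's own code.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (ElementTree needs valid XML).
DOC = """
<html><body>
  <h2>First</h2>
  <h2>Second</h2>
  <a href="/a">A</a>
  <a href="/b">B</a>
</body></html>
""".strip()

root = ET.fromstring(DOC)

# First match only: analogous in spirit to response.xpath("//h2/text()")
first_heading = root.find(".//h2").text

# Every match: analogous in spirit to response.xpath_all("//h2/text()")
all_headings = [h.text for h in root.findall(".//h2")]

# Attribute of every match: analogous in spirit to response.xpath_all("//a/@href")
all_links = [a.get("href") for a in root.findall(".//a")]

print(first_heading)
print(all_headings)
print(all_links)
```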

response.xpath_one()

Returns the first match as an Element object.

python

product = response.xpath_one("//div[@class='product']")
title = product.find("h1")
price = product.find(".price")

JSONPath

Extract JSON data with JSONPath expressions. Requires jsonpath-ng: pip install jsonpath-ng

response.jpath()

Extracts the first matching value.

python

# Basic paths
name = response.jpath("$.name")
price = response.jpath("$.product.price")

# Array indexes
first_item = response.jpath("$.items[0].name")
last_item = response.jpath("$.items[-1].name")

# Wildcard (jpath() still returns only the first match; use jpath_all() for every match)
first_name = response.jpath("$.items[*].name")

# Filter
filtered = response.jpath("$.items[?(@.price > 100)].name")

response.jpath_all()

Extracts all matching values.

python

# Extract all names
names = response.jpath_all("$.items[*].name")

# Extract all prices
prices = response.jpath_all("$.products[*].price")
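
The relationship between `jpath()` and `jpath_all()` can be illustrated with a toy evaluator. It handles only `.key`, `[index]`, and `[*]` on lists, which is a tiny subset of JSONPath. It is not jsonpath-ng and not CFspider's implementation; the function names and the sample payload are made up for this sketch.

```python
import re
from typing import Any, List

# One path step: .key, [index], or [*]
TOKEN = re.compile(r"\.(\w+)|\[(\d+)\]|\[\*\]")


def jsonpath_all(data: Any, path: str) -> List[Any]:
    """Return every value matched by a tiny JSONPath subset (toy sketch)."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = [data]
    pos = 1
    while pos < len(path):
        step = TOKEN.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported syntax: {path[pos:]!r}")
        key, index = step.group(1), step.group(2)
        matched = []
        for node in current:
            if key is not None:
                if isinstance(node, dict) and key in node:
                    matched.append(node[key])
            elif index is not None:
                if isinstance(node, list) and int(index) < len(node):
                    matched.append(node[int(index)])
            elif isinstance(node, list):  # [*]: this toy expands lists only
                matched.extend(node)
        current = matched
        pos = step.end()
    return current


def jsonpath_first(data: Any, path: str) -> Any:
    """Return the first match, or None when nothing matches."""
    matches = jsonpath_all(data, path)
    return matches[0] if matches else None


payload = {"items": [{"name": "pen", "price": 2}, {"name": "desk", "price": 150}]}
print(jsonpath_all(payload, "$.items[*].name"))
print(jsonpath_first(payload, "$.items[*].name"))
```

With a wildcard path, the all-match form yields one value per list entry while the first-match form yields a single value, which is why a wildcard is usually paired with `jpath_all()`.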

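Batch Extraction

Use pick() to extract several fields in one call.

response.pick()

Extracts multiple fields and returns an ExtractResult object that can be saved directly to a file.

Function signature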

python

response.pick(**fields) -> ExtractResult

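Parameters

**fields maps field names to selectors. Each value can take one of these forms:

String: a selector whose text is extracted (e.g. title="h1"). CSS is the basic case; JSONPath and XPath strings are also accepted.
Tuple (selector, attr): extracts an attribute (e.g. links=("a", "href"))
Tuple (selector, attr, converter): extracts and converts (e.g. price=(".price", "text", float))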
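
The three accepted forms can be reduced to one uniform `(selector, attr, converter)` triple before extraction. The sketch below shows one way to do that; `normalize_field` is a hypothetical helper, and treating the attribute name `"text"` as "extract the text content" is an assumption drawn from the `(".price", "text", float)` example rather than documented behavior.

```python
from typing import Any, Callable, Optional, Tuple

Normalized = Tuple[str, Optional[str], Optional[Callable[[str], Any]]]


def normalize_field(spec: Any) -> Normalized:
    """Turn a pick()-style field value into (selector, attr, converter). Illustrative only."""
    if isinstance(spec, str):
        return spec, None, None              # plain selector: extract text
    if isinstance(spec, tuple) and len(spec) in (2, 3):
        selector, attr = spec[0], spec[1]
        converter = spec[2] if len(spec) == 3 else None
        if attr == "text":
            attr = None                      # assumption: "text" means text content
        return selector, attr, converter
    raise TypeError(f"unsupported field spec: {spec!r}")


fields = {
    "title": "h1",
    "link": ("a", "href"),
    "price": (".price", "text", float),
}
for name, spec in fields.items():
    print(name, normalize_field(spec))
```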

Examples

python

# Basic usage
data = response.pick(
    title="h1",
    description=".description"
)

# Extract attributes
data = response.pick(
    title="h1",
    links=("a", "href"),
    images=("img", "src")
)

# Type conversion
data = response.pick(
    title="h1",
    price=(".price", "text", float),
    count=(".count", "text", int)
)

# Mixing selector types
data = response.pick(
    title="h1",                    # CSS, text
    link=("a", "href"),            # CSS, attribute
    api_data="$.data.name",        # JSONPath
    xpath_data="//div/text()"      # XPath
)

# Save the result
data.save("output.json")  # format is chosen from the file extension
data.save("output.csv")
data.save("output.xlsx")

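ExtractResult Object

pick() returns an ExtractResult object. It is a dict subclass that adds the following methods:

Methods

Method	Description
`save(filepath, **kwargs)`	Saves the extracted result to a file; the format (JSON/CSV/Excel/SQLite) is chosen from the file extension
`to_json(**kwargs)`	Converts the result to a JSON string

Attributes

url (str): the URL that was requested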

Examples

python

data = response.pick(title="h1", price=".price")

# Use it as a dict
print(data["title"])
print(data["price"])

# Save to a file
data.save("output.json")
data.save("output.csv")
data.save("output.xlsx")

# Convert to JSON
json_str = data.to_json(indent=2)
print(json_str)
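
For readers curious how a dict subclass can choose an output format from the file extension, here is a minimal standard-library sketch covering JSON and CSV only. `ResultSketch` is a made-up class for illustration and is not the real ExtractResult, whose Excel and SQLite support and exact keyword arguments are not reproduced here.

```python
import csv
import json
import os
import tempfile


class ResultSketch(dict):
    """Dict subclass with extension-based saving (illustrative, not ExtractResult)."""

    def to_json(self, **kwargs) -> str:
        kwargs.setdefault("ensure_ascii", False)  # keep non-ASCII text readable
        return json.dumps(self, **kwargs)

    def save(self, filepath: str) -> None:
        ext = os.path.splitext(filepath)[1].lower()
        if ext == ".json":
            with open(filepath, "w", encoding="utf-8") as fh:
                fh.write(self.to_json(indent=2))
        elif ext == ".csv":
            with open(filepath, "w", encoding="utf-8", newline="") as fh:
                writer = csv.DictWriter(fh, fieldnames=list(self))
                writer.writeheader()
                writer.writerow(self)  # one row: field name -> value
        else:
            raise ValueError(f"unsupported extension in this sketch: {ext!r}")


data = ResultSketch(title="Example", price="19.99")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    data.save(path)
    with open(path, encoding="utf-8") as fh:
        print(fh.read())
```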

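Installing Dependencies

Optional dependencies

The data extraction features rely on the following optional dependencies:

beautifulsoup4: CSS selector support (usually already installed)
lxml: XPath support. Install with pip install lxml or pip install cfspider[xpath]
jsonpath-ng: JSONPath support. Install with pip install jsonpath-ng or pip install cfspider[jsonpath]

Or install all of them at once: pip install cfspider[extract]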
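
Before relying on a particular selector type, it can help to check which optional backends are importable in the current environment. The sketch below uses only the standard library. The import names `bs4`, `lxml`, and `jsonpath_ng` are the real module names of the three packages; the `available_backends` helper and the feature labels are made up for illustration.

```python
from importlib.util import find_spec

# Feature label -> importable module name of the backing package.
OPTIONAL_BACKENDS = {
    "css": "bs4",               # beautifulsoup4
    "xpath": "lxml",            # lxml
    "jsonpath": "jsonpath_ng",  # jsonpath-ng
}


def available_backends() -> dict:
    """Report which optional extraction backends can be imported."""
    return {
        feature: find_spec(module) is not None
        for feature, module in OPTIONAL_BACKENDS.items()
    }


for feature, installed in available_backends().items():
    print(f"{feature}: {'installed' if installed else 'missing'}")
```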