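Generators are an easily overlooked but important Python feature. When processing large datasets, generators can significantly reduce memory usage, and yield is the foundation on which generator-based coroutines were built, which in turn shaped Python's asynchronous programming model. This article explores how generators work and how to apply them in practice.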

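The Iterator Protocol in Detail

Before looking at generators, it helps to understand Python's iterator protocol.

Iterables and Iterators

Iterable: an object that implements __iter__ and can therefore be traversed with a for loop.

Iterator: an object that implements both __iter__ and __next__, and returns its elements one at a time.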

# A list is iterable, but it is not an iterator
numbers = [1, 2, 3]
print(hasattr(numbers, '__iter__'))  # True
print(hasattr(numbers, '__next__'))  # False

# iter() returns an iterator over the list
it = iter(numbers)
print(type(it))  # <class 'list_iterator'>
print(next(it))  # 1
print(next(it))  # 2
print(next(it))  # 3
print(next(it))  # raises StopIteration
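
The same distinction can be checked with the abstract base classes in collections.abc instead of hasattr. The following is a minimal sketch; the variable names are illustrative only.

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]
it = iter(numbers)

# A list is iterable but is not itself an iterator.
list_is_iterable = isinstance(numbers, Iterable)
list_is_iterator = isinstance(numbers, Iterator)

# The object returned by iter() is both.
it_is_iterable = isinstance(it, Iterable)
it_is_iterator = isinstance(it, Iterator)

# Calling iter() on an iterator hands back the same object.
same_object = iter(it) is it

print(list_is_iterable, list_is_iterator)  # True False
print(it_is_iterable, it_is_iterator)      # True True
print(same_object)                         # True
```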

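How the for Loop Works Internally

iterable = ['a', 'b', 'c']  # any iterable works here

# The for loop as written
for item in iterable:
    print(item)

# Roughly what Python does under the hood
iterator = iter(iterable)
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break
    print(item)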

Custom Iterators

class Range:
    """An iterator that mimics range(start, end)."""

    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self  # an iterator's __iter__ must return the iterator itself

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Usage
for i in Range(0, 3):
    print(i)  # prints 0, 1, 2 on separate lines
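
One consequence of Range returning itself from __iter__ is that it can be traversed only once. The sketch below contrasts it with a separate iterable class that builds a fresh iterator on every __iter__ call. ReusableRange is a hypothetical name introduced here for illustration, not part of the original example.

```python
class Range:
    """Iterator from the text: it is its own iterator, so it is one-shot."""

    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value


class ReusableRange:
    """Hypothetical iterable: every __iter__ call creates a new Range iterator."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        return Range(self.start, self.end)


one_shot = Range(0, 3)
first_pass = list(one_shot)   # [0, 1, 2]
second_pass = list(one_shot)  # [] because the iterator is already exhausted

reusable = ReusableRange(0, 3)
first = list(reusable)        # [0, 1, 2]
again = list(reusable)        # [0, 1, 2], a fresh iterator each time

print(first_pass, second_pass)
print(first, again)
```

Separating the iterable from its iterator is the same split that list and list_iterator use.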

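Generator Functions and Generator Objects

Generator Functions

A function whose body contains the yield keyword is no longer an ordinary function; it is a generator function:

def count_up_to(max_val):
    """Yield the integers from 0 through max_val, inclusive."""
    count = 0
    while count <= max_val:
        yield count
        count += 1

# Calling a generator function returns a generator object
generator = count_up_to(3)
print(type(generator))  # <class 'generator'>
print(list(generator))  # [0, 1, 2, 3]

Note: calling a generator function does not run its body. Execution starts only when next() is called on the generator object or when it is iterated over.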

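How a Generator Executes

def simple_gen():
    print("start")
    yield 1
    print("resume")
    yield 2
    print("finishing")
    yield 3

gen = simple_gen()
print("generator created")

# Runs until the first yield, then pauses
print(next(gen))  # prints "start", then 1

# Resumes and runs until the second yield
print(next(gen))  # prints "resume", then 2

# Resumes and runs until the third (last) yield
print(next(gen))  # prints "finishing", then 3

# One more call: the function body finishes and the generator is exhausted
print(next(gen))  # raises StopIteration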

Execution flow:

create generator (no code runs yet)
    │
    ▼
next() ──► runs to yield 1 ──► pauses, returns 1
    │
    ▼
next() ──► runs to yield 2 ──► pauses, returns 2
    │
    ▼
next() ──► runs to yield 3 ──► pauses, returns 3
    │
    ▼
next() ──► function body ends ──► raises StopIteration

Using yield from

yield from was introduced in Python 3.3 (PEP 380) and delegates iteration to another generator or iterable.

Basic Usage

def gen1():
    yield 1
    yield 2

def gen2():
    yield from gen1()  # delegate to gen1
    yield 3

# Equivalent to
def gen2_equivalent():
    yield 1
    yield 2
    yield 3

print(list(gen2()))  # [1, 2, 3]
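
Besides re-yielding values, a yield from expression also evaluates to the sub-generator's return value, which a plain for loop over the sub-generator would discard. The sketch below illustrates this; the function names average and report are made up for the example.

```python
def average():
    """Sub-generator: yields each value, then returns a summary."""
    total = 0
    count = 0
    for value in (10, 20, 30):
        total += value
        count += 1
        yield value
    return total / count  # becomes the value of the `yield from` expression


def report():
    """Delegating generator: captures the sub-generator's return value."""
    result = yield from average()
    yield f"average={result}"


output = list(report())
print(output)  # [10, 20, 30, 'average=20.0']
```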

What yield from Is Good For

yield from is mainly used in two scenarios:

1. Chaining generators

def chain(*iterables):
    """Chain several iterables into one stream."""
    for it in iterables:
        yield from it

# Has the same effect as
def chain_v2(*iterables):
    for it in iterables:
        for item in it:
            yield item

print(list(chain([1, 2], [3, 4], [5, 6])))  # [1, 2, 3, 4, 5, 6]

2. Recursive delegation

def flatten(nested_list):
    """Flatten an arbitrarily nested list."""
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)  # delegate recursively
        else:
            yield item

nested = [1, [2, 3], [4, [5, 6]], 7]
print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6, 7]

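Lazy Evaluation in Generators

The central advantage of generators is lazy evaluation: the next value is computed only when it is requested.

Memory Efficiency Comparison

# Eager approach: the whole list is built in memory at once
def get_squares_list(n):
    return [x ** 2 for x in range(n)]

# Generator approach: values are computed on demand
def get_squares_gen(n):
    for x in range(n):
        yield x ** 2

# Memory comparison (n = 10_000_000)
squares_list = get_squares_list(10_000_000)
# List: hundreds of MB on 64-bit CPython
# (about 80 MB for the pointer array alone, plus the int objects)

squares_gen = get_squares_gen(10_000_000)
# Generator: a small, constant size regardless of n
# (only the generator object and its frame are stored)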
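
The size difference can be observed directly with sys.getsizeof. This is a rough sketch: it assumes CPython, uses a smaller n than the text so it runs quickly, and measures only the container objects themselves.

```python
import sys


def get_squares_list(n):
    return [x ** 2 for x in range(n)]


def get_squares_gen(n):
    for x in range(n):
        yield x ** 2


n = 100_000  # smaller than in the text so the sketch runs quickly

squares_list = get_squares_list(n)
squares_gen = get_squares_gen(n)

# getsizeof reports only the container itself (for a list, its pointer array),
# not the int objects it references, so the list's real footprint is larger.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

print(f"list object:      {list_bytes} bytes")
print(f"generator object: {gen_bytes} bytes")

# Both produce the same values; the generator just computes them on demand.
same_total = sum(squares_gen) == sum(squares_list)
print(same_total)  # True
```

For a fuller picture that includes the int objects, the standard library's tracemalloc module would be a better measuring tool than getsizeof.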

Practical Use Cases

# Case 1: reading a very large file
def read_large_file(file_path):
    """Read the file line by line without loading it all into memory."""
    with open(file_path, 'r') as f:
        for line in f:
            yield line.strip()

# Usage
for line in read_large_file('huge_log.txt'):
    if 'ERROR' in line:
        print(line)

# Case 2: infinite sequences
def fibonacci():
    """The Fibonacci sequence (infinite)."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take only the first N values; the rest are never computed
fib = fibonacci()
for _ in range(10):
    print(next(fib))  # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
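
Instead of calling next() in a counting loop, the standard library's itertools.islice can take a bounded slice from an infinite generator. A minimal sketch, reusing the fibonacci generator from above:

```python
from itertools import islice


def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take the first ten values lazily, then materialize them.
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# islice can also skip ahead: the items at positions 10, 11, and 12.
later = list(islice(fibonacci(), 10, 13))
print(later)  # [55, 89, 144]
```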

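Advanced Generators: send, throw, and close

Generator objects provide three more methods for finer control.

The send() Method

send() resumes the generator and passes a value into it; that value becomes the result of the yield expression the generator was paused at:

def counter():
    count = 0
    while True:
        # Pauses here; the value passed to send() becomes `received`
        received = yield count
        if received is not None:
            count = received
        else:
            count += 1

gen = counter()
print(next(gen))  # 0: the generator starts and pauses at yield
print(gen.send(10))  # 10: the count is reset to 10
print(next(gen))  # 11
print(gen.send(5))  # 5: the count is reset to 5
print(next(gen))  # 6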

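A typical application is coroutines: send() is the basis of communication with generator-based coroutines.

def coro():
    print("coroutine started")
    while True:
        value = yield
        print(f"received: {value}")

c = coro()
next(c)  # prime the coroutine: run it to the first yield
c.send(100)  # prints "received: 100"
c.send(200)  # prints "received: 200"
c.close()  # shut the coroutine down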
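
A coroutine can also hand a result back on every send(), because the value yielded at the next pause is what send() returns. The running-average sketch below is an illustrative example, not part of the original article.

```python
def running_average():
    """Coroutine: receives numbers via send() and yields the average so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        # Hand back the current average and wait for the next number.
        value = yield average
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)  # prime the coroutine; this first yield produces None

results = [avg.send(10), avg.send(20), avg.send(60)]
print(results)  # [10.0, 15.0, 30.0]

avg.close()
```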

The throw() Method

throw() raises an exception inside the generator at the point where it is paused:

def gen_with_error():
    try:
        yield 1
        yield 2
    except ValueError:
        yield "caught ValueError"
        yield 3

g = gen_with_error()
print(next(g))  # 1
print(next(g))  # 2
print(g.throw(ValueError))  # "caught ValueError"
print(next(g))  # 3

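The close() Method

close() shuts the generator down cleanly by raising GeneratorExit at the yield where it is paused:

def simple_gen():
    yield 1
    yield 2  # never reached in this example

g = simple_gen()
print(next(g))  # 1
g.close()  # close the generator
print(next(g))  # raises StopIteration: a closed generator is exhausted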
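
Because close() works by raising GeneratorExit inside the generator, a try/finally block in the generator body gets a chance to release resources. The sketch below records events in a list to make the order visible; resource_gen and events are hypothetical names used only for this illustration.

```python
events = []


def resource_gen():
    """Hypothetical generator that must clean up when it is closed early."""
    events.append("open")
    try:
        yield 1
        yield 2
    finally:
        # Runs when close() raises GeneratorExit at the paused yield,
        # and also when the generator finishes normally.
        events.append("cleanup")


g = resource_gen()
first = next(g)  # 1
g.close()        # GeneratorExit is raised inside the generator at `yield 1`

try:
    next(g)
    exhausted = False
except StopIteration:
    # A closed generator behaves like an exhausted one.
    exhausted = True

print(events)     # ['open', 'cleanup']
print(exhausted)  # True
```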

In Practice: A Memory-Efficient Data Processing Pipeline

Putting these pieces together, we can build a memory-efficient data processing pipeline:

def read_users(file_path):
    """Read user records from a CSV file with name, age, city columns."""
    with open(file_path, 'r') as f:
        next(f)  # skip the header row
        for line in f:
            name, age, city = line.strip().split(',')
            yield {'name': name, 'age': int(age), 'city': city}

def filter_age(users, min_age):
    """Keep users whose age is at least min_age."""
    for user in users:
        if user['age'] >= min_age:
            yield user

def group_by_city(users):
    """Group users by city (this step has to hold every user it receives)."""
    groups = {}
    for user in users:
        groups.setdefault(user['city'], []).append(user)
    yield from groups.items()  # yields (city, [users...])

def process_pipeline(file_path):
    """Wire the stages together."""
    users = read_users(file_path)
    adults = filter_age(users, 18)
    grouped = group_by_city(adults)
    return grouped

# Usage
for city, users in process_pipeline('users.csv'):
    print(f"{city}: {len(users)} users")

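Advantages of this pipeline:

Streaming: the read and filter stages handle one record at a time, so the raw file is never loaded into memory in full. The grouping stage is the exception, because it must collect every filtered user before it can yield a group.
Lazy evaluation: no data is read or processed until the result is iterated over.
Composable: each stage is independent, which makes it easy to maintain and test.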
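
When only per-city counts are needed, as in the usage loop above, the final stage does not have to keep the user records at all. The sketch below is one possible variant under stated assumptions: the sample data is invented, count_by_city is a hypothetical replacement for group_by_city, and read_users is changed to accept any iterable of lines (parsed with csv.DictReader) so it can run without a real users.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for users.csv
SAMPLE_CSV = """name,age,city
Alice,30,Beijing
Bob,17,Shanghai
Carol,25,Beijing
Dave,40,Shanghai
"""


def read_users(lines):
    """Yield one user dict per CSV row; accepts any iterable of lines."""
    for row in csv.DictReader(lines):
        yield {'name': row['name'], 'age': int(row['age']), 'city': row['city']}


def filter_age(users, min_age):
    """Keep users whose age is at least min_age."""
    for user in users:
        if user['age'] >= min_age:
            yield user


def count_by_city(users):
    """Consume the stream, keeping only one integer counter per city."""
    counts = Counter()
    for user in users:
        counts[user['city']] += 1
    return counts


users = read_users(io.StringIO(SAMPLE_CSV))
adults = filter_age(users, 18)
counts = count_by_city(adults)
print(dict(counts))  # {'Beijing': 2, 'Shanghai': 1}
```

With this variant, memory use grows with the number of distinct cities rather than with the number of users.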

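Summary

Generators are an essential part of Python:

Feature	Description
`__iter__` / `__next__`	The foundation of the iterator protocol
`yield`	Turns a function into a generator function; enables lazy evaluation
`yield from`	Delegates to a sub-generator
`send()`	Sends a value into a generator
`throw()`	Raises an exception inside a generator
`close()`	Shuts a generator down cleanly

With generators in your toolkit, you can write memory-efficient Python code that copes with large datasets.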

张会挽's Blog

Python Generators and Iterators: The Secrets of the yield Keyword