Initial commit: Baijiahao article scraper system

This commit is contained in:
sjk
2025-12-19 22:48:58 +08:00
commit 0d5bbb1864
37 changed files with 11774 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,110 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# Virtual Environment
venv/
ENV/
env/
.venv
# PyInstaller
*.manifest
*.spec
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.pytest_cache/
# Flask
instance/
.webassets-cache
# Scrapy
.scrapy
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# IDEs
.idea/
.vscode/
*.swp
*.swo
*~
# Database
*.db
*.sqlite
*.sqlite3
# Data files (if data should not be committed)
data/*.json
data/results/*.xlsx
baidu_data.json
baijiahao_data.json
*.backup
app.py.backup
# Excel exports
exports/*.xlsx
# Zip archives
*.zip
# Log files
*.log
logs/
# OS
.DS_Store
Thumbs.db
desktop.ini
# Temporary files
tmp/
temp/
*.tmp
*.bak
# Service files (system-specific)
*.service
# Process IDs
*.pid

DATABASE_MIGRATION.md Normal file

@@ -0,0 +1,177 @@
# SQLite Database Migration Notes

## Overview
The system has migrated from JSON file storage to an SQLite database, which provides better performance, concurrency support, and data integrity.

## Key Changes

### 1. New files
- `database.py` - SQLite database management module
- `test_database.py` - database test script
- `data/baijiahao.db` - SQLite database file (created automatically)

### 2. Modified files
- `task_queue.py` - uses SQLite instead of JSON file storage

### 3. Data migration
- Legacy data is migrated automatically from `data/task_queue.json` into the database
- After a successful migration, a backup file `data/task_queue.json.backup` is created
- The original JSON file is kept and can be deleted safely

## Database Schema

### tasks table
```sql
CREATE TABLE tasks (
    task_id TEXT PRIMARY KEY,             -- task ID
    url TEXT NOT NULL,                    -- Baijiahao URL
    months REAL NOT NULL,                 -- number of months to fetch
    use_proxy INTEGER NOT NULL,           -- whether to use a proxy (0/1)
    proxy_api_url TEXT,                   -- proxy API address
    username TEXT,                        -- username
    status TEXT NOT NULL,                 -- task status
    created_at TEXT NOT NULL,             -- creation time
    started_at TEXT,                      -- start time
    completed_at TEXT,                    -- completion time
    progress INTEGER DEFAULT 0,           -- progress (0-100)
    current_step TEXT,                    -- current step
    total_articles INTEGER DEFAULT 0,     -- total number of articles
    processed_articles INTEGER DEFAULT 0, -- number of processed articles
    error TEXT,                           -- error message
    result_file TEXT                      -- result file path
);
```
### Indexes
- `idx_tasks_status` - status index (speeds up status queries)
- `idx_tasks_username` - username index (speeds up per-user filtering)
- `idx_tasks_created_at` - creation-time index (speeds up time-based sorting)
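As a sanity check, the table and indexes above can be created with Python's built-in `sqlite3` module. This is a minimal sketch against an in-memory database (the column list is abbreviated to the fields the indexes use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Abbreviated version of the tasks table from the schema above
cur.execute("""
    CREATE TABLE tasks (
        task_id TEXT PRIMARY KEY,
        url TEXT NOT NULL,
        username TEXT,
        status TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")
# The three indexes described above
cur.execute("CREATE INDEX idx_tasks_status ON tasks(status)")
cur.execute("CREATE INDEX idx_tasks_username ON tasks(username)")
cur.execute("CREATE INDEX idx_tasks_created_at ON tasks(created_at)")
conn.commit()

# Confirm SQLite actually uses the status index for a status filter
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tasks WHERE status = 'pending'"
).fetchall()
print(plan[0][-1])  # the plan detail should mention idx_tasks_status
```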
## Benefits

### 1. Performance
- Indexes make queries faster
- Optimized SQL queries reduce memory usage
- No longer reads and rewrites the whole file on every operation

### 2. Concurrency safety
- Thread-safe connection management
- Transaction support at the database level
- No more file-lock contention

### 3. Data integrity
- Primary-key constraint prevents duplicates
- Transactions keep data consistent
- Rollback on exceptions

### 4. Extensibility
- Easy to add new columns and indexes
- Supports complex queries and statistics
- Easier to build future features on
## Usage

### Test the database
```bash
python test_database.py
```

### Migrate data manually
```python
from database import migrate_from_json

# Migrate from JSON into the database
count = migrate_from_json("data/task_queue.json")
print(f"Migrated {count} tasks")
```

### Use the database directly
```python
from database import get_database

db = get_database()
with db.get_connection() as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM tasks")
    tasks = cursor.fetchall()
```

## Backward Compatibility
- Existing API endpoints are unchanged
- No caller code needs to be modified
- Legacy data is migrated automatically
- The original JSON file is preserved

## Notes
1. **First start**: the system creates the database and migrates data automatically
2. **Backups**: after a successful migration, back up `data/baijiahao.db`
3. **Cleanup**: once the migration is confirmed, `data/task_queue.json` can be deleted
4. **Performance**: the database advantage is most noticeable with large numbers of tasks
## Troubleshooting

### Database locked
If you see a "database is locked" error:
- Check whether multiple processes are accessing the database at the same time
- Restart the application
- The connection timeout is already set to 30 seconds

### Migration failed
If the migration fails:
- Check the format of `data/task_queue.json`
- Check the error messages in the logs
- Run `test_database.py` manually to test

### Data loss
- Check the `data/task_queue.json.backup` backup file
- Restore from the backup and re-run the migration
- Inspect the database directly with SQLite tools
## Performance Optimization

### Optimizations already in place
- Indexes to speed up queries
- Transactions for batched operations
- Connection reuse
- Row factory to reduce conversions

### Future directions
- Periodic cleanup of old tasks
- Data archiving
- Table partitioning
- Query caching
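For example, periodic cleanup of old completed tasks could look roughly like this. The retention window and the `status`/`completed_at` columns follow the schema above; the function name is illustrative, not part of the current codebase:

```python
from datetime import datetime, timedelta

def purge_old_tasks(conn, days=7):
    """Delete completed tasks older than the retention window; returns the number removed.

    `completed_at` stores ISO-format timestamps, so string comparison
    orders the same way as datetime comparison.
    """
    cutoff = (datetime.now() - timedelta(days=days)).isoformat()
    cur = conn.execute(
        "DELETE FROM tasks WHERE status = 'completed' AND completed_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Such a function could be invoked from a scheduled job or on worker startup.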
## Useful Commands

### Inspect the database
```bash
# Open the SQLite command-line tool
sqlite3 data/baijiahao.db

# Show the table schema
.schema tasks

# List all tasks
SELECT * FROM tasks;

# Count tasks per status
SELECT status, COUNT(*) FROM tasks GROUP BY status;
```

### Back up the database
```bash
# Simple copy
cp data/baijiahao.db data/baijiahao.db.backup

# Back up with the SQLite tool
sqlite3 data/baijiahao.db ".backup data/baijiahao.db.backup"
```

### Restore the database
```bash
# Restore from the backup
cp data/baijiahao.db.backup data/baijiahao.db
```

QUEUE_USAGE.md Normal file

@@ -0,0 +1,211 @@
# Task Queue Usage Guide

## 📋 Overview
A task queue system has been added, supporting **offline processing**, **progress tracking**, and **aggregated result export**.

## 🚀 Starting the Service
```bash
# Development
python app.py

# Production (recommended)
bash start.sh
```
On startup the service automatically:
- ✅ Creates the required directories (data/, data/results/)
- ✅ Starts the background task worker
- ✅ Processes any tasks already in the queue

## 💡 Usage

### Method 1: Instant export (existing feature)
1. Fill in the Baijiahao URL and parameters
2. Click the **"Instant Export"** button
3. Wait synchronously for processing to finish
4. Download the result immediately

**When to use:** you need the result right away

---

### Method 2: Queued processing (new feature)
1. Fill in the Baijiahao URL and parameters
2. Click the **"Add to Queue"** button
3. The task joins the queue and is processed in the background
4. You can keep adding more tasks
5. Check progress on the "Task Queue" page
6. Download the result when the task completes

**When to use:**
- Batch-processing multiple accounts
- The result is not needed immediately
- Avoiding long blocking waits
## 📊 Task Queue Management Page
Click the **"Task Queue"** button at the top to open the management page, where you can:

### 1. View statistics
- Total tasks
- Pending tasks
- Processing tasks
- Completed tasks
- Failed tasks

### 2. Filter tasks
- All
- Pending
- Processing
- Completed
- Failed

### 3. View task details
Each task shows:
- Baijiahao URL
- Task status
- Progress (0-100%)
- Current step description
- Time range
- Creation time
- Total article count
- Whether a proxy is used

### 4. Download results
- Completed tasks show a "Download Result" button
- Click it to download the Excel file

### 5. Auto-refresh
- The page refreshes automatically every 5 seconds
- So you always see the latest progress
## 🗂️ Data Storage

### Task queue file
```
data/task_queue.json
```
Stores the status, progress, and configuration of every task

### Exported result files
```
data/results/百家号文章_{app_id}_{timestamp}.xlsx
```
One Excel result file per task

## 📝 Task Processing Flow
```
User adds a task
Task joins the queue (pending)
Background worker picks up the task
Task is marked as processing
Step 1: Parse URL, resolve UK (10%)
Step 2: Initialize the spider (20%)
Step 3: Fetch the article list (30%)
Step 4: Process article data (50%-90%)
Step 5: Generate the Excel file (90%)
Task is marked as completed
User downloads the result
```
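The flow above can be sketched as a minimal processing loop. The step labels and percentages mirror the diagram; the function and callback names are illustrative, not the actual `task_worker.py` API:

```python
# Step percentages and labels from the flow above
STEPS = [
    (10, "Parse URL, resolve UK"),
    (20, "Initialize the spider"),
    (30, "Fetch the article list"),
    (50, "Process article data"),
    (90, "Generate the Excel file"),
]

def process_task(task, update):
    """Walk a task through the pipeline, reporting progress via update(percent, step)."""
    task["status"] = "processing"
    for percent, step in STEPS:
        update(percent, step)
        # ... the real work for this step would run here ...
    task["status"] = "completed"
    update(100, "Done")
    return task
```

A progress callback like `lambda p, s: history.append((p, s))` is enough to drive the page's progress bar and step text.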
## 🔄 Task Statuses
| Status | Meaning | Color |
|------|------|---------|
| pending | waiting to be processed | yellow |
| processing | currently processing | blue |
| completed | finished successfully | green |
| failed | processing failed | red |

## ⚙️ Technical Features

### 1. Offline processing
- ✅ No waiting after a task is added
- ✅ Processed automatically in the background
- ✅ Batch additions supported

### 2. Progress tracking
- ✅ Live progress percentage
- ✅ Shows the current step
- ✅ Shows how many articles have been processed

### 3. Error handling
- ✅ Failed tasks show the error message
- ✅ Automatic IP rotation when a proxy fails
- ✅ Automatic retry when anti-crawling is detected

### 4. Data persistence
- ✅ Task state saved to a local JSON file
- ✅ Processing resumes after a service restart
- ✅ Result files are kept permanently

### 5. User isolation
- ✅ Each user sees only their own tasks
- ✅ Statistics are filtered per user
- ✅ Download permission checks
## 🎯 Best Practices
1. **Large batches**
   - Use "Add to Queue"
   - Add many accounts at once
   - Let the system work through them
2. **Urgent needs**
   - Use "Instant Export"
   - Get results in real time
3. **Proxy configuration**
   - The proxy IP pool is enabled by default
   - The system handles anti-crawling automatically
   - The IP is rotated as soon as anti-crawling is detected
4. **Regular cleanup**
   - Completed tasks are kept for 7 days
   - Old tasks can be deleted manually (extensible feature)

## 🐛 FAQ
**Q: A task stays in "pending" forever?**
A: Check that the background worker is running; see the console logs

**Q: What if a task fails?**
A: Check the failure reason, adjust the parameters, and add the task again

**Q: How many tasks run at the same time?**
A: Currently one task at a time, processed in queue order

**Q: Where are the result files?**
A: Under `data/results/`; file names include the app_id and a timestamp

## 🔧 Developer Notes

### Core files
- `task_queue.py` - task queue management
- `task_worker.py` - background worker
- `templates/queue.html` - queue management page
- `data/task_queue.json` - task data storage

### API endpoints
- `POST /api/queue/add` - add a task
- `GET /api/queue/tasks` - list tasks
- `GET /api/queue/task/<id>` - task details
- `GET /api/queue/stats` - statistics
- `GET /api/queue/download/<id>` - download a result
---
**Enjoy efficient batch processing!** 🎉

README.md Normal file

@@ -0,0 +1,247 @@
# Baijiahao Article Export Tool
A web tool for exporting the articles a Baijiahao author published within a given time range.

## Quick Start

### Option 1: Start with Gunicorn (recommended for production)
```bash
# Make the scripts executable (first run only)
chmod +x start.sh stop.sh

# Install gunicorn (if not installed)
pip install gunicorn

# Start with Gunicorn (default)
./start.sh
# Or explicitly
./start.sh gunicorn

# Stop the service
./stop.sh
```

### Option 2: Start with nohup (development/testing)
```bash
# Start in nohup mode
./start.sh nohup

# Stop the service
./stop.sh
```

### Option 3: Manual start
```bash
# 1. Create a virtual environment (first run only)
python3 -m venv .venv

# 2. Activate it
source .venv/bin/activate

# 3. Install dependencies (first run only)
pip install -r requirements.txt

# 4. Start the service
python app.py
```
The service starts at `http://127.0.0.1:8030`
## Features
- 📝 Export a Baijiahao author's articles within a chosen time range
- 📋 Task queue with offline processing
- 🔄 Dynamic concurrency with adaptive thread count
- 📊 Excel output
- 🎯 Includes article title, link, and publication time
- 🎨 Clean web UI (DingTalk tech-blue style)
- 🔐 User login and permission system

## Tech Stack
- **Backend**: Python + Flask
- **Frontend**: HTML + CSS + jQuery
- **Data processing**: Pandas + BeautifulSoup4
- **Excel export**: OpenPyXL

## Installation

### 1. Clone the project
```bash
git clone <repository-url>
cd ai_baijiahao
```

### 2. Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows
```

### 3. Install dependencies
```bash
pip install -r requirements.txt
```

### 4. Start the service
```bash
# With the startup script (Linux/Mac)
chmod +x start.sh
./start.sh

# Or manually
python app.py
```
The service starts at `http://127.0.0.1:8030`

## Usage

### Log in
1. Register an account on first visit
2. Log in with your username and password

### Instant export
1. Open the Baijiahao author's homepage in a browser and copy the full URL
   - e.g. `https://baijiahao.baidu.com/u?app_id=1700253559210167`
2. Enter the URL on the tool page and choose a time range
3. Click "Start Export" and wait for the data to be fetched
4. When the export succeeds, click "Download Excel File" to save it

### Queued export
1. Open the "Task Queue" menu
2. Add multiple export tasks to the queue
3. The system processes them concurrently (dynamically adjusting 1-3 threads)
4. When a task completes, click "View" to download the Excel file
## Production Deployment

### Option 1: systemd service (recommended)
**Pros**:
- ✅ Automatic restart on crashes
- ✅ Start on boot
- ✅ Resource limits
- ✅ Log management
- ✅ Service monitoring

**Installation**:
```bash
# 1. Install the service
sudo chmod +x install_service.sh
sudo ./install_service.sh

# 2. Start the service
sudo systemctl start baijiahao

# 3. Check status
sudo systemctl status baijiahao

# 4. Follow the logs
sudo journalctl -u baijiahao -f
```
**Common commands**:
```bash
# Start/stop/restart
sudo systemctl start baijiahao
sudo systemctl stop baijiahao
sudo systemctl restart baijiahao

# Status and logs
sudo systemctl status baijiahao
sudo journalctl -u baijiahao -f

# Disable/enable start on boot
sudo systemctl disable baijiahao
sudo systemctl enable baijiahao
```

### Option 2: nohup (simple setups)
**Pros**: quick and simple
**Cons**: no auto-restart, no start on boot, harder to manage
```bash
# Use the provided startup script
./start.sh

# Or nohup manually
nohup python app.py > logs/app.log 2>&1 &
```

### Option 3: Supervisor (alternative)
Install Supervisor:
```bash
sudo apt-get install supervisor
```
Create the config file `/etc/supervisor/conf.d/baijiahao.conf`:
```ini
[program:baijiahao]
command=/var/www/ai_baijiahao/.venv/bin/python app.py
directory=/var/www/ai_baijiahao
user=www-data
autostart=true
autorestart=true
stdout_logfile=/var/www/ai_baijiahao/logs/app.log
stderr_logfile=/var/www/ai_baijiahao/logs/error.log
```
Start the service:
```bash
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl start baijiahao
```
## Project Structure
```
ai_baijiahao/
├── app.py              # Flask backend
├── requirements.txt    # Python dependencies
├── templates/          # HTML templates
│   └── index.html
├── static/             # Static assets
│   ├── css/
│   │   └── style.css
│   └── js/
│       └── main.js
└── exports/            # Excel export directory (auto-created)
```

## Notes
- Make sure you enter a valid Baijiahao author homepage URL
- Exporting can take a while; please be patient
- The more articles there are, the longer the export takes
- This tool is for learning and research purposes only

## License
For learning and research purposes only


@@ -0,0 +1,179 @@
# Fixing Stuck TaskWorker Tasks

## Symptom
In production, all tasks get stuck in the "pending" state and are never processed.

## Root Cause
When deployed under Gunicorn, the TaskWorker may fail to start, or crash mid-run, because of:
1. **Multi-process contention**: several worker processes starting at once and conflicting
2. **Stale lock file**: the lock file was not cleaned up after an abnormal exit
3. **Thread crash**: the worker thread died on an exception

## Solutions

### Option 1: Use the diagnostic tool (recommended)
A dedicated diagnostic and repair tool, `check_taskworker.py`, is included:

#### Check status
```bash
python check_taskworker.py
```

#### Auto-repair
```bash
python check_taskworker.py --fix
```

### Option 2: Restart the service manually
```bash
# Stop Gunicorn
kill -TERM $(cat gunicorn.pid)

# Remove the lock file
rm -f data/taskworker.lock

# Start again
gunicorn -c gunicorn_config.py app:app
```

### Option 3: Run the monitoring daemon (recommended in production)
Start the monitor, which checks periodically and repairs automatically:
```bash
# Run in the background
nohup python taskworker_monitor.py > logs/monitor.out 2>&1 &

# Or manage it with systemd (recommended)
sudo systemctl start baijiahao-monitor
```
## Prevention

### 1. Tuned Gunicorn configuration
`gunicorn_config.py` has been updated to use a real file lock (fcntl) instead of a simple existence check, avoiding the race condition.
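A minimal sketch of that pattern (the function name and file layout are illustrative; `fcntl` is Unix-only). The key property is that the lock is tied to the open file descriptor, so it is released automatically when the process dies, which is exactly what a plain existence check cannot guarantee:

```python
import fcntl
import os

def acquire_worker_lock(path="data/taskworker.lock"):
    """Try to take an exclusive non-blocking lock.

    Returns the open fd on success (keep it open for the process lifetime),
    or None if another process already holds the lock.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)  # lock is held elsewhere; a stale file alone no longer blocks us
        return None
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())  # PID recorded for diagnostics only
    return fd
```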
### 2. Health-check endpoint
A health-check endpoint was added to `app.py`:
```python
@app.route('/health/taskworker')
def health_taskworker():
    """TaskWorker health check."""
    try:
        from task_worker import get_task_worker
        worker = get_task_worker()
        alive_threads = sum(1 for t in worker.worker_threads if t and t.is_alive())
        return jsonify({
            'status': 'healthy' if worker.running and alive_threads > 0 else 'unhealthy',
            'running': worker.running,
            'alive_threads': alive_threads,
            'current_workers': worker.current_workers,
            'processing_tasks': len(worker.processing_tasks)
        })
    except Exception as e:
        return jsonify({'status': 'error', 'message': str(e)}), 500
```
### 3. Manage with Supervisor (optional)
Create `supervisor.conf`:
```ini
[program:baijiahao]
command=/path/to/venv/bin/gunicorn -c gunicorn_config.py app:app
directory=/path/to/ai_baijiahao
user=www-data
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/path/to/ai_baijiahao/logs/supervisor.log

[program:baijiahao-monitor]
command=/path/to/venv/bin/python taskworker_monitor.py
directory=/path/to/ai_baijiahao
user=www-data
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/path/to/ai_baijiahao/logs/monitor_supervisor.log
```
## Routine Maintenance

### Viewing logs
```bash
# TaskWorker log
tail -f logs/gunicorn_error.log | grep TaskWorker

# Monitor log
tail -f logs/taskworker_monitor.log
```

### Periodic cleanup
```bash
# Remove task results older than 30 days
find data/results -name "*.xlsx" -mtime +30 -delete

# Remove old logs
find logs -name "*.log" -mtime +30 -delete
```

## Monitoring and Alerting
With a monitoring stack (e.g. Prometheus + Grafana) you can track:
1. **TaskWorker status**: `/health/taskworker`
2. **Pending task count**: via the queue statistics API
3. **Processing task count**
4. **Average task processing time**
## FAQ

### Q1: Are tasks lost after a restart?
**A**: No. All tasks are stored in the SQLite database, and processing resumes automatically after a restart.

### Q2: How do I change the concurrency?
**A**: Adjust the `TaskWorker` init parameters in `task_worker.py`:
```python
worker = TaskWorker(min_workers=1, max_workers=3)  # tune these two parameters
```

### Q3: Does the monitor consume many resources?
**A**: Very little. It checks once every 60 seconds and uses almost no CPU or memory.

## Changes in This Update
This update includes the following improvements:
1. **More robust file locking**: `fcntl` instead of a simple existence check
2. **Startup verification**: confirms the TaskWorker is actually running after start
3. **Diagnostic tool**: `check_taskworker.py` for quick troubleshooting
4. **Automatic monitoring**: `taskworker_monitor.py` detects and repairs problems
5. **Detailed logging**: records the startup process and exceptions
## Support
If the problem persists, please provide:
1. Recent entries from `logs/gunicorn_error.log`
2. The output of `python check_taskworker.py`
3. The number and status of pending tasks in the database
```bash
# Quick diagnostics
echo "=== Gunicorn processes ==="
ps aux | grep gunicorn

echo "=== TaskWorker lock file ==="
ls -lh data/taskworker.lock

echo "=== Task statistics ==="
python check_taskworker.py

echo "=== Recent logs ==="
tail -n 50 logs/gunicorn_error.log
```

app.py Normal file

File diff suppressed because it is too large

baidu_api.py Normal file

@@ -0,0 +1,633 @@
import asyncio
import json
import random
import time
from typing import Dict, Any, Optional
from urllib.parse import quote
import aiohttp
from playwright.async_api import async_playwright
from fake_useragent import UserAgent
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class BaiduBJHSpider:
    def __init__(self, use_proxy: bool = False, proxy_api_url: str = None, proxy_username: str = None, proxy_password: str = None):
        self.ua = UserAgent()
        self.use_proxy = use_proxy
        self.proxy_api_url = proxy_api_url or 'http://api.tianqiip.com/getip?secret=lu29e593&num=1&type=txt&port=1&mr=1&sign=4b81a62eaed89ba802a8f34053e2c964'
        self.proxy_username = proxy_username
        self.proxy_password = proxy_password
        self.current_proxy = None
        self.session_cookie = None
    def get_proxy(self):
        """Fetch one proxy IP from the proxy pool."""
        if not self.use_proxy:
            return None
        try:
            import requests
            logger.info(f"Fetching IP from proxy pool: {self.proxy_api_url}")
            response = requests.get(self.proxy_api_url, timeout=5)  # short timeout: 5 seconds
            content = response.content.decode("utf-8").strip()
            logger.info(f"Got proxy IP: {content}")
            if ':' in content:
                ip, port = content.strip().split(":", 1)
                # Embed credentials in the proxy URL if provided
                if self.proxy_username and self.proxy_password:
                    proxy_url = f"http://{self.proxy_username}:{self.proxy_password}@{ip}:{port}"
                    logger.info(f"Proxy configured (with auth): http://{self.proxy_username}:****@{ip}:{port}")
                else:
                    proxy_url = f"http://{ip}:{port}"
                    logger.info(f"Proxy configured: {proxy_url}")
                self.current_proxy = proxy_url
                return proxy_url
            else:
                logger.error("Malformed proxy IP")
                return None
        except Exception as e:
            logger.error(f"Failed to fetch proxy IP: {e}")
            return None
    async def init_browser(self):
        """Launch a browser to obtain cookies."""
        playwright = await async_playwright().start()
        # Browser launch arguments
        browser_args = [
            '--disable-blink-features=AutomationControlled',
            '--disable-web-security',
            '--disable-features=IsolateOrigins,site-per-process',
            '--no-sandbox',
            '--disable-setuid-sandbox',
        ]
        # Launch the browser
        browser = await playwright.chromium.launch(
            headless=True,  # headless mode
            args=browser_args
        )
        # Create a browsing context
        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent=self.ua.random,
            locale='zh-CN',
            timezone_id='Asia/Shanghai'
        )
        # Extra HTTP headers
        await context.set_extra_http_headers({
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        })
        page = await context.new_page()
        # Visit the Baidu homepage first to pick up base cookies
        await page.goto('https://www.baidu.com', wait_until='networkidle')
        await asyncio.sleep(random.uniform(2, 4))
        # Then visit the Baijiahao page
        await page.goto('https://baijiahao.baidu.com/', wait_until='networkidle')
        await asyncio.sleep(random.uniform(3, 5))
        # Collect cookies
        cookies = await context.cookies()
        self.session_cookie = '; '.join([f"{c['name']}={c['value']}" for c in cookies])
        logger.info(f"Got cookies: {self.session_cookie[:50]}...")
        await browser.close()
        await playwright.stop()
        return cookies
    def build_headers(self, referer: str = "https://baijiahao.baidu.com/") -> Dict:
        """Build request headers."""
        headers = {
            'User-Agent': self.ua.random,
            'Accept': '*/*',
            'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
            'Accept-Encoding': 'gzip, deflate, br',
            'Referer': referer,
            'Connection': 'keep-alive',
            'Sec-Fetch-Dest': 'script',
            'Sec-Fetch-Mode': 'no-cors',
            'Sec-Fetch-Site': 'same-site',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache',
        }
        if self.session_cookie:
            headers['Cookie'] = self.session_cookie
        return headers

    def generate_callback_name(self) -> str:
        """Generate a random JSONP callback name."""
        timestamp = int(time.time() * 1000)
        return f"__jsonp{timestamp}"
    async def fetch_data_directly(self, uk: str = "ntHidnLhrlfclJar2z8wBg", use_browser: bool = False, num: int = 10,
                                  ctime: str = None) -> Optional[Dict]:
        """Request the API directly (may need several attempts).

        Args:
            uk: author UK
            use_browser: whether to launch a browser for cookies first (default False: no browser)
            num: number of items per request (the API is fixed at 10)
            ctime: pagination parameter, the query.ctime value from the previous response
        """
        # Only launch a browser for cookies when use_browser=True
        if use_browser:
            await self.init_browser()
        # When the proxy is enabled, a proxy IP is mandatory; never fall back to the local IP
        if self.use_proxy:
            if not self.current_proxy:
                proxy = self.get_proxy()
                if not proxy:
                    raise Exception("Proxy enabled but no proxy IP available; refusing to use the local IP")
        async with aiohttp.ClientSession() as session:
            for attempt in range(10):  # up to 10 retries to ride out IP-pool throttling
                try:
                    callback_name = self.generate_callback_name()
                    timestamp = int(time.time() * 1000)
                    # Build the URL parameters
                    params = {
                        'tab': 'main',
                        'num': '10',  # fixed at 10 by the API
                        'uk': uk,
                        'source': 'pc',
                        'type': 'newhome',
                        'action': 'dynamic',
                        'format': 'jsonp',
                        'callback': callback_name,
                        'otherext': f'h5_{time.strftime("%Y%m%d%H%M%S")}',
                        'Tenger-Mhor': str(timestamp),
                        '_': str(timestamp)  # timestamp parameter
                    }
                    # Pass ctime when present (used for pagination)
                    if ctime:
                        params['ctime'] = ctime
                    url = "https://mbd.baidu.com/webpage"
                    headers = self.build_headers()
                    logger.info(f"Attempt {attempt + 1}, URL: {url}")
                    # Prepare the request arguments
                    request_kwargs = {
                        'params': params,
                        'headers': headers,
                        'timeout': aiohttp.ClientTimeout(total=30)
                    }
                    # With the proxy enabled, a proxy must be configured before requesting
                    if self.use_proxy:
                        if not self.current_proxy:
                            raise Exception("Proxy enabled but no current proxy IP; refusing to use the local IP")
                        logger.info(f"Using proxy: {self.current_proxy}")
                        request_kwargs['proxy'] = self.current_proxy
                    async with session.get(url, **request_kwargs) as response:
                        text = await response.text()
                        # Extract the JSONP payload
                        if text.startswith(callback_name + '(') and text.endswith(')'):
                            json_str = text[len(callback_name) + 1:-1]
                            data = json.loads(json_str)
                            # Check for the anti-crawling flag
                            if data.get('data', {}).get('foe', {}).get('is_need_foe') == True:
                                logger.warning(f"Anti-crawling flag detected (is_need_foe=True), attempt {attempt + 1}")
                                # Rotate the proxy IP immediately if enabled
                                if self.use_proxy:
                                    logger.info("Anti-crawling detected, rotating proxy IP immediately")
                                    self.get_proxy()
                                # Keep retrying
                                if attempt < 9:  # retries remain (10 in total)
                                    continue
                            return data
                except aiohttp.ClientConnectorError as e:
                    logger.error(f"❌ Connection failed (attempt {attempt + 1}/10): {type(e).__name__} - {str(e)[:100]}")
                    if self.use_proxy:
                        logger.info("🔄 Network error, rotating proxy IP immediately")
                        self.get_proxy()
                except asyncio.TimeoutError:
                    logger.error(f"❌ Request timed out (attempt {attempt + 1}/10): proxy took more than 30 seconds")
                    if self.use_proxy:
                        logger.info("🔄 Timeout, rotating proxy IP immediately")
                        self.get_proxy()
                except aiohttp.ClientProxyConnectionError as e:
                    logger.error(f"❌ Proxy connection failed (attempt {attempt + 1}/10): {e}")
                    # Proxy failed: fetch a new one right away
                    if self.use_proxy:
                        logger.info("🔄 Proxy failure, rotating proxy IP immediately")
                        self.get_proxy()
                    # Proxy errors retry without a delay
                except aiohttp.ClientResponseError as e:
                    # 407 means proxy auth failure / IP-pool throttling
                    if e.status == 407:
                        logger.warning("Got 407 (proxy IP pool throttled); waiting 10 seconds before fetching a new IP...")
                        await asyncio.sleep(10)  # give the IP pool time to recover
                        if self.use_proxy:
                            logger.info("Fetching a new proxy IP...")
                            self.get_proxy()
                        # Keep retrying
                    else:
                        logger.error(f"❌ HTTP error (attempt {attempt + 1}/10): {e.status}, {e.message}")
                        await asyncio.sleep(random.uniform(1, 2))
                except Exception as e:
                    logger.error(f"❌ Unknown error (attempt {attempt + 1}/10): {type(e).__name__} - {str(e)[:100]}")
                    await asyncio.sleep(random.uniform(1, 2))  # short 1-2 second backoff
            # All 10 retries failed
            logger.error("Request failed after 10 retries; likely IP-pool throttling or a network problem")
            return None
    async def fetch_via_browser(self, uk: str = "ntHidnLhrlfclJar2z8wBg") -> Optional[Dict]:
        """Fetch the data by driving a real browser (the most reliable approach)."""
        playwright = await async_playwright().start()
        try:
            browser = await playwright.chromium.launch(
                headless=False,  # set to False when debugging
                args=[
                    '--disable-blink-features=AutomationControlled',
                    '--no-sandbox'
                ]
            )
            context = await browser.new_context(
                viewport={'width': 1920, 'height': 1080},
                user_agent=self.ua.random,
                locale='zh-CN'
            )
            page = await context.new_page()
            # Listen for network responses
            results = []

            def handle_response(response):
                if "mbd.baidu.com/webpage" in response.url and "format=jsonp" in response.url:
                    try:
                        # Try to extract the JSONP payload
                        text = response.text()
                        if "callback=" in response.url:
                            # Extract the callback name from the URL
                            import re
                            match = re.search(r'callback=([^&]+)', response.url)
                            if match:
                                callback = match.group(1)
                                if text.startswith(callback + '(') and text.endswith(')'):
                                    json_str = text[len(callback) + 1:-1]
                                    data = json.loads(json_str)
                                    results.append(data)
                    except:
                        pass

            page.on("response", handle_response)
            # Visit the Baijiahao page
            await page.goto(f"https://baijiahao.baidu.com/u?app_id={uk}", wait_until='networkidle')
            # Simulate user scrolling
            for _ in range(3):
                await page.evaluate("window.scrollBy(0, window.innerHeight)")
                await asyncio.sleep(random.uniform(1, 2))
            # Wait for the data to load
            await asyncio.sleep(5)
            await browser.close()
            if results:
                return results[0]
        except Exception as e:
            logger.error(f"Browser-based fetch failed: {e}")
        finally:
            await playwright.stop()
        return None
    async def fetch_with_signature(self, uk: str = "ntHidnLhrlfclJar2z8wBg") -> Optional[Dict]:
        """Try requesting with signature parameters."""
        # The Baidu API may require specific signature parameters;
        # finding the signing algorithm would require analyzing the JavaScript.
        async with aiohttp.ClientSession() as session:
            # Fetch the required token first
            token_url = "https://mbd.baidu.com/staticx/search/dynamic/config"
            headers = {
                'User-Agent': self.ua.random,
                'Referer': 'https://baijiahao.baidu.com/',
            }
            try:
                # Fetch the configuration
                async with session.get(token_url, headers=headers) as resp:
                    config_text = await resp.text()
                    logger.info(f"Config response: {config_text[:200]}")
                # Build the full request
                timestamp = int(time.time() * 1000)
                params = {
                    'tab': 'main',
                    'num': '10',
                    'uk': uk,
                    'source': 'pc',
                    'type': 'newhome',
                    'action': 'dynamic',
                    'format': 'json',
                    't': str(timestamp),
                    'callback': f'__jsonp{timestamp}',
                }
                # Try plain JSON (not JSONP)
                params['format'] = 'json'
                del params['callback']
                url = "https://mbd.baidu.com/webpage"
                async with session.get(url, params=params, headers=headers) as response:
                    text = await response.text()
                    logger.info(f"JSON response: {text[:500]}")
                    try:
                        return json.loads(text)
                    except:
                        return None
            except Exception as e:
                logger.error(f"Signature approach failed: {e}")
                return None
async def fetch_baidu_data(uk: str = "ntHidnLhrlfclJar2z8wBg", months: int = 6, use_proxy: bool = False, proxy_api_url: str = None,
                           on_page_fetched=None, start_page: int = 1, start_ctime: str = None) -> Optional[Dict]:
    """Main entry point for fetching Baijiahao data.

    Args:
        uk: author UK
        months: how many recent months to fetch (default 6; fractions allowed, e.g. 0.33 for roughly 10 days)
        use_proxy: whether to use the proxy IP pool
        proxy_api_url: proxy API address (empty for the default)
        on_page_fetched: callback invoked after each page, signature: (page, items, ctime) -> None
        start_page: starting page number (for resuming)
        start_ctime: starting pagination parameter (for resuming)
    """
    from datetime import datetime, timedelta
    import re
    spider = BaiduBJHSpider(use_proxy=use_proxy, proxy_api_url=proxy_api_url)
    # Compute the cutoff date (fractional months supported)
    days = int(months * 30)
    target_date = datetime.now() - timedelta(days=days)
    if months < 1:
        logger.info(f"Fetching Baijiahao data (last {days} days, cutoff date: {target_date.strftime('%Y-%m-%d')})")
    else:
        logger.info(f"Fetching Baijiahao data (last {int(months)} months, cutoff date: {target_date.strftime('%Y-%m-%d')})")
    # Fetch the first page (always 10 items per request).
    # Note: articles are no longer accumulated in memory; each page is saved via the callback.
    page = start_page  # supports starting from an arbitrary page
    current_ctime = start_ctime  # supports reusing a saved pagination parameter
    # When resuming, skip the first page and use the saved ctime
    if start_page > 1 and start_ctime:
        logger.info(f"Resuming: starting from page {start_page}, ctime={start_ctime}")
        data = None  # the first page's data structure is not needed
    else:
        # Optimization: hit the API directly; only launch a browser on failure
        logger.info("Trying the API directly (no browser)...")
        data = await spider.fetch_data_directly(uk, use_browser=False, ctime=current_ctime)
        # If the first attempt failed, retry with a browser
        if not data or not data.get('data', {}).get('list'):
            if start_page == 1:  # only the first page warrants a browser retry
                logger.warning("Direct request failed; launching a browser to get cookies...")
                # Log what the first request returned
                if data:
                    logger.warning(f"First response: {json.dumps(data, ensure_ascii=False, indent=2)}")
                else:
                    logger.warning("First response: None")
                data = await spider.fetch_data_directly(uk, use_browser=True)
                if not data or not data.get('data', {}).get('list'):
                    logger.error("Still failing after the browser retry; giving up")
                    # Log the final response
                    if data:
                        logger.error(f"Final response: {json.dumps(data, ensure_ascii=False, indent=2)}")
                    else:
                        logger.error("Final response: None")
                    return None
        # First request succeeded: process the data (non-resume path only)
        if data and data.get('data', {}).get('list'):
            items = data.get('data', {}).get('list', [])
            logger.info(f"Page {page} fetched, {len(items)} items")
            # Save the first page via the callback
            if on_page_fetched:
                on_page_fetched(page, items, current_ctime)
            # Extract the first page's ctime for pagination; note the path is data.data.query.ctime
            current_ctime = data.get('data', {}).get('query', {}).get('ctime', current_ctime)
            if current_ctime:
                logger.info(f"Got pagination parameter ctime={current_ctime}")
            else:
                logger.warning("No ctime pagination parameter returned")
    # Use ctime (a Unix timestamp) for time comparisons; it is more accurate
    def get_article_datetime(item_data: dict) -> datetime:
        """Extract the article time from itemData.

        Prefers ctime (a Unix timestamp), which is more accurate.
        """
        # Prefer ctime (a second-precision Unix timestamp)
        if 'ctime' in item_data and item_data['ctime']:
            try:
                timestamp = int(item_data['ctime'])
                return datetime.fromtimestamp(timestamp)
            except:
                pass
        # Fallback: the time field (a relative or absolute time string)
        time_str = item_data.get('time', '')
        if not time_str:
            return datetime.now()
        now = datetime.now()
        if '分钟前' in time_str:  # "N minutes ago"
            minutes = int(re.search(r'(\d+)', time_str).group(1))
            return now - timedelta(minutes=minutes)
        elif '小时前' in time_str:  # "N hours ago"
            hours = int(re.search(r'(\d+)', time_str).group(1))
            return now - timedelta(hours=hours)
        elif '天前' in time_str or '昨天' in time_str:  # "N days ago" / "yesterday"
            if '昨天' in time_str:
                days = 1
            else:
                days = int(re.search(r'(\d+)', time_str).group(1))
            return now - timedelta(days=days)
        elif '-' in time_str:  # absolute time format
            try:
                return datetime.strptime(time_str, '%Y-%m-%d %H:%M')
            except:
                try:
                    return datetime.strptime(time_str, '%Y-%m-%d')
                except:
                    return now
        return now
    # Check the last article's time to decide whether more pages are needed
    need_more = True
    if data and data.get('data', {}).get('list'):
        items = data.get('data', {}).get('list', [])
        if items:
            last_item = items[-1]
            item_data = last_item.get('itemData', {})
            article_date = get_article_datetime(item_data)
            logger.info(f"Last article time: {article_date.strftime('%Y-%m-%d %H:%M:%S')}")
            if article_date < target_date:
                need_more = False
                if months < 1:
                    logger.info(
                        f"Last article time {article_date.strftime('%Y-%m-%d %H:%M:%S')} is outside the {days}-day window; stopping")
                else:
                    logger.info(
                        f"Last article time {article_date.strftime('%Y-%m-%d %H:%M:%S')} is outside the {int(months)}-month window; stopping")
        else:
            need_more = False
    elif start_page > 1:
        # When resuming, keep going by default
        need_more = True
    else:
        need_more = False
    # Keep requesting pages until the cutoff date is reached or no data remains (no page limit)
    while need_more:
        page += 1
        logger.info(f"More data needed; requesting page {page}...")
        # Random 8-12 second delay to avoid looking like a bot
        delay = random.uniform(8, 12)
        logger.info(f"Waiting {delay:.1f} seconds before the request...")
        await asyncio.sleep(delay)
        # Request the next page (using the ctime from the previous response)
        next_data = await spider.fetch_data_directly(uk, use_browser=False, ctime=current_ctime)
        # fetch_data_directly already handles anti-crawling and retries; just check for data here
        if not next_data or not next_data.get('data', {}).get('list'):
            # Still failing: check whether anti-crawling was the cause
            if next_data and next_data.get('data', {}).get('foe', {}).get('is_need_foe') == True:
                logger.error(f"Page {page} still triggers anti-crawling after retries; stopping")
                logger.error(f"Response: {json.dumps(next_data, ensure_ascii=False, indent=2)}")
            else:
                logger.warning(f"Page {page} returned no data; stopping")
                # Log the full response for debugging
                if next_data:
                    logger.warning(f"Response: {json.dumps(next_data, ensure_ascii=False, indent=2)}")
                else:
                    logger.warning("Response: None")
            break
        next_items = next_data.get('data', {}).get('list', [])
        logger.info(f"Page {page} fetched, {len(next_items)} items")
        # Save this page via the callback
        if on_page_fetched:
            on_page_fetched(page, next_items, current_ctime)
        # Update ctime for the next request; note the path is data.data.query.ctime
        current_ctime = next_data.get('data', {}).get('query', {}).get('ctime', current_ctime)
        if current_ctime:
            logger.info(f"Updated pagination parameter ctime={current_ctime}")
        # Check the last article's time
        if next_items:
            last_item = next_items[-1]
            item_data = last_item.get('itemData', {})
            article_date = get_article_datetime(item_data)
            logger.info(f"Last article time: {article_date.strftime('%Y-%m-%d %H:%M:%S')}")
            if article_date < target_date:
                need_more = False
                if months < 1:
                    logger.info(
                        f"Last article time {article_date.strftime('%Y-%m-%d %H:%M:%S')} is outside the {days}-day window; stopping")
                else:
                    logger.info(
                        f"Last article time {article_date.strftime('%Y-%m-%d %H:%M:%S')} is outside the {int(months)}-month window; stopping")
        else:
            need_more = False
    # Return the final pagination info (for resuming)
    result = {
        'last_page': page,
        'last_ctime': current_ctime,
        'completed': not need_more  # whether fetching finished
    }
    logger.info(f"Fetch complete; last page: {page}, ctime: {current_ctime}")
    return result
# Synchronous wrapper (for use from synchronous code)
def get_baidu_data_sync(uk: str = "ntHidnLhrlfclJar2z8wBg", months: int = 6, use_proxy: bool = False,
                        proxy_api_url: str = None, on_page_fetched=None,
                        start_page: int = 1, start_ctime: str = None) -> Optional[Dict]:
    """Fetch the data synchronously.

    Args:
        uk: author UK
        months: how many recent months to fetch (default 6)
        use_proxy: whether to use the proxy IP pool
        proxy_api_url: proxy API address (empty for the default)
        on_page_fetched: callback invoked after each page
        start_page: starting page number (for resuming)
        start_ctime: starting pagination parameter (for resuming)
    """
    return asyncio.run(fetch_baidu_data(uk, months, use_proxy, proxy_api_url,
                                        on_page_fetched, start_page, start_ctime))


# Keep a main function for testing
async def main():
    data = await fetch_baidu_data()
    if data:
        print(json.dumps(data, ensure_ascii=False, indent=2))
        from test2 import display_simple_data
        display_simple_data(data)

if __name__ == "__main__":
    asyncio.run(main())

check_taskworker.py Normal file

@@ -0,0 +1,222 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
TaskWorker status check and repair tool.
Diagnoses and fixes tasks stuck in the pending state.
"""
import os
import sys
import logging
import psutil
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s'
)
logger = logging.getLogger(__name__)


def check_taskworker_lock():
    """Check the TaskWorker lock file. Returns (lock_file_exists, live_pid_or_None)."""
    lock_file = 'data/taskworker.lock'
    if os.path.exists(lock_file):
        try:
            with open(lock_file, 'r') as f:
                pid = f.read().strip()
            logger.info(f"Found lock file, recorded PID: {pid}")
            # Check whether the process exists
            try:
                pid_int = int(pid)
                if psutil.pid_exists(pid_int):
                    proc = psutil.Process(pid_int)
                    logger.info(f"Process {pid} exists: {proc.name()} - {proc.status()}")
                    return True, pid_int
                else:
                    logger.warning(f"Process {pid} does not exist; the lock file is stale")
                    return True, None  # lock file present but no live process
            except ValueError:
                logger.error(f"Invalid lock file contents: {pid}")
                return True, None
        except Exception as e:
            logger.error(f"Failed to read the lock file: {e}")
            return True, None
    else:
        logger.info("No lock file found")
        return False, None
def check_pending_tasks():
    """Check how many tasks are pending."""
    try:
        from task_queue import get_task_queue
        queue = get_task_queue()
        tasks = queue.get_all_tasks()
        pending_tasks = [t for t in tasks if t.get('status') == 'pending']
        processing_tasks = [t for t in tasks if t.get('status') == 'processing']
        logger.info(f"Pending tasks: {len(pending_tasks)}")
        logger.info(f"Processing tasks: {len(processing_tasks)}")
        if pending_tasks:
            logger.info("Pending task list:")
            for task in pending_tasks[:5]:  # show at most 5
                logger.info(f" - {task['task_id']}: {task.get('url', 'N/A')[:50]}")
        return len(pending_tasks), len(processing_tasks)
    except Exception as e:
        logger.error(f"Failed to check tasks: {e}")
        return 0, 0


def check_worker_threads():
    """Check whether the TaskWorker threads are running."""
    try:
        from task_worker import get_task_worker
        worker = get_task_worker()
        logger.info(f"TaskWorker running: {worker.running}")
        logger.info(f"Current concurrency: {worker.current_workers}/{worker.max_workers}")
        logger.info(f"Worker threads: {len(worker.worker_threads)}")
        logger.info(f"Tasks in progress: {len(worker.processing_tasks)}")
        # Check whether the threads are alive
        alive_threads = sum(1 for t in worker.worker_threads if t and t.is_alive())
        logger.info(f"Alive threads: {alive_threads}")
        return worker.running, alive_threads
    except Exception as e:
        logger.error(f"Failed to check the TaskWorker: {e}")
        import traceback
        logger.error(traceback.format_exc())
        return False, 0


def restart_taskworker():
    """Restart the TaskWorker."""
    logger.info("Restarting the TaskWorker...")
    try:
        from task_worker import get_task_worker
        worker = get_task_worker()
        # Stop the existing worker
        if worker.running:
            logger.info("Stopping the existing TaskWorker...")
            worker.stop()
            time.sleep(2)
        # Start a new worker
        logger.info("Starting a new TaskWorker...")
        worker.start()
        time.sleep(1)
        # Verify it started
        running, alive_threads = check_worker_threads()
        if running and alive_threads > 0:
            logger.info("✅ TaskWorker restarted successfully")
            return True
        else:
            logger.error("❌ TaskWorker restart failed")
            return False
    except Exception as e:
        logger.error(f"Failed to restart the TaskWorker: {e}")
        import traceback
        logger.error(traceback.format_exc())
        return False


def clean_stale_lock():
    """Remove a stale lock file."""
    lock_file = 'data/taskworker.lock'
    if os.path.exists(lock_file):
        try:
            os.remove(lock_file)
            logger.info("✅ Removed the stale lock file")
            return True
        except Exception as e:
            logger.error(f"Failed to remove the lock file: {e}")
            return False
    return True
def main():
"""主函数"""
print("=" * 60)
print("TaskWorker 状态检查工具")
print("=" * 60)
# 1. 检查锁文件
print("\n[1] 检查锁文件...")
lock_exists, lock_pid = check_taskworker_lock()
# 2. 检查待处理任务
print("\n[2] 检查任务队列...")
pending_count, processing_count = check_pending_tasks()
# 3. 检查 Worker 线程
print("\n[3] 检查 TaskWorker 状态...")
try:
is_running, alive_threads = check_worker_threads()
except Exception:
is_running, alive_threads = False, 0
# 4. 诊断和修复
print("\n[4] 诊断结果:")
print("-" * 60)
need_fix = False
if pending_count > 0 and alive_threads == 0:
print("❌ 问题: 有待处理任务,但没有活跃的工作线程")
need_fix = True
if lock_exists and not lock_pid:
print("⚠️ 警告: 锁文件存在但进程不存在(僵尸锁)")
need_fix = True
if not is_running:
print("❌ 问题: TaskWorker 未运行")
need_fix = True
if not need_fix:
print("✅ TaskWorker 运行正常")
return
# 5. 修复
print("\n[5] 开始修复...")
print("-" * 60)
if '--fix' in sys.argv or '--auto-fix' in sys.argv:
# 清理失效的锁文件
clean_stale_lock()
# 重启 TaskWorker
if restart_taskworker():
print("\n✅ 修复完成!")
print("\n重新检查状态...")
time.sleep(2)
check_worker_threads()
check_pending_tasks()
else:
print("\n❌ 修复失败,请手动重启服务")
else:
print("\n提示: 使用 --fix 参数自动修复问题")
print("示例: python check_taskworker.py --fix")
if __name__ == '__main__':
try:
main()
except KeyboardInterrupt:
print("\n\n用户中断")
except Exception as e:
logger.error(f"执行失败: {e}")
import traceback
traceback.print_exc()
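check_taskworker_lock 里"锁文件对应的进程是否还活着"这一判断,核心是向目标 PID 发送 0 号信号。下面给出一个独立的最小示意仅适用于 POSIX 系统pid_alive 为示意用的假设函数名,非本项目代码:

```python
import os

def pid_alive(pid: int) -> bool:
    """发送 0 号信号探测进程是否存在(不会真正影响目标进程)"""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # 进程不存在
    except PermissionError:
        return True           # 进程存在,但无权限向其发信号
    return True

print(pid_alive(os.getpid()))  # 当前进程必然存活,输出 True
```

os.kill(pid, 0) 只做存在性检查,这也是许多进程监控脚本判断"僵尸锁"的常用手法。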

413
database.py Normal file
View File

@@ -0,0 +1,413 @@
# -*- coding: utf-8 -*-
"""
SQLite 数据库管理模块
用于替换原有的 JSON 文件存储方式
"""
import sqlite3
import os
import logging
from datetime import datetime
from contextlib import contextmanager
import threading
logger = logging.getLogger(__name__)
class Database:
"""SQLite 数据库管理器"""
def __init__(self, db_path="data/baijiahao.db"):
self.db_path = db_path
self._local = threading.local()
self._ensure_database()
def _ensure_database(self):
"""确保数据库文件和表结构存在"""
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
with self.get_connection() as conn:
cursor = conn.cursor()
# 创建任务表
cursor.execute('''
CREATE TABLE IF NOT EXISTS tasks (
task_id TEXT PRIMARY KEY,
url TEXT NOT NULL,
months REAL NOT NULL,
use_proxy INTEGER NOT NULL,
proxy_api_url TEXT,
username TEXT,
status TEXT NOT NULL,
created_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
paused_at TEXT,
progress INTEGER DEFAULT 0,
current_step TEXT,
total_articles INTEGER DEFAULT 0,
processed_articles INTEGER DEFAULT 0,
error TEXT,
result_file TEXT,
retry_count INTEGER DEFAULT 0,
last_error TEXT,
articles_only INTEGER DEFAULT 1,
last_page INTEGER DEFAULT 0,
last_ctime TEXT
)
''')
# 创建任务日志表
cursor.execute('''
CREATE TABLE IF NOT EXISTS task_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id TEXT NOT NULL,
timestamp TEXT NOT NULL,
message TEXT NOT NULL,
level TEXT DEFAULT 'info',
FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE
)
''')
# 创建文章缓存表(用于断点续传)
cursor.execute('''
CREATE TABLE IF NOT EXISTS article_cache (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id TEXT NOT NULL,
title TEXT NOT NULL,
url TEXT,
publish_time TEXT,
page_num INTEGER,
created_at TEXT NOT NULL,
FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE
)
''')
# 创建索引提升查询性能
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_tasks_status
ON tasks(status)
''')
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_tasks_username
ON tasks(username)
''')
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_tasks_created_at
ON tasks(created_at DESC)
''')
# 为日志表创建索引
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_task_logs_task_id
ON task_logs(task_id)
''')
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_task_logs_timestamp
ON task_logs(timestamp)
''')
# 为文章缓存表创建索引
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_article_cache_task_id
ON article_cache(task_id)
''')
cursor.execute('''
CREATE INDEX IF NOT EXISTS idx_article_cache_page
ON article_cache(task_id, page_num)
''')
conn.commit()
logger.info(f"数据库初始化完成: {self.db_path}")
@contextmanager
def get_connection(self):
"""获取线程安全的数据库连接(上下文管理器)"""
if not hasattr(self._local, 'conn') or self._local.conn is None:
self._local.conn = sqlite3.connect(
self.db_path,
check_same_thread=False,
timeout=30.0
)
# 设置返回字典而不是元组
self._local.conn.row_factory = sqlite3.Row
try:
yield self._local.conn
except Exception as e:
self._local.conn.rollback()
logger.error(f"数据库操作失败: {e}")
raise
def close(self):
"""关闭当前线程的数据库连接"""
if hasattr(self._local, 'conn') and self._local.conn is not None:
self._local.conn.close()
self._local.conn = None
def add_task_log(self, task_id, message, level='info', timestamp=None):
"""添加任务日志
Args:
task_id: 任务ID
message: 日志消息
level: 日志级别 (info/success/warning/error)
timestamp: 时间戳,默认为当前时间
"""
if timestamp is None:
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('''
INSERT INTO task_logs (task_id, timestamp, message, level)
VALUES (?, ?, ?, ?)
''', (task_id, timestamp, message, level))
conn.commit()
def get_task_logs(self, task_id, limit=None):
"""获取任务的所有日志
Args:
task_id: 任务ID
limit: 限制返回数量,默认返回所有
Returns:
list: 日志列表,按时间顺序
"""
with self.get_connection() as conn:
cursor = conn.cursor()
if limit:
cursor.execute('''
SELECT task_id, timestamp, message, level
FROM task_logs
WHERE task_id = ?
ORDER BY id ASC
LIMIT ?
''', (task_id, limit))
else:
cursor.execute('''
SELECT task_id, timestamp, message, level
FROM task_logs
WHERE task_id = ?
ORDER BY id ASC
''', (task_id,))
rows = cursor.fetchall()
return [dict(row) for row in rows]
def clear_task_logs(self, task_id):
"""清除任务的所有日志
Args:
task_id: 任务ID
"""
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('DELETE FROM task_logs WHERE task_id = ?', (task_id,))
conn.commit()
def save_articles_batch(self, task_id, articles, page_num):
"""批量保存文章到缓存表
Args:
task_id: 任务ID
articles: 文章列表 [{'title': ..., 'url': ..., 'publish_time': ...}, ...]
page_num: 页码
"""
if not articles:
return
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
with self.get_connection() as conn:
cursor = conn.cursor()
# 批量插入
data = [
(task_id,
article.get('标题', article.get('title', '')),
article.get('链接', article.get('url', '')),
article.get('发布时间', article.get('publish_time', '')),
page_num,
timestamp)
for article in articles
]
cursor.executemany('''
INSERT INTO article_cache (task_id, title, url, publish_time, page_num, created_at)
VALUES (?, ?, ?, ?, ?, ?)
''', data)
conn.commit()
logger.debug(f"保存 {len(articles)} 篇文章到缓存(任务{task_id},第{page_num}页)")
def get_cached_articles(self, task_id):
"""获取任务的所有缓存文章
Args:
task_id: 任务ID
Returns:
list: 文章列表
"""
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('''
SELECT title, url, publish_time, page_num
FROM article_cache
WHERE task_id = ?
ORDER BY id ASC
''', (task_id,))
rows = cursor.fetchall()
return [
{
'标题': row['title'],
'链接': row['url'],
'发布时间': row['publish_time'],
'page_num': row['page_num']
}
for row in rows
]
def get_cached_article_count(self, task_id):
"""获取任务已缓存的文章数量
Args:
task_id: 任务ID
Returns:
int: 文章数量
"""
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('''
SELECT COUNT(*) as count FROM article_cache WHERE task_id = ?
''', (task_id,))
result = cursor.fetchone()
return result['count'] if result else 0
def clear_article_cache(self, task_id):
"""清除任务的所有缓存文章
Args:
task_id: 任务ID
"""
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('DELETE FROM article_cache WHERE task_id = ?', (task_id,))
conn.commit()
logger.debug(f"清除任务 {task_id} 的所有缓存文章")
# 全局数据库实例
_db_instance = None
_db_lock = threading.Lock()
def get_database():
"""获取全局数据库实例(单例模式)"""
global _db_instance
if _db_instance is None:
with _db_lock:
if _db_instance is None:
_db_instance = Database()
return _db_instance
def migrate_from_json(json_file="data/task_queue.json"):
"""从 JSON 文件迁移数据到 SQLite 数据库
Args:
json_file: 原 JSON 文件路径
Returns:
migrated_count: 成功迁移的任务数量
"""
import json
if not os.path.exists(json_file):
logger.info("未找到旧的 JSON 文件,跳过数据迁移")
return 0
try:
# 读取 JSON 数据
with open(json_file, 'r', encoding='utf-8') as f:
tasks = json.load(f)
if not tasks:
logger.info("JSON 文件中没有任务数据")
return 0
db = get_database()
migrated_count = 0
with db.get_connection() as conn:
cursor = conn.cursor()
for task in tasks:
try:
# 检查任务是否已存在
cursor.execute(
"SELECT task_id FROM tasks WHERE task_id = ?",
(task["task_id"],)
)
if cursor.fetchone():
logger.debug(f"任务 {task['task_id']} 已存在,跳过")
continue
# 插入任务数据
cursor.execute('''
INSERT INTO tasks (
task_id, url, months, use_proxy, proxy_api_url,
username, status, created_at, started_at, completed_at,
progress, current_step, total_articles, processed_articles,
error, result_file
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (
task["task_id"],
task["url"],
task["months"],
1 if task["use_proxy"] else 0,
task.get("proxy_api_url"),
task.get("username"),
task["status"],
task["created_at"],
task.get("started_at"),
task.get("completed_at"),
task.get("progress", 0),
task.get("current_step"),
task.get("total_articles", 0),
task.get("processed_articles", 0),
task.get("error"),
task.get("result_file")
))
migrated_count += 1
except Exception as e:
logger.error(f"迁移任务 {task.get('task_id')} 失败: {e}")
conn.commit()
logger.info(f"成功迁移 {migrated_count} 个任务到数据库")
# 备份原 JSON 文件
backup_file = json_file + ".backup"
if migrated_count > 0:
import shutil
shutil.copy2(json_file, backup_file)
logger.info(f"原 JSON 文件已备份到: {backup_file}")
return migrated_count
except Exception as e:
logger.error(f"数据迁移失败: {e}")
return 0
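上面 database.py 的 task_logs 读写流程,可以用一个独立的内存库快速验证(表结构与上文一致,插入的数据为示例):

```python
import sqlite3

# 建立内存库,表结构与 database.py 中的 task_logs 一致
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # 让查询结果可以按列名访问
conn.execute('''
    CREATE TABLE task_logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        task_id TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        message TEXT NOT NULL,
        level TEXT DEFAULT 'info'
    )
''')
# 不指定 level落库时自动取 DEFAULT 'info'
conn.execute("INSERT INTO task_logs (task_id, timestamp, message) VALUES (?, ?, ?)",
             ('task-001', '2025-12-19 22:00:00', '开始抓取'))
rows = [dict(r) for r in conn.execute(
    "SELECT message, level FROM task_logs WHERE task_id = ? ORDER BY id", ('task-001',))]
print(rows)   # [{'message': '开始抓取', 'level': 'info'}]
conn.close()
```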

43
demo_python.py Normal file
View File

@@ -0,0 +1,43 @@
import requests
import time
if __name__ == '__main__':
# 客户 IP 提取链接每次提取 1 个;提取链接可以换成自己购买的
url = 'http://api.tianqiip.com/getip?secret=lu29e593&num=1&type=txt&port=1&mr=1&sign=4b81a62eaed89ba802a8f34053e2c964'
# 访问的目标地址
targeturl = 'https://mbd.baidu.com/webpage?tab=main&num=10&uk=ntHidnLhrlfclJar2z8wBg&source=pc&type=newhome&action=dynamic&format=jsonp&otherext=h5_20251126173230&Tenger-Mhor=3659421940&callback=__jsonp01765201579331'
response = requests.get(url)
content = response.content.decode("utf-8").strip()
print('提取IP' + content)
nowtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print('提取IP时间' + nowtime)
sj = content.strip().split(":", 1)
sj1 = sj[0]
print("IP", sj1)
sj2 = sj[1]
print("端口:", sj2)
try:
#proxyMeta = "http://nfd0p2:bHQAp5iW@%(host)s:%(port)s" % { # 账密验证,需要购买的代理套餐开通才可使用账密验证,此种情况无需加白名单
proxyMeta = f"http://{sj1}:{sj2}"
print("代理1", proxyMeta)
proxysdata = {
'http': proxyMeta,
'https': proxyMeta
}
print("代理2", proxysdata)
headers = {
"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
start = int(round(time.time() * 1000))
resp = requests.get(targeturl, proxies=proxysdata, headers=headers, timeout=20)
costTime = int(round(time.time() * 1000)) - start
print("耗时:" + str(costTime) + "ms")
print("返回:",resp.text)
s = requests.session()
s.keep_alive = False
except Exception as e:
print("异常:",e)
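demo_python.py 用 split(":", 1) 拆分提取接口返回的 ip:port 文本,没有做格式校验。下面是一个带校验的最小示意parse_ip_port 为示意用的假设函数名):

```python
def parse_ip_port(text: str):
    """把 '180.96.27.12:88' 形式的文本拆成 (host, port)"""
    # rpartition 从右侧找冒号,对含多个冒号的异常输入更稳妥
    host, sep, port = text.strip().rpartition(':')
    if not sep or not host or not port.isdigit():
        raise ValueError(f'无法解析代理地址: {text!r}')
    return host, int(port)

print(parse_ip_port('180.96.27.12:88'))  # ('180.96.27.12', 88)
```

提取接口偶尔会返回错误提示文本而非 IP先校验再使用可以避免把错误信息当成代理地址。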

43
fix_taskworker.sh Normal file
View File

@@ -0,0 +1,43 @@
#!/bin/bash
# 快速诊断和修复脚本
echo "======================================"
echo " TaskWorker 快速诊断和修复工具"
echo "======================================"
echo ""
# 检查 Python 环境
if ! command -v python3 &> /dev/null; then
echo "❌ Python3 未找到"
exit 1
fi
# 进入脚本所在目录
cd "$(dirname "$0")"
echo "[1] 检查 TaskWorker 状态..."
python3 check_taskworker.py
echo ""
read -p "是否需要修复? (y/n): " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
echo ""
echo "[2] 正在修复..."
python3 check_taskworker.py --fix
echo ""
echo "[3] 重新检查状态..."
sleep 2
python3 check_taskworker.py
echo ""
echo "✅ 修复完成!"
else
echo ""
echo "已取消修复"
fi
echo ""
echo "======================================"

168
gunicorn_config.py Normal file
View File

@@ -0,0 +1,168 @@
# -*- coding: utf-8 -*-
"""
Gunicorn 配置文件
"""
import multiprocessing
import os
# 服务器绑定地址
bind = "0.0.0.0:8030"
# 工作进程数建议为 CPU 核心数 * 2 + 1
workers = multiprocessing.cpu_count() * 2 + 1
# 工作模式gevent 适合 I/O 密集型应用,如爬虫)
# 需要安装: pip install gevent
# worker_class = 'gevent'
# 或使用线程模式(适合任务队列)
worker_class = 'gthread'
threads = 2
# 最大并发请求数
worker_connections = 1000
# 工作进程超时时间(秒)
timeout = 300
# 优雅重启超时时间
graceful_timeout = 30
# Keep-alive 时间
keepalive = 5
# 守护进程模式(后台运行)
# 注意保持 False 便于调试查看详细日志;需要后台运行时可设为 True
daemon = False
# 进程 PID 文件
pidfile = 'gunicorn.pid'
# 日志配置
accesslog = 'logs/gunicorn_access.log'
errorlog = 'logs/gunicorn_error.log'
loglevel = 'info'
# 访问日志格式
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'
# 进程名称
proc_name = 'baijiahao_scraper'
# 最大请求数(防止内存泄漏)
max_requests = 1000
max_requests_jitter = 50
# 预加载应用(节省内存)
# 注意:由于 TaskWorker 需要在 worker 进程中启动,此处设置为 False
preload_app = False
# 环境变量
raw_env = [
'FLASK_ENV=production',
]
# 工作进程启动时的回调
def on_starting(server):
"""服务器启动时"""
import os
print("=" * 50)
print("Gunicorn 服务启动中...")
print(f"绑定地址: {bind}")
print(f"工作进程数: {workers}")
print(f"工作模式: {worker_class}")
# 清理旧的TaskWorker锁文件
lock_file = 'data/taskworker.lock'
if os.path.exists(lock_file):
try:
os.remove(lock_file)
print("✓ 已清理旧的TaskWorker锁文件")
except OSError:
pass
print("=" * 50)
def when_ready(server):
"""服务器就绪时"""
print("✓ 服务器已就绪,可以接受请求")
def post_worker_init(worker):
"""worker进程初始化后的钩子 - 只在第一个worker中启动TaskWorker"""
import os
import sys
import logging
import time
import fcntl # 用于文件锁
# 设置日志直接输出到gunicorn error log
logger = logging.getLogger('gunicorn.error')
# 创建必要的目录
os.makedirs('exports', exist_ok=True)
os.makedirs('data', exist_ok=True)
os.makedirs('data/results', exist_ok=True)
os.makedirs('logs', exist_ok=True)
# 使用文件锁确保只有一个worker启动TaskWorker
lock_file_path = 'data/taskworker.lock'
lock_file = None
try:
# 打开锁文件(不存在则创建)
lock_file = open(lock_file_path, 'w')
# 尝试获取排他锁(非阻塞)
try:
fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
# 成功获得锁启动TaskWorker
logger.info(f"[Worker {worker.pid}] 获得锁准备启动TaskWorker...")
lock_file.write(str(worker.pid))
lock_file.flush()
try:
from task_worker import start_task_worker, get_task_worker
start_task_worker()
# 验证启动状态
time.sleep(1)
task_worker = get_task_worker()
if task_worker.running:
logger.info(f"[Worker {worker.pid}] ✅ TaskWorker 已成功启动(主 worker")
logger.info(f"[Worker {worker.pid}] 并发数: {task_worker.current_workers}/{task_worker.max_workers}")
else:
logger.error(f"[Worker {worker.pid}] ⚠️ TaskWorker启动后未运行")
except Exception as e:
logger.error(f"[Worker {worker.pid}] TaskWorker启动失败: {e}")
import traceback
logger.error(traceback.format_exc())
# 释放锁
fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)
lock_file.close()
except IOError:
# 锁已被其他进程持有
logger.info(f"[Worker {worker.pid}] 跳过 TaskWorker 启动(其他 worker 已启动)")
lock_file.close()
except Exception as e:
logger.error(f"[Worker {worker.pid}] TaskWorker启动异常: {e}")
import traceback
logger.error(traceback.format_exc())
if lock_file:
lock_file.close()
def on_exit(server):
"""服务器退出时"""
import os
# 清理TaskWorker锁文件
lock_file = 'data/taskworker.lock'
if os.path.exists(lock_file):
try:
os.remove(lock_file)
except OSError:
pass
print("✓ Gunicorn 服务已停止")
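post_worker_init 中"用 flock 非阻塞排他锁保证只有一个 worker 启动 TaskWorker"的模式可以独立演示仅适用于 POSIX 系统,锁文件路径为示意;实际项目使用 data/taskworker.lock

```python
import fcntl
import os

LOCK_PATH = '/tmp/demo_taskworker.lock'   # 示意路径

def try_acquire(path):
    """尝试获取文件排他锁;成功返回持锁的文件对象(关闭即释放),失败返回 None"""
    f = open(path, 'w')
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)  # 非阻塞排他锁
        f.write(str(os.getpid()))
        f.flush()
        return f
    except OSError:
        f.close()
        return None

holder = try_acquire(LOCK_PATH)    # 第一次获取:成功
blocked = try_acquire(LOCK_PATH)   # 锁被占用时(即使是同进程的新文件描述符)再次获取:失败
print(holder is not None, blocked is None)  # True True
fcntl.flock(holder.fileno(), fcntl.LOCK_UN)  # 释放锁
holder.close()
```

flock 以打开文件描述为粒度,因此多个 gunicorn worker 进程同时抢锁时只有一个能成功,其余立即拿到失败结果而不会阻塞启动。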

51
install_service.sh Normal file
View File

@@ -0,0 +1,51 @@
#!/bin/bash
###############################################################################
# systemd 服务安装脚本
###############################################################################
# 颜色定义
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'
echo -e "${YELLOW}=========================================${NC}"
echo -e "${YELLOW} 安装 systemd 服务${NC}"
echo -e "${YELLOW}=========================================${NC}"
echo ""
# 检查是否为root
if [ "$EUID" -ne 0 ]; then
echo -e "${RED}请使用 sudo 运行此脚本${NC}"
exit 1
fi
PROJECT_DIR="/var/www/ai_baijiahao"
SERVICE_FILE="baijiahao.service"
# 复制服务文件
echo "复制服务文件到 /etc/systemd/system/..."
cp ${PROJECT_DIR}/${SERVICE_FILE} /etc/systemd/system/
# 重载 systemd
echo "重载 systemd 配置..."
systemctl daemon-reload
# 启用服务(开机自启)
echo "启用服务开机自启..."
systemctl enable baijiahao
echo ""
echo -e "${GREEN}=========================================${NC}"
echo -e "${GREEN} 服务安装完成!${NC}"
echo -e "${GREEN}=========================================${NC}"
echo ""
echo "常用命令:"
echo -e " 启动服务: ${YELLOW}sudo systemctl start baijiahao${NC}"
echo -e " 停止服务: ${YELLOW}sudo systemctl stop baijiahao${NC}"
echo -e " 重启服务: ${YELLOW}sudo systemctl restart baijiahao${NC}"
echo -e " 查看状态: ${YELLOW}sudo systemctl status baijiahao${NC}"
echo -e " 查看日志: ${YELLOW}sudo journalctl -u baijiahao -f${NC}"
echo -e " 禁用自启: ${YELLOW}sudo systemctl disable baijiahao${NC}"
echo ""

85
migrate_database.py Normal file
View File

@@ -0,0 +1,85 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
数据库迁移脚本
为tasks表添加新字段paused_at, retry_count, last_error, articles_only
"""
import sqlite3
import os
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def migrate_database():
"""执行数据库迁移"""
db_path = "data/baijiahao.db"
if not os.path.exists(db_path):
logger.info("数据库文件不存在,无需迁移")
return
try:
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# 检查表是否存在
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='tasks'")
if not cursor.fetchone():
logger.info("tasks表不存在无需迁移")
conn.close()
return
# 获取当前表结构
cursor.execute("PRAGMA table_info(tasks)")
columns = {row[1]: row for row in cursor.fetchall()}
logger.info("开始数据库迁移...")
# 添加 paused_at 字段
if 'paused_at' not in columns:
logger.info("添加 paused_at 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN paused_at TEXT")
logger.info("✓ paused_at 字段添加成功")
else:
logger.info("✓ paused_at 字段已存在")
# 添加 retry_count 字段
if 'retry_count' not in columns:
logger.info("添加 retry_count 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN retry_count INTEGER DEFAULT 0")
logger.info("✓ retry_count 字段添加成功")
else:
logger.info("✓ retry_count 字段已存在")
# 添加 last_error 字段
if 'last_error' not in columns:
logger.info("添加 last_error 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN last_error TEXT")
logger.info("✓ last_error 字段添加成功")
else:
logger.info("✓ last_error 字段已存在")
# 添加 articles_only 字段
if 'articles_only' not in columns:
logger.info("添加 articles_only 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN articles_only INTEGER DEFAULT 1")
logger.info("✓ articles_only 字段添加成功")
else:
logger.info("✓ articles_only 字段已存在")
conn.commit()
conn.close()
logger.info("=" * 50)
logger.info("✅ 数据库迁移完成!")
logger.info("=" * 50)
except Exception as e:
logger.error(f"❌ 数据库迁移失败: {e}")
import traceback
traceback.print_exc()
raise
if __name__ == "__main__":
migrate_database()
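"先查 PRAGMA table_info 再 ALTER TABLE"的幂等迁移模式可以抽成一个通用的小函数示意add_column_if_missing 为示意用的假设函数名,使用内存库演示):

```python
import sqlite3

def add_column_if_missing(conn, table, column, ddl):
    """列不存在时才执行 ALTER TABLE重复执行安全幂等"""
    # PRAGMA table_info 每行的第 2 列row[1])是列名
    cols = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in cols:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
        return True   # 本次执行了迁移
    return False      # 已存在,跳过

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY)")
print(add_column_if_missing(conn, 'tasks', 'retry_count', 'INTEGER DEFAULT 0'))  # True
print(add_column_if_missing(conn, 'tasks', 'retry_count', 'INTEGER DEFAULT 0'))  # False
```

SQLite 的 ALTER TABLE 不支持 IF NOT EXISTS所以迁移脚本需要自己保证幂等这也是 migrate_database.py 逐字段判断的原因。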

59
migrate_database_v2.py Normal file
View File

@@ -0,0 +1,59 @@
# -*- coding: utf-8 -*-
"""
数据库迁移脚本 V2
添加断点续传支持字段
"""
import sqlite3
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def migrate_database():
"""执行数据库迁移"""
db_path = "data/baijiahao.db"
logger.info("=" * 50)
logger.info("开始数据库迁移 V2...")
logger.info("=" * 50)
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
try:
# 获取当前表结构
cursor.execute("PRAGMA table_info(tasks)")
columns = {row[1]: row for row in cursor.fetchall()}
# 添加 last_page 字段(最后抓取的页码)
if 'last_page' not in columns:
logger.info("添加 last_page 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN last_page INTEGER DEFAULT 0")
logger.info("✓ last_page 字段添加成功")
else:
logger.info("✓ last_page 字段已存在")
# 添加 last_ctime 字段(最后的分页参数)
if 'last_ctime' not in columns:
logger.info("添加 last_ctime 字段...")
cursor.execute("ALTER TABLE tasks ADD COLUMN last_ctime TEXT")
logger.info("✓ last_ctime 字段添加成功")
else:
logger.info("✓ last_ctime 字段已存在")
conn.commit()
logger.info("=" * 50)
logger.info("✅ 数据库迁移 V2 完成!")
logger.info("=" * 50)
except Exception as e:
logger.error(f"❌ 迁移失败: {e}")
conn.rollback()
raise
finally:
conn.close()
if __name__ == "__main__":
migrate_database()

35
remove_selenium.py Normal file
View File

@@ -0,0 +1,35 @@
# -*- coding: utf-8 -*-
# 移除 app.py 中的 Selenium 相关代码
with open('app.py', 'r', encoding='utf-8') as f:
lines = f.readlines()
# 定义要删除的行范围(包含开始和结束)
# get_articles_with_selenium_api: 451-665
# get_articles_with_selenium: 666-1038
# _extract_articles_from_page: 1093-1211
delete_ranges = [
(451, 665), # get_articles_with_selenium_api
(666, 1038), # get_articles_with_selenium
(1093, 1211) # _extract_articles_from_page
]
output_lines = []
for i, line in enumerate(lines, 1):
should_keep = True
for start, end in delete_ranges:
if start <= i <= end:
should_keep = False
break
if should_keep:
output_lines.append(line)
with open('app_refactored.py', 'w', encoding='utf-8') as f:
f.writelines(output_lines)
print(f"✅ 原始行数: {len(lines)}")
print(f"✅ 删除后行数: {len(output_lines)}")
print(f"✅ 已删除: {len(lines) - len(output_lines)}")
print(f"✅ 新文件已保存为: app_refactored.py")

12
requirements.txt Normal file
View File

@@ -0,0 +1,12 @@
Flask==3.0.0
flask-cors==4.0.0
flask-socketio==5.3.6
python-socketio==5.11.0
requests==2.31.0
pandas==2.1.4
openpyxl==3.1.2
psutil==5.9.6
gunicorn==21.2.0
aiohttp==3.9.1
fake-useragent==1.4.0
playwright==1.40.0

12
restart.sh Normal file
View File

@@ -0,0 +1,12 @@
#!/bin/bash
# 重启服务
echo "正在停止服务..."
bash stop.sh
sleep 2
echo "正在启动服务..."
bash start.sh
echo "服务已重启"

32
scrapy_proxy.py Normal file
View File

@@ -0,0 +1,32 @@
import scrapy
class MimvpSpider(scrapy.spiders.Spider):
name = "mimvp"
allowed_domains = ["mimvp.com"]
start_urls = [
"http://proxy.mimvp.com/exist.php",
"https://proxy.mimvp.com/exist.php",
]
## 代理使用方式1直接在代码中为每个请求设置 meta['proxy']
def start_requests(self):
urls = [
"http://proxy.mimvp.com/exist.php",
"https://proxy.mimvp.com/exist.php",
]
for url in urls:
meta_proxy = ""
if url.startswith("http://"):
meta_proxy = "http://180.96.27.12:88" # http 代理
elif url.startswith("https://"):
meta_proxy = "http://109.108.87.136:53281" # https 代理
yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': meta_proxy})
def parse(self, response):
mimvp_url = response.url # 获取请求的 url
body = response.body # 返回的网页内容
print("mimvp_url : " + str(mimvp_url))
print("body : " + str(body))

356
start.sh Normal file
View File

@@ -0,0 +1,356 @@
#!/bin/bash
###############################################################################
# 百家号爬虫系统 - 快速启动脚本
# 功能:杀死旧进程 -> 激活虚拟环境 -> 启动服务 -> 检查状态
# 支持nohup / gunicorn 两种启动方式
# 新增TaskWorker 监控和健康检查
###############################################################################
# 颜色定义
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
# 项目配置
PROJECT_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_DIR="${PROJECT_DIR}/venv"
APP_PORT=8030
PID_FILE="${PROJECT_DIR}/app.pid"
GUNICORN_PID_FILE="${PROJECT_DIR}/gunicorn.pid"
MONITOR_PID_FILE="${PROJECT_DIR}/monitor.pid"
# 启动模式(默认使用 gunicorn
START_MODE="${1:-gunicorn}" # gunicorn 或 nohup
ENABLE_MONITOR="${2:-yes}" # 是否启动监控yes/no
echo -e "${BLUE}=========================================${NC}"
echo -e "${BLUE} 百家号爬虫系统 - 启动服务${NC}"
echo -e "${BLUE} 启动模式: ${START_MODE}${NC}"
echo -e "${BLUE} 自动监控: ${ENABLE_MONITOR}${NC}"
echo -e "${BLUE}=========================================${NC}"
echo ""
###############################################################################
# 1. 杀死旧进程
###############################################################################
echo -e "${YELLOW}[1/5]${NC} 检查并停止旧服务..."
# 通过 PID 文件停止
if [[ -f "${PID_FILE}" ]]; then
OLD_PID=$(cat ${PID_FILE})
if ps -p ${OLD_PID} > /dev/null 2>&1; then
echo " 发现旧进程 (PID: ${OLD_PID}),正在停止..."
kill ${OLD_PID} 2>/dev/null
sleep 2
# 如果还在运行,强制终止
if ps -p ${OLD_PID} > /dev/null 2>&1; then
echo " 强制终止进程..."
kill -9 ${OLD_PID} 2>/dev/null
fi
echo -e " ${GREEN}${NC} 旧进程已停止"
fi
rm -f ${PID_FILE}
fi
# 停止 Gunicorn
if [[ -f "${GUNICORN_PID_FILE}" ]]; then
GUNICORN_PID=$(cat ${GUNICORN_PID_FILE})
if ps -p ${GUNICORN_PID} > /dev/null 2>&1; then
echo " 发现 Gunicorn 进程 (PID: ${GUNICORN_PID}),正在停止..."
kill ${GUNICORN_PID} 2>/dev/null
sleep 2
if ps -p ${GUNICORN_PID} > /dev/null 2>&1; then
echo " 强制终止 Gunicorn..."
kill -9 ${GUNICORN_PID} 2>/dev/null
fi
echo -e " ${GREEN}${NC} Gunicorn 进程已停止"
fi
rm -f ${GUNICORN_PID_FILE}
fi
# 杀死所有 app.py 相关进程(包括任务线程)
APP_PIDS=$(ps aux | grep "[p]ython.*app.py" | awk '{print $2}')
if [[ -n "${APP_PIDS}" ]]; then
echo " 发现运行中的 Python 进程,正在清理..."
for pid in ${APP_PIDS}; do
echo " 停止进程 ${pid}..."
kill ${pid} 2>/dev/null
done
sleep 2
# 强制清理残留进程
APP_PIDS=$(ps aux | grep "[p]ython.*app.py" | awk '{print $2}')
if [[ -n "${APP_PIDS}" ]]; then
echo " 强制清理残留进程..."
for pid in ${APP_PIDS}; do
kill -9 ${pid} 2>/dev/null
done
fi
echo -e " ${GREEN}${NC} 所有旧进程已清理"
else
echo -e " ${GREEN}${NC} 未发现运行中的进程"
fi
# 清理端口占用
PORT_PID=$(lsof -ti:${APP_PORT} 2>/dev/null)
if [[ -n "${PORT_PID}" ]]; then
echo " 端口 ${APP_PORT} 被占用 (PID: ${PORT_PID}),正在释放..."
kill -9 ${PORT_PID} 2>/dev/null
echo -e " ${GREEN}${NC} 端口已释放"
fi
# 停止监控进程
if [[ -f "${MONITOR_PID_FILE}" ]]; then
MONITOR_PID=$(cat ${MONITOR_PID_FILE})
if ps -p ${MONITOR_PID} > /dev/null 2>&1; then
echo " 发现监控进程 (PID: ${MONITOR_PID}),正在停止..."
kill ${MONITOR_PID} 2>/dev/null
sleep 1
echo -e " ${GREEN}${NC} 监控进程已停止"
fi
rm -f ${MONITOR_PID_FILE}
fi
# 清理 TaskWorker 锁文件
if [[ -f "data/taskworker.lock" ]]; then
echo " 清理 TaskWorker 锁文件..."
rm -f data/taskworker.lock
echo -e " ${GREEN}${NC} 锁文件已清理"
fi
echo ""
###############################################################################
# 2. 激活虚拟环境
###############################################################################
echo -e "${YELLOW}[2/5]${NC} 激活虚拟环境..."
if [[ ! -d "${VENV_DIR}" ]]; then
echo -e " ${RED}${NC} 虚拟环境不存在: ${VENV_DIR}"
echo " 请先创建虚拟环境:"
echo " python3 -m venv venv"
echo " source venv/bin/activate"
echo " pip install -r requirements.txt"
exit 1
fi
source ${VENV_DIR}/bin/activate
if [[ "$VIRTUAL_ENV" != "" ]]; then
echo -e " ${GREEN}${NC} 虚拟环境已激活: ${VIRTUAL_ENV}"
else
echo -e " ${RED}${NC} 虚拟环境激活失败"
exit 1
fi
echo ""
###############################################################################
# 3. 启动服务
###############################################################################
echo -e "${YELLOW}[3/5]${NC} 启动服务..."
cd ${PROJECT_DIR}
# 确保日志与数据目录存在gunicorn/nohup 写日志前不会自动创建)
mkdir -p logs data
if [[ "${START_MODE}" == "gunicorn" ]]; then
# 使用 Gunicorn 启动
echo " 使用 Gunicorn 启动服务..."
# 检查 gunicorn 是否安装
if ! command -v gunicorn &> /dev/null; then
echo -e " ${RED}${NC} Gunicorn 未安装,请先安装:"
echo " pip install gunicorn"
exit 1
fi
# 后台启动 Gunicorn--daemon 覆盖配置中的 daemon = False避免脚本被阻塞
gunicorn -c gunicorn_config.py --daemon app:app
# 等待服务启动daemon 模式需要更长时间)
echo " 等待服务启动..."
sleep 5
# 检查服务是否成功启动
if [[ -f "${GUNICORN_PID_FILE}" ]]; then
GUNICORN_PID=$(cat ${GUNICORN_PID_FILE})
if ps -p ${GUNICORN_PID} > /dev/null 2>&1; then
echo ""
echo -e "${GREEN}=========================================${NC}"
echo -e "${GREEN} Gunicorn 服务启动成功!${NC}"
echo -e "${GREEN}=========================================${NC}"
echo ""
echo -e " PID: ${GUNICORN_PID}"
echo -e " 端口: ${APP_PORT}"
echo -e " 访问地址: http://0.0.0.0:${APP_PORT}"
echo -e " 访问日志: logs/gunicorn_access.log"
echo -e " 错误日志: logs/gunicorn_error.log"
echo ""
echo -e "查看日志: ${BLUE}tail -f logs/gunicorn_error.log${NC}"
echo -e "停止服务: ${BLUE}./stop.sh${NC}"
echo -e "重启服务: ${BLUE}./restart.sh${NC}"
echo -e "检查状态: ${BLUE}python check_taskworker.py${NC}"
echo ""
else
echo ""
echo -e "${RED}=========================================${NC}"
echo -e "${RED} Gunicorn 启动失败!${NC}"
echo -e "${RED}=========================================${NC}"
echo ""
echo "请检查日志文件:"
echo " tail -n 50 logs/gunicorn_error.log"
echo ""
exit 1
fi
else
# 尝试通过端口检查服务是否启动
echo " PID文件未生成检查端口占用..."
sleep 2
if lsof -i:${APP_PORT} > /dev/null 2>&1; then
echo ""
echo -e "${GREEN}=========================================${NC}"
echo -e "${GREEN} Gunicorn 服务已启动!${NC}"
echo -e "${GREEN}=========================================${NC}"
echo ""
echo -e " 端口: ${APP_PORT}"
echo -e " 访问地址: http://0.0.0.0:${APP_PORT}"
echo -e " 访问日志: logs/gunicorn_access.log"
echo -e " 错误日志: logs/gunicorn_error.log"
echo ""
echo -e "查看日志: ${BLUE}tail -f logs/gunicorn_error.log${NC}"
echo -e "停止服务: ${BLUE}./stop.sh${NC}"
echo -e "检查状态: ${BLUE}python check_taskworker.py${NC}"
echo ""
else
echo -e "${RED}${NC} 服务启动失败,请检查错误日志:"
echo " tail -n 50 logs/gunicorn_error.log"
exit 1
fi
fi
else
# 使用 nohup 启动
echo " 使用 nohup 启动服务..."
# 后台启动
nohup python app.py > logs/app.log 2>&1 &
NEW_PID=$!
# 保存 PID
echo ${NEW_PID} > ${PID_FILE}
# 等待服务启动
echo " 等待服务启动..."
sleep 3
# 检查服务是否成功启动
if ps -p ${NEW_PID} > /dev/null 2>&1; then
echo ""
echo -e "${GREEN}=========================================${NC}"
echo -e "${GREEN} 服务启动成功!${NC}"
echo -e "${GREEN}=========================================${NC}"
echo ""
echo -e " PID: ${NEW_PID}"
echo -e " 端口: ${APP_PORT}"
echo -e " 访问地址: http://127.0.0.1:${APP_PORT}"
echo -e " 日志文件: logs/app.log"
echo ""
echo -e "查看日志: ${BLUE}tail -f logs/app.log${NC}"
echo -e "停止服务: ${BLUE}kill ${NEW_PID}${NC}"
echo -e "检查状态: ${BLUE}python check_taskworker.py${NC}"
echo ""
else
echo ""
echo -e "${RED}=========================================${NC}"
echo -e "${RED} 服务启动失败!${NC}"
echo -e "${RED}=========================================${NC}"
echo ""
echo "请检查日志文件:"
echo " tail -n 50 logs/app.log"
echo ""
exit 1
fi
fi
echo ""
###############################################################################
# 4. 检查 TaskWorker 状态
###############################################################################
echo -e "${YELLOW}[4/5]${NC} 检查 TaskWorker 状态..."
# 等待服务完全启动
sleep 3
# 检查健康状态
if command -v curl &> /dev/null; then
HEALTH_CHECK=$(curl -s http://localhost:${APP_PORT}/health/taskworker 2>/dev/null)
if [[ -n "${HEALTH_CHECK}" ]]; then
STATUS=$(echo ${HEALTH_CHECK} | grep -o '"status":"[^"]*"' | cut -d'"' -f4)
if [[ "${STATUS}" == "healthy" ]]; then
echo -e " ${GREEN}${NC} TaskWorker 状态: ${GREEN}healthy${NC}"
# 提取详细信息
ALIVE_THREADS=$(echo ${HEALTH_CHECK} | grep -o '"alive_threads":[0-9]*' | cut -d':' -f2)
PENDING=$(echo ${HEALTH_CHECK} | grep -o '"pending":[0-9]*' | cut -d':' -f2)
echo " 活跃线程: ${ALIVE_THREADS}"
echo " 待处理任务: ${PENDING}"
else
echo -e " ${YELLOW}${NC} TaskWorker 状态: ${YELLOW}${STATUS}${NC}"
echo " 建议运行: python check_taskworker.py --fix"
fi
else
echo -e " ${YELLOW}${NC} 无法连接到服务,稍后重试"
fi
else
echo " 跳过健康检查curl 未安装)"
fi
echo ""
###############################################################################
# 5. 启动监控进程(可选)
###############################################################################
if [[ "${ENABLE_MONITOR}" == "yes" ]]; then
echo -e "${YELLOW}[5/5]${NC} 启动 TaskWorker 自动监控..."
if [[ -f "taskworker_monitor.py" ]]; then
# 后台启动监控
nohup python taskworker_monitor.py > logs/monitor.out 2>&1 &
MONITOR_PID=$!
echo ${MONITOR_PID} > ${MONITOR_PID_FILE}
sleep 1
if ps -p ${MONITOR_PID} > /dev/null 2>&1; then
echo -e " ${GREEN}${NC} 监控进程已启动 (PID: ${MONITOR_PID})"
echo " 监控日志: logs/taskworker_monitor.log"
echo " 输出日志: logs/monitor.out"
else
echo -e " ${YELLOW}${NC} 监控进程启动失败"
fi
else
echo -e " ${YELLOW}${NC} 监控脚本不存在: taskworker_monitor.py"
fi
else
echo -e "${YELLOW}[5/5]${NC} 跳过自动监控(使用 './start.sh gunicorn yes' 启用)"
fi
echo ""
echo -e "${GREEN}=========================================${NC}"
echo -e "${GREEN} 服务启动完成!${NC}"
echo -e "${GREEN}=========================================${NC}"
echo ""
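start.sh 里用 grep/cut 从健康检查返回中提取字段,依赖 JSON 的紧凑格式;若改用 Python解析会更稳妥。下面的 payload 为假设的返回结构,字段名取自上面脚本解析的键:

```python
import json

# /health/taskworker 的返回结构为假设示例
payload = '{"status": "healthy", "alive_threads": 3, "pending": 0}'
data = json.loads(payload)
print(data['status'], data['alive_threads'], data['pending'])  # healthy 3 0
```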

2078
static/css/bootstrap-icons.css vendored Normal file

File diff suppressed because it is too large Load Diff

5
static/css/bootstrap-icons.min.css vendored Normal file

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,81 @@
/* Bootstrap Icons - 本地精简版 */
/* 使用多CDN备份策略确保字体文件加载 */
@font-face {
font-display: block;
font-family: "bootstrap-icons";
src:
url("https://unpkg.com/bootstrap-icons@1.11.3/font/fonts/bootstrap-icons.woff2") format("woff2"),
url("https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.3/font/fonts/bootstrap-icons.woff2") format("woff2"),
url("https://cdn.bootcdn.net/ajax/libs/bootstrap-icons/1.11.3/font/bootstrap-icons.woff2") format("woff2"),
url("https://unpkg.com/bootstrap-icons@1.11.3/font/fonts/bootstrap-icons.woff") format("woff"),
url("https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.3/font/fonts/bootstrap-icons.woff") format("woff");
}
.bi::before,
[class^="bi-"]::before,
[class*=" bi-"]::before {
display: inline-block;
font-family: bootstrap-icons !important;
font-style: normal;
font-weight: normal !important;
font-variant: normal;
text-transform: none;
line-height: 1;
vertical-align: -.125em;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
/* 项目中使用的图标 */
.bi-shield-lock-fill::before { content: "\f621"; }
.bi-person-fill::before { content: "\f4da"; }
.bi-key-fill::before { content: "\f494"; }
.bi-box-arrow-in-right::before { content: "\f1cb"; }
.bi-hourglass-split::before { content: "\f47e"; }
.bi-file-earmark-text::before { content: "\f32a"; }
.bi-person-circle::before { content: "\f4d6"; }
.bi-box-arrow-right::before { content: "\f1cd"; }
.bi-link-45deg::before { content: "\f4b3"; }
.bi-info-circle::before { content: "\f489"; }
.bi-cookie::before { content: "\f2a0"; }
.bi-calendar-range::before { content: "\f1e9"; }
.bi-shield-check::before { content: "\f61c"; }
.bi-file-earmark-spreadsheet::before { content: "\f324"; }
.bi-card-list::before { content: "\f1ed"; }
.bi-download::before { content: "\f30b"; }
.bi-plus-circle::before { content: "\f512"; }
.bi-1-circle::before { content: "\f657"; }
.bi-2-circle::before { content: "\f658"; }
.bi-3-circle::before { content: "\f659"; }
.bi-4-circle::before { content: "\f65a"; }
.bi-5-circle::before { content: "\f65b"; }
.bi-6-circle::before { content: "\f65c"; }
.bi-file-arrow-down::before { content: "\f310"; }
.bi-list-ul::before { content: "\f4bc"; }
.bi-clock::before { content: "\f279"; }
.bi-check-circle-fill::before { content: "\f26b"; }
.bi-three-dots::before { content: "\f62d"; }
.bi-book::before { content: "\f194"; }
.bi-exclamation-triangle::before { content: "\f33c"; }
.bi-inbox::before { content: "\f486"; }
.bi-list-task::before { content: "\f4ba"; }
.bi-cloud-download::before { content: "\f265"; }
.bi-file-earmark-arrow-down::before { content: "\f30e"; }
.bi-newspaper::before { content: "\f4ca"; }
.bi-kanban::before { content: "\f48d"; }
.bi-list-check::before { content: "\f4b6"; }
.bi-collection::before { content: "\f285"; }
.bi-arrow-repeat::before { content: "\f130"; }
.bi-check-circle::before { content: "\f26a"; }
.bi-x-circle::before { content: "\f623"; }
.bi-pause-circle::before { content: "\f4c2"; }
.bi-list::before { content: "\f4b4"; }
.bi-eye::before { content: "\f341"; }
.bi-trash::before { content: "\f5de"; }
.bi-file-text::before { content: "\f32d"; }
.bi-chevron-bar-left::before { content: "\f276"; }
.bi-chevron-left::before { content: "\f284"; }
.bi-chevron-right::before { content: "\f285"; }
.bi-chevron-bar-right::before { content: "\f277"; }
.bi-x::before { content: "\f62a"; }

static/css/style.css Normal file

File diff suppressed because it is too large

static/js/jquery.min.js vendored Normal file

File diff suppressed because one or more lines are too long

static/js/main.js Normal file

@@ -0,0 +1,381 @@
// 检查jQuery是否加载
if (typeof jQuery === 'undefined') {
    console.error('jQuery未加载,请检查网络连接');
    alert('jQuery加载失败,请刷新页面或检查网络连接');
} else {
$(document).ready(function() {
let currentFilename = null;
let progressInterval = null;
let currentStep = 0;
const steps = [
{ name: '解析URL', percent: 5 },
{ name: '启动浏览器', percent: 15 },
{ name: '加载页面', percent: 30 },
{ name: '滚动获取文章', percent: 70 },
{ name: '提取数据', percent: 85 },
{ name: '生成Excel文件', percent: 95 }
];
// 登出按钮点击事件
$('#logoutBtn').click(function() {
if (confirm('确定要登出吗?')) {
$.ajax({
url: '/api/logout',
type: 'POST',
success: function(response) {
if (response.success) {
window.location.href = '/login';
}
},
error: function() {
window.location.href = '/login';
}
});
}
});
// 加载队列统计信息并更新徽章
function updateQueueBadge() {
$.ajax({
url: '/api/queue/stats',
type: 'GET',
success: function(response) {
if (response.success && response.stats) {
const pending = response.stats.pending || 0;
const processing = response.stats.processing || 0;
const total = pending + processing;
if (total > 0) {
$('#queueBadge').text(total).show();
} else {
$('#queueBadge').hide();
}
}
},
error: function() {
// 忽略错误,不显示徽章
$('#queueBadge').hide();
}
});
}
// 初始加载徽章
updateQueueBadge();
// 每10秒更新一次徽章
setInterval(updateQueueBadge, 10000);
// 导出按钮点击事件
$('#exportBtn').click(function() {
const url = $('#authorUrl').val().trim();
const cookies = $('#cookieInput').val().trim();
const months = parseFloat($('#monthsSelect').val()); // 改为parseFloat支持小数
const articlesOnly = $('#articlesOnlyCheckbox').is(':checked'); // 获取是否只爬取文章
// 验证URL
if (!url) {
showError('请输入百家号作者主页地址');
return;
}
if (!url.includes('baijiahao.baidu.com') || !url.includes('app_id=')) {
            showError('URL格式不正确,请输入完整的百家号作者主页地址');
return;
}
// 开始导出(始终使用默认代理)
startExport(url, cookies, months, articlesOnly, true, '');
});
// 下载按钮点击事件
$('#downloadBtn').click(function() {
if (currentFilename) {
window.location.href = `/api/download/${currentFilename}`;
}
});
// 输入框回车事件
$('#authorUrl').keypress(function(e) {
if (e.which === 13) {
$('#exportBtn').click();
}
});
// 开始导出
function startExport(url, cookies, months, articlesOnly, useProxy, proxyApiUrl) {
// 隐藏结果框和文章列表
$('#resultBox').hide();
$('#downloadBtn').hide();
$('#articlePreview').hide();
// 显示加载框
$('#loadingBox').show();
$('#progressDetails').show();
updateProgress('开始初始化...', 0);
// 启动进度模拟
startProgressSimulation();
// 禁用按钮
$('#exportBtn').prop('disabled', true);
// 构建请求数据(始终启用代理)
const requestData = {
url: url,
cookies: cookies || '',
months: months, // 直接使用months,不要用 || 6 因为0.33是有效值
use_proxy: useProxy, // 始终启用代理
proxy_api_url: proxyApiUrl || '',
articles_only: articlesOnly // 仅爬取文章
};
// 发送请求
$.ajax({
url: '/api/export',
type: 'POST',
contentType: 'application/json',
data: JSON.stringify(requestData),
success: function(response) {
if (response.success) {
currentFilename = response.filename;
completeAllSteps();
// 显示文章列表
if (response.articles && response.articles.length > 0) {
displayArticles(response.articles, response.count);
}
setTimeout(function() {
showSuccess(`导出成功!共获取到 ${response.count} 篇文章`);
$('#downloadBtn').show();
}, 500);
} else {
stopProgressSimulation();
showError(response.message || '导出失败');
}
},
error: function(xhr, status, error) {
stopProgressSimulation();
// 检查是否需要登录
if (xhr.status === 401 || (xhr.responseJSON && xhr.responseJSON.need_login)) {
alert('登录已过期,请重新登录');
window.location.href = '/login';
return;
}
let errorMessage = '导出失败,请检查网络连接或稍后重试';
if (xhr.responseJSON && xhr.responseJSON.message) {
errorMessage = xhr.responseJSON.message;
} else if (xhr.status === 0) {
errorMessage = '无法连接到服务器,请确保后端服务已启动';
} else if (xhr.status === 500) {
errorMessage = '服务器内部错误,请稍后重试';
}
showError(errorMessage);
},
complete: function() {
// 隐藏加载框
$('#loadingBox').hide();
$('#progressDetails').hide();
// 启用按钮
$('#exportBtn').prop('disabled', false);
}
});
}
// 更新进度显示
function updateProgress(message, percent) {
$('#progressMessage').text(message);
$('#progressBar').css('width', percent + '%');
$('#progressPercent').text(Math.round(percent) + '%');
}
// 启动进度模拟
function startProgressSimulation() {
currentStep = 0;
$('.step-item').removeClass('active completed');
progressInterval = setInterval(function() {
if (currentStep < steps.length) {
// 标记当前步骤为活跃
$('.step-item').eq(currentStep).addClass('active');
// 更新进度
updateProgress(steps[currentStep].name + '...', steps[currentStep].percent);
// 模拟步骤完成
setTimeout(function() {
let step = currentStep;
$('.step-item').eq(step).removeClass('active').addClass('completed');
}, 1000);
currentStep++;
}
}, 2000); // 每2秒一个步骤
}
// 停止进度模拟
function stopProgressSimulation() {
if (progressInterval) {
clearInterval(progressInterval);
progressInterval = null;
}
}
// 完成所有步骤
function completeAllSteps() {
stopProgressSimulation();
$('.step-item').removeClass('active').addClass('completed');
updateProgress('导出完成!', 100);
}
// 显示文章列表
function displayArticles(articles, totalCount) {
$('#articleCount').text(totalCount + '篇');
$('#articleList').empty();
if (articles.length === 0) {
$('#articleList').html(`
<div class="article-empty">
<i class="bi bi-inbox"></i>
<p>暂无文章数据</p>
</div>
`);
} else {
articles.forEach(function(article, index) {
const articleHtml = `
<div class="article-item">
<div class="article-item-header">
<div class="article-number">${index + 1}</div>
<div class="article-title">${escapeHtml(article['标题'] || '未知标题')}</div>
</div>
<div class="article-meta">
<div class="article-time">
<i class="bi bi-clock"></i>
${escapeHtml(article['发布时间'] || '未知')}
</div>
<div class="article-badge">
<i class="bi bi-check-circle-fill"></i>
已提取
</div>
</div>
</div>
`;
$('#articleList').append(articleHtml);
});
// 如果总数大于显示数,显示提示
if (totalCount > articles.length) {
$('#articleList').append(`
<div class="article-item" style="text-align: center; color: var(--text-secondary);">
<i class="bi bi-three-dots"></i>
                        还有 ${totalCount - articles.length} 篇文章未显示,请下载Excel查看完整列表
</div>
`);
}
}
$('#articlePreview').fadeIn();
// 滚动到文章列表
setTimeout(function() {
$('html, body').animate({
scrollTop: $('#articlePreview').offset().top - 20
}, 500);
}, 300);
}
// HTML转义
function escapeHtml(text) {
const map = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#039;'
};
return text.replace(/[&<>"']/g, function(m) { return map[m]; });
}
// 显示成功消息
function showSuccess(message) {
$('#resultMessage')
.removeClass('error')
.addClass('success')
.html(`${message}`);
$('#resultBox').fadeIn();
}
// 显示错误消息
function showError(message) {
$('#resultMessage')
.removeClass('success')
.addClass('error')
.html(`${message}`);
$('#resultBox').fadeIn();
$('#downloadBtn').hide();
}
// 添加输入框焦点效果
$('#authorUrl').focus(function() {
$(this).parent().addClass('focused');
}).blur(function() {
$(this).parent().removeClass('focused');
});
// 添加到队列按钮点击事件
$('#addToQueueBtn').click(function() {
const url = $('#authorUrl').val().trim();
const months = parseFloat($('#monthsSelect').val());
const articlesOnly = $('#articlesOnlyCheckbox').is(':checked'); // 获取是否只爬取文章
// 验证URL
if (!url) {
showError('请输入百家号作者主页地址');
return;
}
if (!url.includes('baijiahao.baidu.com') || !url.includes('app_id=')) {
            showError('URL格式不正确,请输入完整的百家号作者主页地址');
return;
}
// 添加到队列(始终启用默认代理)
$.ajax({
url: '/api/queue/add',
type: 'POST',
contentType: 'application/json',
data: JSON.stringify({
url: url,
months: months,
use_proxy: true, // 始终启用代理
proxy_api_url: '', // 使用默认代理API
articles_only: articlesOnly // 仅爬取文章
}),
success: function(response) {
if (response.success) {
showSuccess('任务已添加到队列,系统将后台处理');
// 3秒后跳转到队列页面
setTimeout(function() {
window.location.href = '/queue';
}, 3000);
} else {
showError(response.message || '添加任务失败');
}
},
error: function(xhr) {
if (xhr.status === 401) {
alert('登录已过期,请重新登录');
window.location.href = '/login';
return;
}
showError('添加任务失败,请稍后重试');
}
});
});
});
}
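The `escapeHtml` helper above does a single-pass, table-driven replacement, so the `&amp;` produced by one substitution can never be re-escaped by a later one. The same idea, restated in Python purely for illustration:

```python
# Python restatement of main.js's escapeHtml: one regex pass over a
# replacement table avoids the classic double-escaping pitfall that a
# chain of str.replace calls would have.
import re

_HTML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#039;",
}

def escape_html(text: str) -> str:
    return re.sub(r"[&<>\"']", lambda m: _HTML_ESCAPES[m.group(0)], text)

print(escape_html('<b>"Tom & Jerry"</b>'))
# -> &lt;b&gt;&quot;Tom &amp; Jerry&quot;&lt;/b&gt;
```

The single pass matters: replacing `<` first and `&` afterwards would turn `&lt;` into `&amp;lt;`.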

stop.sh Normal file

@@ -0,0 +1,83 @@
#!/bin/bash
###############################################################################
# 百家号爬虫系统 - 停止脚本
# 功能:停止服务及所有任务线程
###############################################################################
# 颜色定义
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# 项目配置
PROJECT_DIR="$(cd "$(dirname "$0")" && pwd)"
APP_PORT=8030
PID_FILE="${PROJECT_DIR}/app.pid"
GUNICORN_PID_FILE="${PROJECT_DIR}/gunicorn.pid"
echo -e "${YELLOW}=========================================${NC}"
echo -e "${YELLOW} 停止百家号爬虫服务${NC}"
echo -e "${YELLOW}=========================================${NC}"
echo ""
# 通过 PID 文件停止
if [[ -f "${PID_FILE}" ]]; then
PID=$(cat ${PID_FILE})
if ps -p ${PID} > /dev/null 2>&1; then
echo "停止主进程 (PID: ${PID})..."
kill ${PID} 2>/dev/null
sleep 2
# 如果还在运行,强制终止
if ps -p ${PID} > /dev/null 2>&1; then
echo "强制终止主进程..."
kill -9 ${PID} 2>/dev/null
fi
echo -e "${GREEN}${NC} 主进程已停止"
fi
rm -f ${PID_FILE}
fi
# 停止 Gunicorn
if [[ -f "${GUNICORN_PID_FILE}" ]]; then
GUNICORN_PID=$(cat ${GUNICORN_PID_FILE})
if ps -p ${GUNICORN_PID} > /dev/null 2>&1; then
echo "停止 Gunicorn 进程 (PID: ${GUNICORN_PID})..."
kill ${GUNICORN_PID} 2>/dev/null
sleep 2
if ps -p ${GUNICORN_PID} > /dev/null 2>&1; then
echo "强制终止 Gunicorn..."
kill -9 ${GUNICORN_PID} 2>/dev/null
fi
echo -e "${GREEN}${NC} Gunicorn 进程已停止"
fi
rm -f ${GUNICORN_PID_FILE}
fi
# 清理所有 app.py 进程
APP_PIDS=$(ps aux | grep "[p]ython.*app.py" | awk '{print $2}')
if [[ -n "${APP_PIDS}" ]]; then
echo "清理所有相关进程..."
for pid in ${APP_PIDS}; do
echo " 停止进程 ${pid}..."
kill -9 ${pid} 2>/dev/null
done
echo -e "${GREEN}${NC} 所有进程已清理"
else
echo -e "${GREEN}${NC} 未发现运行中的进程"
fi
# 清理端口占用
PORT_PID=$(lsof -ti:${APP_PORT} 2>/dev/null)
if [[ -n "${PORT_PID}" ]]; then
echo "释放端口 ${APP_PORT}..."
kill -9 ${PORT_PID} 2>/dev/null
echo -e "${GREEN}${NC} 端口已释放"
fi
echo ""
echo -e "${GREEN}服务已完全停止!${NC}"
echo ""
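The stop sequence above follows one protocol throughout: read the PID file, send a polite `kill`, wait, escalate to `kill -9` if the process survives, then clean up. A Python sketch of that protocol (an illustration, not a replacement for the script):

```python
# PID-file stop protocol as used by stop.sh: SIGTERM first, probe with
# signal 0 (the `ps -p` equivalent), SIGKILL as a last resort, and always
# remove the PID file afterwards.
import os
import signal
import time

def stop_from_pidfile(pid_file, grace=2.0):
    if not os.path.exists(pid_file):
        return False
    pid = int(open(pid_file).read().strip())
    try:
        os.kill(pid, signal.SIGTERM)   # polite stop, like `kill $PID`
        time.sleep(grace)              # give the process time to exit
        os.kill(pid, 0)                # raises ProcessLookupError if gone
        os.kill(pid, signal.SIGKILL)   # force stop, like `kill -9`
    except ProcessLookupError:
        pass                           # already exited during the grace period
    finally:
        os.remove(pid_file)
    return True
```

Signal 0 delivers nothing; it only checks that the PID still exists, which is why the shell script's `ps -p` check maps onto it directly.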

task_queue.py Normal file

@@ -0,0 +1,357 @@
# -*- coding: utf-8 -*-
"""
任务队列管理模块
支持离线处理、进度跟踪、结果导出
使用 SQLite 数据库存储(替代原 JSON 文件)
"""
import os
import threading
import time
from datetime import datetime
from enum import Enum
import logging
from database import get_database, migrate_from_json
logger = logging.getLogger(__name__)
class TaskStatus(Enum):
"""任务状态"""
PENDING = "pending" # 就绪(准备好了,等待工作线程)
PROCESSING = "processing" # 进行中
COMPLETED = "completed" # 完成
FAILED = "failed" # 失败
PAUSED = "paused" # 暂停(将在指定时间后自动恢复)
class TaskQueue:
"""任务队列管理器(使用 SQLite 数据库)"""
def __init__(self, queue_file="data/task_queue.json", results_dir="data/results"):
self.results_dir = results_dir
self.lock = threading.Lock()
self.db = get_database()
self._ensure_dirs()
# 从旧 JSON 文件迁移数据(只执行一次)
if os.path.exists(queue_file):
migrate_from_json(queue_file)
def _ensure_dirs(self):
"""确保必要的目录存在"""
os.makedirs(self.results_dir, exist_ok=True)
def add_task(self, url, months=6, use_proxy=False, proxy_api_url=None, username=None, articles_only=True):
"""添加新任务到队列
Args:
url: 百家号URL
months: 获取月数
use_proxy: 是否使用代理
proxy_api_url: 代理API地址
username: 用户名
articles_only: 是否仅爬取文章(跳过视频)
Returns:
task_id: 任务ID
"""
with self.lock:
task_id = f"task_{int(time.time() * 1000)}"
created_at = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
with self.db.get_connection() as conn:
cursor = conn.cursor()
cursor.execute('''
INSERT INTO tasks (
task_id, url, months, use_proxy, proxy_api_url,
username, status, created_at, progress, current_step,
total_articles, processed_articles, articles_only
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (
task_id, url, months, 1 if use_proxy else 0, proxy_api_url,
username, TaskStatus.PENDING.value, created_at, 0, "等待处理",
0, 0, 1 if articles_only else 0
))
conn.commit()
logger.info(f"添加任务: {task_id}")
return task_id
def get_task(self, task_id):
"""获取任务信息"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
cursor.execute(
"SELECT * FROM tasks WHERE task_id = ?",
(task_id,)
)
row = cursor.fetchone()
if row:
task = dict(row)
# 将 use_proxy 从整数转换为布尔值
task['use_proxy'] = bool(task['use_proxy'])
# 将 articles_only 从整数转换为布尔值
task['articles_only'] = bool(task.get('articles_only', 1))
return task
return None
def get_all_tasks(self, username=None):
"""获取所有任务(可按用户过滤)"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
if username:
cursor.execute(
"SELECT * FROM tasks WHERE username = ? ORDER BY created_at DESC",
(username,)
)
else:
cursor.execute("SELECT * FROM tasks ORDER BY created_at DESC")
rows = cursor.fetchall()
tasks = []
for row in rows:
task = dict(row)
# 将 use_proxy 从整数转换为布尔值
task['use_proxy'] = bool(task['use_proxy'])
# 将 articles_only 从整数转换为布尔值
task['articles_only'] = bool(task.get('articles_only', 1))
tasks.append(task)
return tasks
def get_pending_task(self):
"""获取下一个待处理的任务(包括检查暂停任务是否可恢复)"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
# 首先检查是否有暂停任务需要恢复
from datetime import datetime, timedelta
current_time = datetime.now()
cursor.execute(
"SELECT * FROM tasks WHERE status = ? ORDER BY paused_at ASC",
(TaskStatus.PAUSED.value,)
)
paused_tasks = cursor.fetchall()
for row in paused_tasks:
paused_at_str = row['paused_at']
if paused_at_str:
paused_at = datetime.strptime(paused_at_str, '%Y-%m-%d %H:%M:%S')
# 检查是否已经暂停超过10分钟
if current_time - paused_at >= timedelta(minutes=10):
task_id = row['task_id']
                            # 恢复任务为待处理状态(保留 last_page 和 last_ctime
cursor.execute("""
UPDATE tasks SET
status = ?,
current_step = ?,
retry_count = 0,
paused_at = NULL
WHERE task_id = ?
""", (TaskStatus.PENDING.value, "等待处理(从断点继续)", task_id))
conn.commit()
                            # 注意sqlite3.Row 没有 .get 方法,需先转为 dict
                            logger.info(f"任务 {task_id} 已从暂停状态恢复,将从第{dict(row).get('last_page') or 1}页继续")
# 获取待处理任务
cursor.execute(
"SELECT * FROM tasks WHERE status = ? ORDER BY created_at ASC LIMIT 1",
(TaskStatus.PENDING.value,)
)
row = cursor.fetchone()
if row:
task = dict(row)
# 将 use_proxy 从整数转换为布尔值
task['use_proxy'] = bool(task['use_proxy'])
# 将 articles_only 从整数转换为布尔值
task['articles_only'] = bool(task.get('articles_only', 1))
return task
return None
def update_task_status(self, task_id, status, **kwargs):
"""更新任务状态
Args:
task_id: 任务ID
status: 新状态
**kwargs: 其他要更新的字段
"""
with self.lock:
status_value = status.value if isinstance(status, TaskStatus) else status
# 准备更新字段
update_fields = {"status": status_value}
# 更新时间戳
if status == TaskStatus.PROCESSING:
update_fields["started_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
elif status in [TaskStatus.COMPLETED, TaskStatus.FAILED]:
update_fields["completed_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# 合并其他字段
update_fields.update(kwargs)
# 构建 SQL 更新语句
set_clause = ", ".join([f"{key} = ?" for key in update_fields.keys()])
values = list(update_fields.values()) + [task_id]
with self.db.get_connection() as conn:
cursor = conn.cursor()
cursor.execute(
f"UPDATE tasks SET {set_clause} WHERE task_id = ?",
values
)
conn.commit()
return cursor.rowcount > 0
def update_task_progress(self, task_id, progress, current_step=None, processed_articles=None):
"""更新任务进度
Args:
task_id: 任务ID
progress: 进度百分比 (0-100)
current_step: 当前步骤描述
processed_articles: 已处理文章数
"""
with self.lock:
update_fields = {
"progress": min(100, max(0, progress))
}
if current_step is not None:
update_fields["current_step"] = current_step
if processed_articles is not None:
update_fields["processed_articles"] = processed_articles
set_clause = ", ".join([f"{key} = ?" for key in update_fields.keys()])
values = list(update_fields.values()) + [task_id]
with self.db.get_connection() as conn:
cursor = conn.cursor()
cursor.execute(
f"UPDATE tasks SET {set_clause} WHERE task_id = ?",
values
)
conn.commit()
return cursor.rowcount > 0
def get_queue_stats(self, username=None):
"""获取队列统计信息"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
# 基础查询
if username:
base_query = "SELECT status, COUNT(*) as count FROM tasks WHERE username = ? GROUP BY status"
cursor.execute(base_query, (username,))
else:
base_query = "SELECT status, COUNT(*) as count FROM tasks GROUP BY status"
cursor.execute(base_query)
# 统计各状态数量
status_counts = {row["status"]: row["count"] for row in cursor.fetchall()}
# 获取总数
if username:
cursor.execute("SELECT COUNT(*) as total FROM tasks WHERE username = ?", (username,))
else:
cursor.execute("SELECT COUNT(*) as total FROM tasks")
total = cursor.fetchone()["total"]
stats = {
"total": total,
"pending": status_counts.get(TaskStatus.PENDING.value, 0),
"processing": status_counts.get(TaskStatus.PROCESSING.value, 0),
"completed": status_counts.get(TaskStatus.COMPLETED.value, 0),
"failed": status_counts.get(TaskStatus.FAILED.value, 0),
"paused": status_counts.get(TaskStatus.PAUSED.value, 0)
}
return stats
def delete_task(self, task_id):
"""删除任务(先自动终止再删除)"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
# 检查任务是否存在
cursor.execute("SELECT status FROM tasks WHERE task_id = ?", (task_id,))
row = cursor.fetchone()
if not row:
return False
# 如果任务还在运行,先终止
if row["status"] in [TaskStatus.PENDING.value, TaskStatus.PROCESSING.value]:
cursor.execute('''
UPDATE tasks SET
status = ?,
error = ?,
current_step = ?,
completed_at = ?
WHERE task_id = ?
''', (
TaskStatus.FAILED.value,
"任务已被用户删除",
"任务已终止",
datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
task_id
))
conn.commit()
logger.info(f"终止任务: {task_id}")
# 然后从数据库中删除
cursor.execute("DELETE FROM tasks WHERE task_id = ?", (task_id,))
conn.commit()
if cursor.rowcount > 0:
logger.info(f"删除任务: {task_id}")
return True
return False
def cancel_task(self, task_id):
"""终止任务(将等待中或处理中任务标记为失败)"""
with self.lock:
with self.db.get_connection() as conn:
cursor = conn.cursor()
# 检查任务状态
cursor.execute("SELECT status FROM tasks WHERE task_id = ?", (task_id,))
row = cursor.fetchone()
if row and row["status"] in [TaskStatus.PENDING.value, TaskStatus.PROCESSING.value]:
cursor.execute('''
UPDATE tasks SET
status = ?,
error = ?,
current_step = ?,
completed_at = ?
WHERE task_id = ?
''', (
TaskStatus.FAILED.value,
"任务已被用户终止",
"任务已终止",
datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
task_id
))
conn.commit()
return cursor.rowcount > 0
return False
# 全局队列实例
_task_queue = None
def get_task_queue():
"""获取全局任务队列实例"""
global _task_queue
if _task_queue is None:
_task_queue = TaskQueue()
return _task_queue
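`update_task_status` and `update_task_progress` share one pattern: keyword arguments become `col = ?` pairs while the values stay parameterized. A self-contained sketch against an in-memory table (the schema is reduced to four columns for illustration):

```python
# Self-contained sketch of the dynamic UPDATE used by
# TaskQueue.update_task_status: keys become "col = ?" pairs, values stay
# parameterized, and rowcount reports whether the task existed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, "
             "progress INTEGER, current_step TEXT)")
conn.execute("INSERT INTO tasks VALUES ('task_1', 'pending', 0, 'waiting')")

def update_task(conn, task_id, **fields):
    set_clause = ", ".join(f"{key} = ?" for key in fields)
    cur = conn.execute(
        f"UPDATE tasks SET {set_clause} WHERE task_id = ?",
        list(fields.values()) + [task_id],
    )
    conn.commit()
    return cur.rowcount > 0

print(update_task(conn, "task_1", status="processing", progress=30))  # True
row = conn.execute("SELECT * FROM tasks WHERE task_id = 'task_1'").fetchone()
print(row["status"], row["progress"])  # processing 30
print(update_task(conn, "no_such_task", status="failed"))  # False
```

Because the field *names* are interpolated into the SQL string, this is only safe while the keys come from trusted internal callers — the same caveat applies to the original methods.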

task_worker.py Normal file

@@ -0,0 +1,487 @@
# -*- coding: utf-8 -*-
"""
任务处理器 - 后台并发处理队列中的任务
支持动态调整并发数,通过 SocketIO 实时推送进度和日志
"""
import threading
import time
import logging
import traceback
import psutil
from task_queue import get_task_queue, TaskStatus
logger = logging.getLogger(__name__)
# 全局变量,用于存储 socketio 实例
_socketio_instance = None
def set_socketio(socketio):
"""设置 SocketIO 实例"""
global _socketio_instance
_socketio_instance = socketio
def emit_log(task_id, message, level='info'):
"""保存日志到数据库"""
from datetime import datetime
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# 保存到数据库
try:
from database import get_database
db = get_database()
db.add_task_log(task_id, message, level, timestamp)
except Exception as e:
logger.error(f"保存日志到数据库失败: {e}")
logger.info(f"[{task_id}] {message}")
def emit_progress(task_id, progress, current_step='', **kwargs):
"""更新任务进度"""
logger.info(f"[{task_id}] 进度: {progress}% - {current_step}")
class TaskWorker:
"""任务处理工作线程池(支持动态并发)"""
def __init__(self, min_workers=1, max_workers=3):
self.queue = get_task_queue()
self.running = False
self.min_workers = min_workers # 最小并发数
self.max_workers = max_workers # 最大并发数
self.current_workers = min_workers # 当前并发数
self.worker_threads = [] # 工作线程列表
self.processing_tasks = set() # 正在处理的任务ID集合
self.lock = threading.Lock()
def start(self):
"""启动工作线程池"""
if self.running:
logger.warning("工作线程池已经在运行")
return
self.running = True
# 启动初始工作线程
for i in range(self.min_workers):
self._start_worker(i)
# 启动动态调整线程
self.adjust_thread = threading.Thread(target=self._adjust_workers, daemon=True)
self.adjust_thread.start()
        logger.info(f"任务处理器已启动(初始并发数: {self.min_workers},最大并发数: {self.max_workers})")
def _start_worker(self, worker_id):
"""启动一个工作线程"""
thread = threading.Thread(target=self._work_loop, args=(worker_id,), daemon=True)
thread.start()
with self.lock:
self.worker_threads.append(thread)
logger.info(f"工作线程 #{worker_id} 已启动")
def stop(self):
"""停止工作线程池"""
self.running = False
for thread in self.worker_threads:
if thread and thread.is_alive():
thread.join(timeout=5)
logger.info("任务处理器已停止")
def _work_loop(self, worker_id):
"""工作循环(单个线程)"""
logger.info(f"工作线程 #{worker_id} 进入循环")
        while self.running:
            try:
                # 并发数被调低时,编号超出目标的线程自然退出
                if worker_id >= self.current_workers:
                    break
                # 获取待处理任务
                task = self.queue.get_pending_task()
                if task:
                    task_id = task["task_id"]
                    # 检查是否已经有其他线程在处理这个任务
                    with self.lock:
                        already_processing = task_id in self.processing_tasks
                        if not already_processing:
                            self.processing_tasks.add(task_id)
                    if already_processing:
                        time.sleep(1)  # 该任务已有线程处理,稍等后再取任务,避免忙等
                        continue
try:
logger.info(f"工作线程 #{worker_id} 开始处理任务: {task_id}")
self._process_task(task, worker_id)
finally:
# 处理完成后从集合中移除
with self.lock:
self.processing_tasks.discard(task_id)
else:
# 没有任务,休息一会
time.sleep(2)
except Exception as e:
logger.error(f"工作线程 #{worker_id} 错误: {e}")
logger.error(traceback.format_exc())
time.sleep(5)
logger.info(f"工作线程 #{worker_id} 退出循环")
def _adjust_workers(self):
"""动态调整工作线程数量"""
logger.info("动态调整线程已启动")
while self.running:
try:
time.sleep(10) # 每10秒检查一次
# 获取系统资源信息
cpu_percent = psutil.cpu_percent(interval=1)
memory_percent = psutil.virtual_memory().percent
# 获取队列信息
pending_count = len([t for t in self.queue.get_all_tasks() if t.get('status') == 'pending'])
processing_count = len(self.processing_tasks)
# 决策逻辑
target_workers = self._calculate_target_workers(
pending_count,
processing_count,
cpu_percent,
memory_percent
)
# 调整线程数
if target_workers > self.current_workers:
# 增加线程
for i in range(self.current_workers, target_workers):
self._start_worker(i)
logger.info(f"增加工作线程: {self.current_workers} -> {target_workers}")
self.current_workers = target_workers
elif target_workers < self.current_workers:
# 减少线程(自然退出,不强制终止)
logger.info(f"准备减少工作线程: {self.current_workers} -> {target_workers}")
self.current_workers = target_workers
except Exception as e:
logger.error(f"调整线程数错误: {e}")
logger.error(traceback.format_exc())
time.sleep(30)
def _calculate_target_workers(self, pending_count, processing_count, cpu_percent, memory_percent):
"""计算目标线程数"""
# 基本逻辑:
# 1. 如果没有待处理任务,保持最小线程数
# 2. 如果有很多待处理任务,且系统资源充足,增加线程
# 3. 如果系统资源紧张,减少线程
if pending_count == 0:
return self.min_workers
# 系统资源紧张CPU>80% 或 内存>85%
if cpu_percent > 80 or memory_percent > 85:
logger.warning(f"系统资源紧张 (CPU: {cpu_percent}%, 内存: {memory_percent}%)")
return max(self.min_workers, self.current_workers - 1)
# 系统资源充足
if cpu_percent < 50 and memory_percent < 70:
# 根据待处理任务数决定线程数
if pending_count >= 3:
return min(self.max_workers, self.current_workers + 1)
elif pending_count >= 1:
return min(self.max_workers, max(2, self.current_workers))
# 默认保持当前线程数
return self.current_workers
def _process_task(self, task, worker_id):
"""处理单个任务"""
task_id = task["task_id"]
logger.info(f"工作线程 #{worker_id} 开始处理任务: {task_id}")
emit_log(task_id, f"任务开始处理 (Worker #{worker_id})")
# 获取当前重试次数
retry_count = task.get("retry_count", 0)
try:
# 更新状态为处理中
self.queue.update_task_status(
task_id,
TaskStatus.PROCESSING,
current_step="准备处理"
)
emit_progress(task_id, 5, "准备处理")
# 导入必要的模块
from app import BaijiahaoScraper
import pandas as pd
import os
# 步骤1: 解析URL获取UK
self.queue.update_task_progress(task_id, 10, "解析URL获取UK")
emit_progress(task_id, 10, "解析URL获取UK")
emit_log(task_id, f"URL: {task['url']}")
url = task["url"]
use_proxy = task.get("use_proxy", False)
proxy_api_url = task.get("proxy_api_url")
articles_only = task.get("articles_only", True) # 获取是否仅爬取文章
if use_proxy:
emit_log(task_id, "已启用代理IP池", "info")
if articles_only:
emit_log(task_id, "已启用文章过滤(跳过视频内容)", "info")
# 提取app_id
import re
app_id_match = re.search(r'app_id=([^&\s]+)', url)
if not app_id_match:
raise Exception("无法从 URL 中提取 app_id")
app_id = app_id_match.group(1)
emit_log(task_id, f"解析到 app_id: {app_id}")
# 获取UK
emit_log(task_id, "正在获取用户 UK...")
try:
uk, cookies = BaijiahaoScraper.get_uk_from_app_id(
app_id,
use_proxy=use_proxy,
proxy_api_url=proxy_api_url
)
emit_log(task_id, f"成功获取 UK: {uk[:20]}...")
except Exception as uk_error:
emit_log(task_id, f"获取UK失败: {str(uk_error)}", "error")
raise
# 步骤2: 初始化爬虫
self.queue.update_task_progress(task_id, 20, "初始化爬虫")
emit_progress(task_id, 20, "初始化爬虫")
emit_log(task_id, "初始化爬虫实例...")
scraper = BaijiahaoScraper(
uk=uk,
cookies=cookies,
use_proxy=use_proxy,
proxy_api_url=proxy_api_url
)
# 步骤3: 获取文章列表
months = task.get("months", 6)
# 检查是否有断点续传数据
            last_page = task.get("last_page") or 0  # 字段可能为 NULL,需归一为 0
last_ctime = task.get("last_ctime")
start_page = 1
start_ctime = None
if last_page > 0 and last_ctime:
# 断点续传
start_page = last_page
start_ctime = last_ctime
emit_log(task_id, f"🔄 检测到断点数据,从第{start_page}页继续爬取", "info")
# 检查缓存中是否有数据
from database import get_database
db = get_database()
cached_count = db.get_cached_article_count(task_id)
if cached_count > 0:
emit_log(task_id, f"💾 已缓存 {cached_count} 篇文章,将继续爬取...", "info")
else:
# 新任务,清除之前的缓存(如果有)
from database import get_database
db = get_database()
db.clear_article_cache(task_id)
self.queue.update_task_progress(task_id, 30, f"获取文章列表(近{months}个月)")
emit_progress(task_id, 30, f"获取文章列表(近{months}个月)")
emit_log(task_id, f"开始获取近 {months} 个月的文章...")
            emit_log(task_id, "提示:抓取过程较慢(8-12秒/页),请耐心等待...", "info")
            emit_log(task_id, "系统正在使用代理IP池抓取数据,过程中会自动切换IP应对反爬...", "info")
# 定义保存回调函数:每页数据立即保存到数据库
from database import get_database
db = get_database()
def save_page_to_db(page, articles, ctime):
"""保存每页数据到数据库缓存"""
if articles:
db.save_articles_batch(task_id, articles, page)
emit_log(task_id, f"💾 第{page}页数据已保存,{len(articles)}篇文章", "success")
# 更新任务的断点信息
self.queue.update_task_status(
task_id,
TaskStatus.PROCESSING,
last_page=page,
last_ctime=ctime
)
# 更新进度粗略估计30-80%区间)
total_cached = db.get_cached_article_count(task_id)
progress = min(30 + int(page * 2), 80) # 每页增加2%最多80%
self.queue.update_task_progress(
task_id,
progress,
f"正在抓取第{page}页...",
processed_articles=total_cached
)
emit_progress(task_id, progress, f"正在抓取第{page}页...", processed_articles=total_cached)
# 调用 get_articles传入回调函数和断点参数
result = scraper.get_articles(
months=months,
app_id=app_id,
articles_only=articles_only,
task_id=task_id,
on_page_fetched=save_page_to_db,
start_page=start_page,
start_ctime=start_ctime
)
            if not result or not result.get('completed'):
                # 未完成,保留断点信息以便续传result 可能为 None不能直接调用 .get
                last_saved = (result or {}).get('last_page', start_page)
                raise Exception(f"抓取未完成,已保存到第{last_saved}")
# 从数据库读取全部缓存文章
articles = db.get_cached_articles(task_id)
if not articles:
raise Exception("未获取到文章数据")
# 更新总文章数
total = len(articles)
self.queue.update_task_status(
task_id,
TaskStatus.PROCESSING,
total_articles=total
)
emit_log(task_id, f"成功获取 {total} 篇文章", "success")
# 步骤4: 生成Excel直接使用数据库中的数据
self.queue.update_task_progress(task_id, 90, "生成Excel文件")
emit_progress(task_id, 90, "生成Excel文件")
emit_log(task_id, "正在生成 Excel 文件...")
df = pd.DataFrame(articles)
# 生成文件名
timestamp = time.strftime("%Y%m%d_%H%M%S")
filename = f"百家号文章_{app_id}_{timestamp}.xlsx"
result_file = os.path.join(self.queue.results_dir, filename)
# 保存Excel
with pd.ExcelWriter(result_file, engine='openpyxl') as writer:
df.to_excel(writer, index=False, sheet_name='文章列表')
# 调整列宽
worksheet = writer.sheets['文章列表']
worksheet.column_dimensions['A'].width = 80 # 标题列
worksheet.column_dimensions['B'].width = 20 # 时间列
            emit_log(task_id, f"Excel 文件已生成: {filename}")
# 清除缓存数据(任务已完成)
db.clear_article_cache(task_id)
emit_log(task_id, "🗑️ 已清除缓存数据", "info")
# 步骤5: 完成
self.queue.update_task_status(
task_id,
TaskStatus.COMPLETED,
progress=100,
current_step="处理完成",
result_file=filename,
processed_articles=total
)
emit_progress(task_id, 100, "处理完成", result_file=filename)
emit_log(task_id, f"✅ 任务完成!导出 {total} 篇文章", "success")
logger.info(f"工作线程 #{worker_id} 任务完成: {task_id}, 导出 {total} 篇文章")
except Exception as e:
error_msg = str(e)
logger.error(f"工作线程 #{worker_id} 任务失败: {task_id}, 错误: {error_msg}")
# 记录详细错误堆栈
error_traceback = traceback.format_exc()
logger.error(error_traceback)
# 将错误堆栈也推送到前端(分行推送)
emit_log(task_id, f"❌ 任务失败: {error_msg}", "error")
# 推送错误详情(每行作为独立日志)
for line in error_traceback.split('\n'):
if line.strip():
emit_log(task_id, line, "error")
# 判断是否需要重试或暂停
retry_count += 1
# 检查是否有缓存数据(如果有,说明部分成功)
from database import get_database
db = get_database()
cached_count = db.get_cached_article_count(task_id)
# 如果已经有缓存数据,说明部分成功,增加重试次数
max_retries = 10 if cached_count > 0 else 3 # 有缓存时允许10次重试
# 如果连续失败超过上限暂停任务10分钟
if retry_count >= max_retries:
from datetime import datetime
paused_at = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
if cached_count > 0:
                    emit_log(task_id, f"⚠️ 连续失败{retry_count}次,已缓存{cached_count}篇文章,10分钟后继续尝试", "warning")
                else:
                    emit_log(task_id, f"⚠️ 连续失败{retry_count}次,任务将暂停,10分钟后自动重试", "warning")
self.queue.update_task_status(
task_id,
TaskStatus.PAUSED,
error=error_msg,
last_error=error_msg,
retry_count=retry_count,
paused_at=paused_at,
                    current_step=f"暂停中,10分钟后重试 - 错误: {error_msg}"
                )
                emit_progress(task_id, 0, "暂停中,10分钟后重试")
                logger.warning(f"任务 {task_id} 已暂停,将在 {paused_at} 之后10分钟恢复(已缓存{cached_count}篇)")
else:
# 重试次数未达到上限,标记为待处理状态,等待下次重试
if cached_count > 0:
emit_log(task_id, f"⚠️ 任务失败,将进行第{retry_count + 1}次重试(已缓存{cached_count}篇)", "warning")
else:
emit_log(task_id, f"⚠️ 任务失败,将进行第{retry_count + 1}次重试", "warning")
self.queue.update_task_status(
task_id,
TaskStatus.PENDING,
error=error_msg,
last_error=error_msg,
retry_count=retry_count,
current_step=f"等待重试 (已失败{retry_count}次) - {error_msg}"
)
emit_progress(task_id, 0, f"等待重试")
# 全局工作线程
_worker = None
def get_task_worker():
"""获取全局任务处理器实例"""
global _worker
if _worker is None:
_worker = TaskWorker()
return _worker
def start_task_worker():
"""启动任务处理器"""
worker = get_task_worker()
worker.start()
def stop_task_worker():
"""停止任务处理器"""
worker = get_task_worker()
worker.stop()
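The worker's failure handling above follows a simple policy: up to 3 consecutive retries normally, 10 when partial results are already cached, and a 10-minute pause once the limit is hit. A minimal standalone sketch of that decision rule (illustrative only; `decide_retry_action` is not a function in this repo):

```python
def decide_retry_action(retry_count: int, cached_count: int):
    """Mirror the worker's retry policy described above.

    Returns (next_status, pause_seconds): 'paused' with a 10-minute
    pause once the retry budget is exhausted, otherwise 'pending'
    so the task is requeued for another attempt.
    """
    # Partial success (cached articles) earns a higher retry budget.
    max_retries = 10 if cached_count > 0 else 3
    if retry_count >= max_retries:
        return "paused", 600
    return "pending", 0
```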

218
taskworker_monitor.py Normal file

@@ -0,0 +1,218 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
TaskWorker 自动监控和恢复守护进程
用于生产环境中自动检测和修复任务卡住的问题
"""
import os
import sys
import time
import logging
import signal
import threading
from datetime import datetime
# 确保日志目录存在FileHandler 在模块导入时创建文件)
os.makedirs('logs', exist_ok=True)
# 配置日志
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s [%(levelname)s] %(message)s',
handlers=[
logging.FileHandler('logs/taskworker_monitor.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
class TaskWorkerMonitor:
"""TaskWorker 监控器"""
def __init__(self, check_interval=60):
"""
Args:
check_interval: 检查间隔默认60秒
"""
self.check_interval = check_interval
self.running = False
self.monitor_thread = None
def check_worker_status(self):
"""检查 TaskWorker 状态"""
try:
from task_worker import get_task_worker
from task_queue import get_task_queue
worker = get_task_worker()
queue = get_task_queue()
# 获取任务统计
tasks = queue.get_all_tasks()
pending_count = len([t for t in tasks if t.get('status') == 'pending'])
processing_count = len([t for t in tasks if t.get('status') == 'processing'])
# 检查 worker 状态
is_running = worker.running
alive_threads = sum(1 for t in worker.worker_threads if t and t.is_alive())
logger.info(f"状态检查 - 运行:{is_running} 活跃线程:{alive_threads} "
f"待处理:{pending_count} 处理中:{processing_count}")
# 判断是否需要修复
need_fix = False
reason = ""
if not is_running:
need_fix = True
reason = "TaskWorker 未运行"
elif alive_threads == 0 and pending_count > 0:
need_fix = True
reason = f"{pending_count} 个待处理任务,但没有活跃线程"
elif processing_count > 0:
# 检查处理中的任务是否长时间未更新
# 这里可以添加更复杂的逻辑
pass
return need_fix, reason, {
'running': is_running,
'alive_threads': alive_threads,
'pending_count': pending_count,
'processing_count': processing_count
}
except Exception as e:
logger.error(f"检查状态失败: {e}")
import traceback
logger.error(traceback.format_exc())
return True, f"检查失败: {e}", {}
def restart_worker(self):
"""重启 TaskWorker"""
logger.warning("正在重启 TaskWorker...")
try:
from task_worker import get_task_worker
worker = get_task_worker()
# 停止现有 worker
if worker.running:
logger.info("停止现有 TaskWorker...")
worker.stop()
time.sleep(2)
# 启动新的 worker
logger.info("启动新的 TaskWorker...")
worker.start()
time.sleep(2)
# 验证启动状态
if worker.running:
alive_threads = sum(1 for t in worker.worker_threads if t and t.is_alive())
logger.info(f"✅ TaskWorker 重启成功,活跃线程: {alive_threads}")
return True
else:
logger.error("❌ TaskWorker 重启后未运行")
return False
except Exception as e:
logger.error(f"重启 TaskWorker 失败: {e}")
import traceback
logger.error(traceback.format_exc())
return False
def monitor_loop(self):
"""监控循环"""
logger.info(f"监控循环启动,检查间隔: {self.check_interval}秒")
consecutive_failures = 0
max_consecutive_failures = 3
while self.running:
try:
# 检查状态
need_fix, reason, status = self.check_worker_status()
if need_fix:
logger.warning(f"⚠️ 检测到问题: {reason}")
logger.info(f"状态详情: {status}")
# 尝试修复
if self.restart_worker():
logger.info("✅ 自动修复成功")
consecutive_failures = 0
else:
consecutive_failures += 1
logger.error(f"❌ 自动修复失败 (连续失败 {consecutive_failures} 次)")
if consecutive_failures >= max_consecutive_failures:
logger.critical(f"连续修复失败 {consecutive_failures} 次,请人工介入!")
# 可以在这里发送告警通知
else:
consecutive_failures = 0
# 等待下次检查
time.sleep(self.check_interval)
except Exception as e:
logger.error(f"监控循环错误: {e}")
import traceback
logger.error(traceback.format_exc())
time.sleep(self.check_interval)
logger.info("监控循环已停止")
def start(self):
"""启动监控"""
if self.running:
logger.warning("监控已在运行")
return
self.running = True
self.monitor_thread = threading.Thread(target=self.monitor_loop, daemon=True)
self.monitor_thread.start()
logger.info("TaskWorker 监控器已启动")
def stop(self):
"""停止监控"""
self.running = False
if self.monitor_thread:
self.monitor_thread.join(timeout=5)
logger.info("TaskWorker 监控器已停止")
# 全局监控器实例(在 __main__ 中创建,供信号处理器引用)
monitor = None
def signal_handler(signum, frame):
"""信号处理器"""
logger.info(f"收到信号 {signum},正在停止...")
if monitor:
monitor.stop()
sys.exit(0)
if __name__ == '__main__':
# 创建日志目录
os.makedirs('logs', exist_ok=True)
# 注册信号处理器
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
# 创建监控器
monitor = TaskWorkerMonitor(check_interval=60) # 每60秒检查一次
print("=" * 60)
print("TaskWorker 自动监控守护进程")
print("=" * 60)
print(f"检查间隔: {monitor.check_interval}秒")
print("按 Ctrl+C 停止")
print("=" * 60)
# 启动监控
monitor.start()
# 保持运行
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
logger.info("用户中断")
monitor.stop()
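The monitor's restart criterion above boils down to two checks: the worker's running flag is off, or there are pending tasks but no live worker threads. As a self-contained sketch (hypothetical helper mirroring `check_worker_status`, not part of the repo):

```python
def need_restart(running: bool, alive_threads: int, pending_count: int):
    """Reproduce check_worker_status's decision rule as a pure function.

    Returns (need_fix, reason) just like the monitor does.
    """
    if not running:
        return True, "TaskWorker is not running"
    if alive_threads == 0 and pending_count > 0:
        return True, f"{pending_count} pending tasks but no live worker threads"
    return False, ""
```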

292
templates/index.html Normal file

@@ -0,0 +1,292 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>文章导出 - 百家号管理系统</title>
<!-- Bootstrap Icons -->
<link rel="stylesheet" href="/static/css/icons-local.css">
<link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
<!-- 主布局容器 -->
<div class="app-container">
<!-- 左侧菜单栏 -->
<aside class="sidebar">
<!-- Logo区域 -->
<div class="sidebar-logo">
<div class="sidebar-logo-icon">
<i class="bi bi-cloud-download"></i>
</div>
<div class="sidebar-logo-text">
<div class="sidebar-logo-title">百家号工具</div>
<div class="sidebar-logo-subtitle">文章导出系统</div>
</div>
</div>
<!-- 菜单导航 -->
<nav class="sidebar-nav">
<ul class="nav-menu">
<li class="nav-item">
<a href="/" class="nav-link active">
<i class="bi bi-download"></i>
<span>文章导出</span>
</a>
</li>
<li class="nav-item">
<a href="/queue" class="nav-link">
<i class="bi bi-list-task"></i>
<span>任务队列</span>
<span class="nav-badge" id="queueBadge" style="display: none;">0</span>
</a>
</li>
</ul>
</nav>
<!-- 用户信息区域 -->
<div class="sidebar-user">
<div class="user-info-card">
<div class="user-avatar">
<i class="bi bi-person-fill"></i>
</div>
<div class="user-details">
<div class="user-name">{{ username }}</div>
<div class="user-role">管理员</div>
</div>
<button class="logout-btn" id="logoutBtn" title="登出">
<i class="bi bi-box-arrow-right"></i>
</button>
</div>
</div>
</aside>
<!-- 主内容区域 -->
<main class="main-content">
<!-- 顶部导航栏 -->
<header class="top-navbar">
<h1 class="navbar-title">
<i class="bi bi-file-earmark-arrow-down"></i>
文章导出
</h1>
</header>
<!-- 内容区域 -->
<div class="content-area">
<!-- 页面头部 -->
<div class="page-header">
<h2 class="page-title">
<i class="bi bi-newspaper"></i>
百家号文章导出
</h2>
<p class="page-description">输入百家号作者主页链接,导出指定时间范围内的文章信息</p>
</div>
<!-- 输入表单卡片 -->
<div class="card">
<div class="card-body">
<div class="form-group">
<label class="form-label">
<i class="bi bi-link-45deg label-icon"></i>
百家号作者主页地址
</label>
<input
type="text"
id="authorUrl"
class="form-input"
value="https://baijiahao.baidu.com/u?app_id=1700253559210167"
placeholder="例如:https://baijiahao.baidu.com/u?app_id=1700253559210167"
>
<div class="input-hint">
<i class="bi bi-info-circle"></i>
请输入完整的百家号作者主页URL地址
</div>
</div>
<div class="form-group">
<label class="form-label">
<i class="bi bi-cookie label-icon"></i>
Cookie (可选,如果出现登录页面请填写)
</label>
<textarea
id="cookieInput"
class="form-input form-textarea"
rows="3"
placeholder="如果需要登录,请粘贴浏览器中的Cookie"
></textarea>
<div class="input-hint">
<i class="bi bi-info-circle"></i>
获取方法:打开百家号 → F12开发者工具 → Network → 刷新页面 → 点击任意请求 → 复制Request Headers中的Cookie
</div>
</div>
<div class="form-group">
<label class="form-label">
<i class="bi bi-calendar-range label-icon"></i>
时间范围
</label>
<select id="monthsSelect" class="form-select">
<option value="0.15" selected>近5天</option>
<option value="0.33">近10天</option>
<option value="1">近1个月</option>
<option value="6">近6个月</option>
<option value="12">近12个月</option>
</select>
</div>
<div class="form-group">
<label class="form-label">
<i class="bi bi-filter label-icon"></i>
内容过滤
</label>
<div class="checkbox-group">
<label class="checkbox-label">
<input type="checkbox" id="articlesOnlyCheckbox" checked>
<span class="checkbox-text">仅爬取文章(跳过视频内容)</span>
</label>
</div>
<div class="input-hint">
<i class="bi bi-info-circle"></i>
勾选后将过滤掉所有视频类型的内容,只保留文章
</div>
</div>
</div>
</div>
<!-- 信息卡片 -->
<div class="card info-card">
<div class="card-body">
<div class="info-grid">
<div class="info-item">
<i class="bi bi-file-earmark-spreadsheet info-icon"></i>
<div class="info-content">
<div class="info-label">导出格式</div>
<div class="info-value">Excel (.xlsx)</div>
</div>
</div>
<div class="info-item">
<i class="bi bi-card-list info-icon"></i>
<div class="info-content">
<div class="info-label">导出内容</div>
<div class="info-value">文章标题、发布时间</div>
</div>
</div>
</div>
</div>
</div>
<!-- 操作按钮 -->
<div style="display: flex; gap: 10px;">
<button id="exportBtn" class="btn btn-primary" style="flex: 1;">
<i class="bi bi-download"></i>
即时导出
</button>
<button id="addToQueueBtn" class="btn btn-secondary" style="flex: 1;">
<i class="bi bi-plus-circle"></i>
添加到队列
</button>
</div>
<!-- 加载状态 -->
<div id="loadingBox" class="loading-box" style="display: none;">
<div class="loading-spinner"></div>
<p class="loading-text">正在获取文章数据,请稍候...</p>
<!-- 进度详情 -->
<div id="progressDetails" class="progress-details" style="display: none;">
<div class="progress-bar-container">
<div id="progressBar" class="progress-bar"></div>
</div>
<div class="progress-info">
<span id="progressMessage" class="progress-message">初始化...</span>
<span id="progressPercent" class="progress-percent">0%</span>
</div>
<div id="progressSteps" class="progress-steps">
<div class="step-item">
<i class="bi bi-1-circle"></i>
<span>解析URL</span>
</div>
<div class="step-item">
<i class="bi bi-2-circle"></i>
<span>启动浏览器</span>
</div>
<div class="step-item">
<i class="bi bi-3-circle"></i>
<span>加载页面</span>
</div>
<div class="step-item">
<i class="bi bi-4-circle"></i>
<span>滚动获取</span>
</div>
<div class="step-item">
<i class="bi bi-5-circle"></i>
<span>提取数据</span>
</div>
<div class="step-item">
<i class="bi bi-6-circle"></i>
<span>生成Excel</span>
</div>
</div>
</div>
</div>
<!-- 结果显示 -->
<div id="resultBox" class="result-box" style="display: none;">
<div id="resultMessage" class="result-message"></div>
<button id="downloadBtn" class="btn btn-success" style="display: none;">
<i class="bi bi-file-arrow-down"></i>
下载Excel文件
</button>
</div>
<!-- 文章预览列表 -->
<div id="articlePreview" class="card article-preview" style="display: none;">
<div class="card-header">
<i class="bi bi-list-ul"></i>
已提取文章列表
<span id="articleCount" class="article-count">0篇</span>
</div>
<div class="card-body">
<div id="articleList" class="article-list"></div>
</div>
</div>
<!-- 使用说明卡片 -->
<div class="card">
<div class="card-header">
<i class="bi bi-info-circle"></i>
使用说明
</div>
<div class="card-body">
<div class="info-steps">
<div class="info-step">
<div class="step-number">1</div>
<div class="step-content">
<h4>复制URL</h4>
<p>在浏览器中打开百家号作者主页复制完整的URL地址</p>
</div>
</div>
<div class="info-step">
<div class="step-number">2</div>
<div class="step-content">
<h4>配置参数</h4>
<p>选择时间范围和代理设置如需要可填写Cookie</p>
</div>
</div>
<div class="info-step">
<div class="step-number">3</div>
<div class="step-content">
<h4>导出文章</h4>
<p>点击“即时导出”或“添加到队列”,等待处理完成</p>
</div>
</div>
</div>
</div>
</div>
</div>
</main>
</div>
<!-- jQuery - 使用国内CDN -->
<script src="/static/js/jquery.min.js"></script>
<script src="/static/js/main.js"></script>
</body>
</html>

382
templates/login.html Normal file

@@ -0,0 +1,382 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>登录 - 百家号文章导出工具</title>
<!-- Bootstrap Icons - 使用本地CSS + CDN字体 -->
<link rel="stylesheet" href="/static/css/icons-local.css">
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
:root {
--primary-color: #0052D9;
--primary-hover: #003DA6;
--bg-gradient-start: #f5f7fa;
--bg-gradient-end: #e8eef5;
--card-bg: #ffffff;
--text-primary: #1f2937;
--text-secondary: #6b7280;
--border-color: #e5e7eb;
--success-color: #10b981;
--error-color: #ef4444;
--shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
background: linear-gradient(135deg, var(--bg-gradient-start) 0%, var(--bg-gradient-end) 100%);
min-height: 100vh;
display: flex;
align-items: center;
justify-content: center;
padding: 20px;
}
.login-container {
width: 100%;
max-width: 420px;
}
.login-card {
background: var(--card-bg);
border-radius: 16px;
box-shadow: var(--shadow);
padding: 40px;
animation: fadeInUp 0.5s ease-out;
}
@keyframes fadeInUp {
from {
opacity: 0;
transform: translateY(20px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.login-header {
text-align: center;
margin-bottom: 32px;
}
.login-icon {
width: 64px;
height: 64px;
background: linear-gradient(135deg, var(--primary-color) 0%, var(--primary-hover) 100%);
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
margin: 0 auto 16px;
font-size: 32px;
color: white;
}
.login-title {
font-size: 24px;
font-weight: 600;
color: var(--text-primary);
margin-bottom: 8px;
}
.login-subtitle {
font-size: 14px;
color: var(--text-secondary);
}
.form-group {
margin-bottom: 20px;
}
.form-label {
display: block;
font-size: 14px;
font-weight: 500;
color: var(--text-primary);
margin-bottom: 8px;
}
.input-wrapper {
position: relative;
}
.input-icon {
position: absolute;
left: 12px;
top: 50%;
transform: translateY(-50%);
color: var(--text-secondary);
font-size: 18px;
}
.form-input {
width: 100%;
padding: 12px 12px 12px 40px;
font-size: 14px;
border: 1px solid var(--border-color);
border-radius: 8px;
outline: none;
transition: all 0.3s;
}
.form-input:focus {
border-color: var(--primary-color);
box-shadow: 0 0 0 3px rgba(0, 82, 217, 0.1);
}
.btn-login {
width: 100%;
padding: 14px;
font-size: 16px;
font-weight: 500;
color: white;
background: linear-gradient(135deg, var(--primary-color) 0%, var(--primary-hover) 100%);
border: none;
border-radius: 8px;
cursor: pointer;
transition: all 0.3s;
margin-top: 24px;
}
.btn-login:hover {
transform: translateY(-2px);
box-shadow: 0 10px 15px -3px rgba(0, 82, 217, 0.3);
}
.btn-login:active {
transform: translateY(0);
}
.btn-login:disabled {
opacity: 0.6;
cursor: not-allowed;
transform: none;
}
.alert {
padding: 12px 16px;
border-radius: 8px;
font-size: 14px;
margin-bottom: 20px;
display: none;
animation: slideDown 0.3s ease-out;
}
@keyframes slideDown {
from {
opacity: 0;
transform: translateY(-10px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.alert-success {
background: #d1fae5;
color: #065f46;
border: 1px solid #6ee7b7;
}
.alert-error {
background: #fee2e2;
color: #991b1b;
border: 1px solid #fca5a5;
}
.divider {
text-align: center;
margin: 24px 0;
position: relative;
}
.divider::before {
content: '';
position: absolute;
top: 50%;
left: 0;
right: 0;
height: 1px;
background: var(--border-color);
}
.divider-text {
display: inline-block;
background: var(--card-bg);
padding: 0 16px;
color: var(--text-secondary);
font-size: 14px;
position: relative;
}
.info-box {
background: #eff6ff;
border: 1px solid #bfdbfe;
border-radius: 8px;
padding: 12px 16px;
margin-top: 20px;
}
.info-box p {
margin: 4px 0;
font-size: 13px;
color: #1e40af;
}
.info-box strong {
font-weight: 600;
}
.footer {
text-align: center;
margin-top: 24px;
color: var(--text-secondary);
font-size: 14px;
}
</style>
</head>
<body>
<div class="login-container">
<div class="login-card">
<div class="login-header">
<div class="login-icon">
<i class="bi bi-shield-lock-fill"></i>
</div>
<h1 class="login-title">欢迎登录</h1>
<p class="login-subtitle">百家号文章导出工具</p>
</div>
<div id="alertBox" class="alert"></div>
<form id="loginForm">
<div class="form-group">
<label class="form-label">用户名</label>
<div class="input-wrapper">
<i class="bi bi-person-fill input-icon"></i>
<input
type="text"
id="username"
class="form-input"
placeholder="请输入用户名"
autocomplete="username"
required
>
</div>
</div>
<div class="form-group">
<label class="form-label">密码</label>
<div class="input-wrapper">
<i class="bi bi-key-fill input-icon"></i>
<input
type="password"
id="password"
class="form-input"
placeholder="请输入密码"
autocomplete="current-password"
required
>
</div>
</div>
<button type="submit" id="loginBtn" class="btn-login">
<i class="bi bi-box-arrow-in-right"></i>
登录
</button>
</form>
</div>
<div class="footer">
<p>© 2025 百家号文章导出工具 | 仅供学习交流使用</p>
</div>
</div>
<script src="/static/js/jquery.min.js"></script>
<script>
// 检查jQuery是否加载
if (typeof jQuery === 'undefined') {
console.error('jQuery未加载请检查网络连接');
alert('jQuery加载失败请刷新页面或检查网络连接');
} else {
$(document).ready(function() {
// 登录表单提交
$('#loginForm').submit(function(e) {
e.preventDefault();
const username = $('#username').val().trim();
const password = $('#password').val().trim();
if (!username || !password) {
showAlert('请输入用户名和密码', 'error');
return;
}
// 禁用按钮
$('#loginBtn').prop('disabled', true).html('<i class="bi bi-hourglass-split"></i> 登录中...');
// 发送登录请求
$.ajax({
url: '/api/login',
type: 'POST',
contentType: 'application/json',
data: JSON.stringify({
username: username,
password: password
}),
success: function(response) {
if (response.success) {
showAlert('登录成功,正在跳转...', 'success');
setTimeout(function() {
window.location.href = '/';
}, 1000);
} else {
showAlert(response.message || '登录失败', 'error');
$('#loginBtn').prop('disabled', false).html('<i class="bi bi-box-arrow-in-right"></i> 登录');
}
},
error: function(xhr, status, error) {
let errorMessage = '登录失败,请稍后重试';
if (xhr.responseJSON && xhr.responseJSON.message) {
errorMessage = xhr.responseJSON.message;
}
showAlert(errorMessage, 'error');
$('#loginBtn').prop('disabled', false).html('<i class="bi bi-box-arrow-in-right"></i> 登录');
}
});
});
// 显示提示信息
function showAlert(message, type) {
const alertBox = $('#alertBox');
alertBox.removeClass('alert-success alert-error');
alertBox.addClass('alert-' + type);
alertBox.text(message);
alertBox.show();
// 3秒后自动隐藏
if (type === 'error') {
setTimeout(function() {
alertBox.fadeOut();
}, 3000);
}
}
// 回车登录
$('#username, #password').keypress(function(e) {
if (e.which === 13) {
$('#loginForm').submit();
}
});
});
}
</script>
</body>
</html>

1508
templates/queue.html Normal file

File diff suppressed because it is too large

526
test2.py Normal file

@@ -0,0 +1,526 @@
import json
import random
import time
from typing import Dict, Any, Optional
import logging
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
from fake_useragent import UserAgent
import requests
import re
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class BaiduBJHSpider:
def __init__(self, use_proxy: bool = False):
self.ua = UserAgent()
self.use_proxy = use_proxy
self.proxies = [] # 如果需要代理,这里填你的代理列表
self.session_cookie = None
self.session = requests.Session()
# 设置请求超时和重试
self.session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
self.session.mount('https://', requests.adapters.HTTPAdapter(max_retries=3))
def init_browser(self, timeout: int = 15000):
"""初始化浏览器环境获取Cookie"""
playwright = sync_playwright().start()
try:
# 配置浏览器参数
browser_args = [
'--disable-blink-features=AutomationControlled',
'--disable-web-security',
'--disable-features=IsolateOrigins,site-per-process',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-gpu',
]
# 启动浏览器
browser = playwright.chromium.launch(
headless=True, # 改为True无头模式更快
args=browser_args,
timeout=timeout
)
# 创建上下文
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent=self.ua.random,
locale='zh-CN',
timezone_id='Asia/Shanghai',
# 设置超时
navigation_timeout=timeout,
java_script_enabled=True,
bypass_csp=True
)
# 设置额外的HTTP头
context.set_extra_http_headers({
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
})
page = context.new_page()
# 1. 首先访问百度首页获取基础Cookie
logger.info("访问百度首页...")
try:
page.goto('https://www.baidu.com', wait_until='domcontentloaded', timeout=10000)
time.sleep(random.uniform(1, 2))
except PlaywrightTimeoutError:
logger.warning("百度首页加载超时,继续执行...")
# 2. 访问百家号页面
logger.info("访问百家号页面...")
try:
# 使用更宽松的等待条件
page.goto('https://baijiahao.baidu.com/',
wait_until='domcontentloaded', # 改为domcontentloaded更快
timeout=10000)
time.sleep(random.uniform(2, 3))
except PlaywrightTimeoutError:
logger.warning("百家号页面加载超时,尝试继续...")
# 即使超时也尝试获取Cookie
# 获取Cookie
cookies = context.cookies()
self.session_cookie = '; '.join([f"{c['name']}={c['value']}" for c in cookies])
# 将Cookie添加到requests session中
for cookie in cookies:
self.session.cookies.set(cookie['name'], cookie['value'])
if cookies:
logger.info(f"成功获取到 {len(cookies)} 个Cookie")
else:
logger.warning("未获取到Cookie")
browser.close()
return cookies
except Exception as e:
logger.error(f"初始化浏览器失败: {e}")
return None
finally:
playwright.stop()
def build_headers(self, referer: str = "https://baijiahao.baidu.com/") -> Dict:
"""构建请求头"""
headers = {
'User-Agent': self.ua.random,
'Accept': '*/*',
'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
'Accept-Encoding': 'gzip, deflate',
'Referer': referer,
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
}
if self.session_cookie:
headers['Cookie'] = self.session_cookie
return headers
def generate_callback_name(self) -> str:
"""生成随机的callback函数名"""
timestamp = int(time.time() * 1000)
return f"__jsonp{timestamp}"
def fetch_data_directly(self, uk: str = "ntHidnLhrlfclJar2z8wBg") -> Optional[Dict]:
"""直接请求接口(可能需要多次尝试)"""
# 先初始化浏览器获取Cookie
logger.info("初始化浏览器获取Cookie...")
cookies = self.init_browser()
if not cookies:
logger.warning("未能获取到Cookie尝试继续请求...")
for attempt in range(3): # 尝试3次
try:
callback_name = self.generate_callback_name()
timestamp = int(time.time() * 1000)
# 构建URL参数 - 使用更简单的参数
params = {
'tab': 'main',
'num': '10',
'uk': uk,
'source': 'pc',
'type': 'newhome',
'action': 'dynamic',
'format': 'jsonp',
'callback': callback_name,
'_': str(timestamp) # 时间戳参数
}
url = "https://mbd.baidu.com/webpage"
headers = self.build_headers()
logger.info(f"尝试第{attempt + 1}次请求...")
# 随机延迟
time.sleep(random.uniform(1, 2))
# 设置代理(如果需要)
proxies = None
if self.use_proxy and self.proxies:
proxy = random.choice(self.proxies)
proxies = {
'http': proxy,
'https': proxy
}
response = self.session.get(
url,
params=params,
headers=headers,
timeout=15, # 缩短超时时间
proxies=proxies
)
# 提取JSONP数据
text = response.text
if text.startswith(callback_name + '(') and text.endswith(')'):
json_str = text[len(callback_name) + 1:-1]
data = json.loads(json_str)
logger.info(f"成功获取JSON数据")
return data
else:
# 尝试直接解析为JSON可能是JSON格式
try:
data = json.loads(text)
logger.info("直接解析JSON成功")
return data
except json.JSONDecodeError:
pass
except requests.exceptions.Timeout:
logger.error(f"请求超时 (尝试{attempt + 1})")
except Exception as e:
logger.error(f"请求失败 (尝试{attempt + 1}): {e}")
# 等待后重试
if attempt < 2: # 如果不是最后一次尝试
time.sleep(random.uniform(2, 3))
return None
def fetch_via_browser(self, uk: str = "ntHidnLhrlfclJar2z8wBg", timeout: int = 15000) -> Optional[Dict]:
"""通过浏览器直接执行获取数据"""
playwright = sync_playwright().start()
try:
browser = playwright.chromium.launch(
headless=True, # 无头模式
args=[
'--disable-blink-features=AutomationControlled',
'--no-sandbox',
'--disable-dev-shm-usage'
],
timeout=timeout
)
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent=self.ua.random,
locale='zh-CN',
navigation_timeout=timeout
)
page = context.new_page()
# 监听网络请求
results = []
def handle_response(response):
url = response.url
if "mbd.baidu.com/webpage" in url and "format=jsonp" in url:
try:
# 获取响应文本
text = response.text()
logger.info(f"捕获到请求: {url}")
# 从URL提取callback名称
import urllib.parse
parsed_url = urllib.parse.urlparse(url)
query_params = urllib.parse.parse_qs(parsed_url.query)
if 'callback' in query_params:
callback = query_params['callback'][0]
if text.startswith(callback + '(') and text.endswith(')'):
json_str = text[len(callback) + 1:-1]
data = json.loads(json_str)
results.append(data)
logger.info("成功解析JSONP数据")
except Exception as e:
logger.debug(f"处理响应失败: {e}")
page.on("response", handle_response)
# 访问百家号页面
target_url = f"https://baijiahao.baidu.com/u?app_id={uk}"
logger.info(f"访问页面: {target_url}")
try:
page.goto(target_url, wait_until='domcontentloaded', timeout=10000)
time.sleep(random.uniform(2, 3))
# 简单滚动
page.evaluate("window.scrollBy(0, 500)")
time.sleep(1)
page.evaluate("window.scrollBy(0, 500)")
time.sleep(1)
# 等待数据加载
time.sleep(2)
except PlaywrightTimeoutError:
logger.warning("页面加载超时,继续处理已捕获的数据...")
browser.close()
if results:
logger.info(f"通过浏览器捕获到 {len(results)} 个结果")
return results[0]
except Exception as e:
logger.error(f"浏览器方式获取失败: {e}")
finally:
playwright.stop()
return None
def fetch_with_ajax(self, uk: str = "ntHidnLhrlfclJar2z8wBg") -> Optional[Dict]:
"""使用简化参数直接请求"""
try:
timestamp = int(time.time() * 1000)
# 使用更简单的参数
params = {
'action': 'dynamic',
'uk': uk,
'type': 'newhome',
'num': '10',
'format': 'json',
'_': str(timestamp)
}
url = "https://mbd.baidu.com/webpage"
headers = {
'User-Agent': self.ua.random,
'Referer': 'https://baijiahao.baidu.com/',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'X-Requested-With': 'XMLHttpRequest'
}
logger.info("尝试AJAX方式请求...")
response = self.session.get(
url,
params=params,
headers=headers,
timeout=10
)
logger.info(f"AJAX响应状态: {response.status_code}")
try:
data = json.loads(response.text)
logger.info("AJAX方式成功获取数据")
return data
except json.JSONDecodeError as e:
logger.error(f"JSON解析失败: {e}")
logger.info(f"响应内容: {response.text[:200]}")
return None
except Exception as e:
logger.error(f"AJAX方式失败: {e}")
return None
def fetch_all_methods(self, uk: str = "ntHidnLhrlfclJar2z8wBg") -> Optional[Dict]:
"""尝试所有方法获取数据"""
logger.info("=" * 50)
logger.info(f"开始获取百家号数据UK: {uk}")
logger.info("=" * 50)
# 方法1直接请求
logger.info("\n方法1直接请求接口...")
data = self.fetch_data_directly(uk)
if data and data.get("errno") == "0" and data.get("data", {}).get("list") is not None:
logger.info(f"✓ 方法1成功获取到 {len(data['data']['list'])} 条数据")
return data
else:
logger.info("✗ 方法1失败或数据为空")
# 方法2通过浏览器获取
logger.info("\n方法2浏览器模拟获取...")
data = self.fetch_via_browser(uk)
if data and data.get("errno") == "0" and data.get("data", {}).get("list") is not None:
logger.info(f"✓ 方法2成功获取到 {len(data['data']['list'])} 条数据")
return data
else:
logger.info("✗ 方法2失败或数据为空")
# 方法3AJAX请求
logger.info("\n方法3AJAX请求...")
data = self.fetch_with_ajax(uk)
if data and data.get("errno") == "0" and data.get("data", {}).get("list") is not None:
logger.info(f"✓ 方法3成功获取到 {len(data['data']['list'])} 条数据")
return data
else:
logger.info("✗ 方法3失败或数据为空")
# 方法4备用请求
logger.info("\n方法4尝试备用请求方式...")
data = self.try_backup_method(uk)
if data:
logger.info("✓ 方法4成功获取数据")
return data
else:
logger.error("所有方法都失败了")
return None
def try_backup_method(self, uk: str) -> Optional[Dict]:
"""备用方法尝试不同的URL和参数"""
backup_urls = [
"https://author.baidu.com/rest/2.0/ugc/dynamic",
"https://mbd.baidu.com/dynamic/api",
"https://baijiahao.baidu.com/builder/api"
]
for url in backup_urls:
try:
params = {
'action': 'list',
'uk': uk,
'page': '1',
'page_size': '10',
'_': str(int(time.time() * 1000))
}
headers = {
'User-Agent': self.ua.random,
'Referer': 'https://baijiahao.baidu.com/'
}
response = requests.get(url, params=params, headers=headers, timeout=10)
if response.status_code == 200:
try:
data = response.json()
if data:
logger.info(f"备用URL {url} 成功")
return data
except ValueError:
pass
except Exception as e:
logger.debug(f"备用URL {url} 失败: {e}")
return None
def display_simple_data(data):
"""简单展示数据"""
if not data or "data" not in data or "list" not in data["data"]:
print("没有有效的数据")
return
articles = data["data"]["list"]
print(f"\n获取到 {len(articles)} 篇文章:")
for idx, article in enumerate(articles[:10]): # 显示前10条
print(f"\n{'=' * 60}")
print(f"文章 {idx + 1}:")
item_data = article.get("itemData", {})
# 标题
title = item_data.get("title", "无标题")
# 清理标题中的换行符
title = title.replace('\n', ' ').strip()
if not title or title == "无标题":
# 尝试获取origin_title
title = item_data.get("origin_title", "无标题").replace('\n', ' ').strip()
print(f"标题: {title[:100]}{'...' if len(title) > 100 else ''}")
# 作者
display_info = item_data.get("displaytype_exinfo", "")
author = "未知作者"
if display_info:
try:
info = json.loads(display_info)
author = info.get("name", info.get("display_name", "未知作者"))
except (json.JSONDecodeError, TypeError):
# 尝试正则匹配
name_match = re.search(r'"name":"([^"]+)"', display_info)
if name_match:
author = name_match.group(1)
print(f"作者: {author}")
# 发布时间
time_str = item_data.get("time", item_data.get("cst_time", "未知时间"))
print(f"发布时间: {time_str}")
# 文章ID
thread_id = item_data.get("thread_id", article.get("thread_id", "未知"))
print(f"文章ID: {thread_id}")
# 图片信息
img_src = item_data.get("imgSrc", [])
if img_src:
print(f"包含图片: {len(img_src)}")
# 标签/话题
targets = item_data.get("target", [])
if targets:
tags = [t.get("key", "") for t in targets if t.get("key")]
if tags:
print(f"标签: {', '.join(tags)}")
def main():
"""主函数"""
spider = BaiduBJHSpider()
# 获取数据
data = spider.fetch_all_methods()
if data:
# 保存完整数据到文件
filename = f'baijiahao_data_{int(time.time())}.json'
with open(filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
logger.info(f"完整数据已保存到 {filename}")
# 简单展示数据
display_simple_data(data)
else:
print("未能获取到数据,建议:")
print("1. 检查网络连接")
print("2. 尝试使用代理")
print("3. 等待一段时间后重试")
print("4. 检查目标页面是否可正常访问")
if __name__ == "__main__":
# 设置更详细的日志
logging.getLogger("playwright").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
main()
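The JSONP handling in `fetch_data_directly` and `handle_response` above strips a `callback(...)` wrapper and falls back to parsing the body as plain JSON. That parsing step can be isolated into a small helper (a sketch; `unwrap_jsonp` is not a function in this repo):

```python
import json

def unwrap_jsonp(text: str, callback: str):
    """Parse a JSONP response like `cb({...})`; fall back to plain JSON.

    Returns the decoded object, or None if neither form parses.
    """
    text = text.strip()
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        return json.loads(text[len(prefix):-1])
    try:
        # Some endpoints return bare JSON even when format=jsonp was requested.
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```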

90
test_database.py Normal file

@@ -0,0 +1,90 @@
# -*- coding: utf-8 -*-
"""
测试 SQLite 数据库迁移和功能
"""
import os
import sys
from database import get_database, migrate_from_json
from task_queue import get_task_queue, TaskStatus
def test_database():
"""测试数据库功能"""
print("=" * 60)
print("开始测试 SQLite 数据库功能")
print("=" * 60)
# 1. 测试数据库初始化
print("\n1. 测试数据库初始化...")
db = get_database()
print(f"✓ 数据库初始化成功: {db.db_path}")
# 2. 测试从 JSON 迁移数据
print("\n2. 测试数据迁移...")
json_file = "data/task_queue.json"
if os.path.exists(json_file):
count = migrate_from_json(json_file)
print(f"✓ 迁移了 {count} 个任务")
else:
print("! 未找到旧 JSON 文件,跳过迁移")
# 3. 测试任务队列功能
print("\n3. 测试任务队列功能...")
queue = get_task_queue()
# 3.1 添加测试任务
print("\n3.1 添加测试任务...")
task_id = queue.add_task(
url="https://baijiahao.baidu.com/u?app_id=test123",
months=3,
use_proxy=True,
username="test_user"
)
print(f"✓ 添加任务成功: {task_id}")
# 3.2 获取任务
print("\n3.2 获取任务...")
task = queue.get_task(task_id)
if task:
print(f"✓ 获取任务成功:")
print(f" - URL: {task['url']}")
print(f" - 状态: {task['status']}")
print(f" - 创建时间: {task['created_at']}")
# 3.3 更新任务状态
print("\n3.3 更新任务状态...")
queue.update_task_status(task_id, TaskStatus.PROCESSING)
task = queue.get_task(task_id)
print(f"✓ 更新状态成功: {task['status']}")
# 3.4 更新任务进度
print("\n3.4 更新任务进度...")
queue.update_task_progress(task_id, 50, "正在处理中...", 25)
task = queue.get_task(task_id)
print(f"✓ 更新进度成功: {task['progress']}%")
# 3.5 获取队列统计
print("\n3.5 获取队列统计...")
stats = queue.get_queue_stats()
print(f"✓ 队列统计:")
print(f" - 总任务数: {stats['total']}")
print(f" - 等待中: {stats['pending']}")
print(f" - 处理中: {stats['processing']}")
print(f" - 已完成: {stats['completed']}")
print(f" - 失败: {stats['failed']}")
# 3.6 获取所有任务
print("\n3.6 获取所有任务...")
all_tasks = queue.get_all_tasks()
print(f"✓ 获取所有任务成功,共 {len(all_tasks)} 个任务")
# 3.7 删除测试任务
print("\n3.7 删除测试任务...")
queue.delete_task(task_id)
print(f"✓ 删除任务成功: {task_id}")
print("\n" + "=" * 60)
print("所有测试通过SQLite 数据库运行正常")
print("=" * 60)
if __name__ == "__main__":
test_database()

23
test_html.py Normal file

@@ -0,0 +1,23 @@
from app import BaijiahaoScraper
app_id = "1700253559210167"
print(f"测试app_id: {app_id}\n")
uk, cookies = BaijiahaoScraper.get_uk_from_app_id(app_id)
print(f"UK: {uk}\n")
scraper = BaijiahaoScraper(uk, cookies)
# 测试HTML解析方式
print("使用HTML解析方式:")
articles = scraper.get_articles_from_html(app_id=app_id)
if articles:
print(f"\n成功! 获取到 {len(articles)} 篇文章")
print("\n前3篇:")
for i, article in enumerate(articles[:3], 1):
print(f"{i}. {article['标题']}")
print(f" {article['链接'][:80]}...")
else:
print("未获取到文章")

0
test_selenium.py Normal file