2026-02-04 14:55:47 +08:00

# AI Image Deduplication Review System

An image similarity detection system built on DashScope multimodal embeddings and the DashVector vector database.
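
## Features

- Generates image vectors with DashScope multimodal embeddings
- Runs efficient vector similarity search with DashVector
- Supports pHash (perceptual hash) pre-screening
- Downloads and processes images asynchronously in batches
- Automatically flags duplicate images and records their similarity scores
- Runs in daemon mode; when no data is pending, it waits 2 seconds before checking again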
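
The pHash pre-screening feature can be pictured as a Hamming-distance comparison between perceptual hashes. The sketch below is an illustration only, not the project's code: the function names are hypothetical, the hashes are assumed to be precomputed 64-bit integers, and the default of 5 mirrors `phash_threshold` from the configuration. Whether the project treats the threshold as inclusive, and how it combines this check with the vector search, is not stated in this README.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_phash_match(hash_a: int, hash_b: int, threshold: int = 5) -> bool:
    """Assumed rule: at most `threshold` differing bits counts as a likely duplicate."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Hypothetical 64-bit hashes; real values would come from a pHash implementation.
original = 0xF0F0F0F0F0F0F0F0
lightly_edited = original ^ 0b111  # differs in 3 bits
unrelated = original ^ 0xFFFF      # differs in 16 bits

print(is_phash_match(original, lightly_edited))
print(is_phash_match(original, unrelated))
```

A cheap bitwise check like this is typically used to short-circuit obvious near-duplicates before paying for an embedding call.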

## Requirements

- Python 3.8+
- MySQL database
- DashScope API key
- DashVector API key

## Installation

```bash
pip install -r requirements.txt
```

## Configuration

Create a `config.ini` configuration file:

```ini
[database]
host = localhost
port = 3306
user = root
password = your_password
database = your_database
charset = utf8mb4

[dashscope]
api_key = your_dashscope_api_key

[dashvector]
api_key = your_dashvector_api_key
endpoint = your_endpoint
collection_name = image_vectors
vector_dimension = 1024

[image]
cdn_base = https://your-cdn.com/

[similarity]
phash_threshold = 5
vector_threshold = 0.94

[process]
batch_size = 100
concurrent_downloads = 10
log_level = INFO
log_file = image_similarity.log
```
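
For reference, a file in this format can be read with Python's standard `configparser` module. The snippet below is a minimal sketch, not the project's loader: `load_settings` and the shape of the returned dictionary are hypothetical, and an inline sample string stands in for `config.ini` so the example runs on its own.

```python
import configparser

# Inline excerpt of the configuration above, so the sketch is self-contained.
SAMPLE = """
[dashvector]
collection_name = image_vectors
vector_dimension = 1024

[similarity]
phash_threshold = 5
vector_threshold = 0.94

[process]
batch_size = 100
concurrent_downloads = 10
"""


def load_settings(parser: configparser.ConfigParser) -> dict:
    """Convert the string values into typed settings (hypothetical helper)."""
    return {
        "collection_name": parser.get("dashvector", "collection_name"),
        "vector_dimension": parser.getint("dashvector", "vector_dimension"),
        "phash_threshold": parser.getint("similarity", "phash_threshold"),
        "vector_threshold": parser.getfloat("similarity", "vector_threshold"),
        "batch_size": parser.getint("process", "batch_size"),
        "concurrent_downloads": parser.getint("process", "concurrent_downloads"),
    }


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)  # for a real file: parser.read("config.ini", encoding="utf-8")
settings = load_settings(parser)
print(settings)
```

One caveat with the default parser: a literal `%` in a value (for example in a database password) must be written as `%%`, or the parser can be created with `interpolation=None`.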

## Usage

```bash
# Process new images (status='draft', similarity='draft')
python image_similarity_check.py

# Reprocess failed images (status='draft', similarity='recalc')
python image_similarity_recalc.py

# View the statistics report
python stats_similarity.py
```
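
According to the features list, processing runs in daemon mode and waits 2 seconds when nothing is pending. A polling loop of that kind might look roughly like the sketch below. It is an assumption-laden illustration: `run_daemon`, `fetch_batch`, `process_batch`, and `max_idle_polls` are hypothetical names (the last exists only so the example terminates), and the actual structure of `image_similarity_check.py` is not shown in this README.

```python
import time
from typing import Callable, List, Optional


def run_daemon(
    fetch_batch: Callable[[], List[dict]],
    process_batch: Callable[[List[dict]], None],
    idle_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    max_idle_polls: Optional[int] = None,
) -> int:
    """Poll for pending rows and sleep when idle; return the number of rows processed.

    The loop stops after `max_idle_polls` consecutive empty polls, or never if None.
    """
    processed = 0
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        batch = fetch_batch()
        if not batch:
            idle_polls += 1
            sleep(idle_seconds)  # nothing to do: wait before checking again
            continue
        idle_polls = 0
        process_batch(batch)
        processed += len(batch)
    return processed


# Demo with in-memory stand-ins for the database and the sleep call.
pending = [[{"id": 1}, {"id": 2}], [], [{"id": 3}]]
naps: List[float] = []
total = run_daemon(
    fetch_batch=lambda: pending.pop(0) if pending else [],
    process_batch=lambda rows: None,
    sleep=naps.append,
    max_idle_polls=2,
)
print(total, naps)
```

Injecting `sleep` and the two callbacks keeps the loop testable without a database or real delays.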

## Server Deployment

```bash
# Start the service
./start_similarity.sh start

# Stop the service
./start_similarity.sh stop

# Force stop
./start_similarity.sh force-stop

# Restart
./start_similarity.sh restart

# Check process status
./start_similarity.sh status

# View the statistics report
./start_similarity.sh stats

# View logs
./start_similarity.sh logs

# Follow logs in real time
./start_similarity.sh logs-follow
```

## Project Structure

```
├── image_similarity_check.py   # Main program: processes new images
├── image_similarity_recalc.py  # Recalculation program: reprocesses failed images
├── stats_similarity.py         # Statistics script: shows processing results
├── start_similarity.sh         # Deployment script: starts and stops the service
├── query_status.py             # Queries processing status
├── reset_data.py               # Resets data
├── reset_vector.py             # Resets the vector store
├── config.ini                  # Configuration file
└── requirements.txt            # Dependencies
```

## Workflow

1. Fetch pending images from the database (`status='draft'`, `similarity='draft'`)
2. Build the CDN URL: `cdn_base + image_url`
3. Call the DashScope API to obtain a 1024-dimensional vector
4. Search DashVector for the `topk=3` most similar images
5. Compute the similarity: `similarity = 1.0 - score`
6. Decide the outcome:
   - `similarity >= 0.94` → mark as duplicate (`status='similarity'`)
   - `similarity < 0.94` → mark as not a duplicate (`status='tag_extension'`) and store the vector
   - Processing failed → mark for recalculation (`similarity='recalc'`)
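
Steps 2, 5, and 6 above can be sketched as plain functions. This is a rough illustration under stated assumptions, not the project's implementation: all function names and the result dictionaries are hypothetical, `embed` and `search` are placeholder callables standing in for the DashScope and DashVector calls, `score` is assumed to be a distance-like value so that `1.0 - score` yields a similarity, and the best of the top-k matches is assumed to decide the outcome.

```python
from typing import Sequence

VECTOR_THRESHOLD = 0.94  # mirrors vector_threshold in config.ini


def build_image_url(cdn_base: str, image_url: str) -> str:
    """Step 2: plain concatenation, as written in the workflow."""
    return cdn_base + image_url


def classify(scores: Sequence[float], threshold: float = VECTOR_THRESHOLD) -> dict:
    """Steps 5-6: convert scores to similarities and decide on the best match."""
    # Assumption: an empty result (nothing indexed yet) counts as "not a duplicate".
    best = max((1.0 - score for score in scores), default=0.0)
    if best >= threshold:
        return {"status": "similarity", "best_similarity": best, "store_vector": False}
    return {"status": "tag_extension", "best_similarity": best, "store_vector": True}


def process_image(image_url, cdn_base, embed, search, topk=3):
    """Run steps 2-6 for one image; any failure marks the row for recalculation."""
    try:
        vector = embed(build_image_url(cdn_base, image_url))
        return classify(search(vector, topk))
    except Exception:
        return {"status": "draft", "similarity": "recalc"}


result = process_image(
    "photos/cat.jpg",
    "https://your-cdn.com/",
    embed=lambda url: [0.0] * 1024,                   # stand-in for the DashScope call
    search=lambda vector, topk: [0.02, 0.31, 0.40],   # stand-in for the DashVector query
)
print(result["status"])
```

Passing the embedding and search steps in as callables keeps the decision logic easy to test separately from the network calls.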