diff --git a/backend/README.md b/backend/README.md deleted file mode 100644 index 9eadc79..0000000 --- a/backend/README.md +++ /dev/null @@ -1,73 +0,0 @@ -# 微信公众号文章爬取工具(Go版本) - -这是一个基于Go语言开发的微信公众号文章爬取工具,可以自动获取指定公众号的所有文章列表和详细内容。 - -## 功能特性 - -- 获取公众号所有文章列表 -- 获取每篇文章的详细内容 -- 获取文章的阅读量、点赞数、转发数等统计信息 -- 支持获取文章评论 -- 自动保存文章列表和详细内容 - -## 环境要求 - -- Go 1.20 或更高版本 -- Windows 操作系统(脚本已针对Windows优化) - -## 安装使用 - -### 1. 配置Cookie - -- 将 `cookie.txt.example` 重命名为 `cookie.txt` -- 按照文件中的说明获取微信公众平台的Cookie -- 将Cookie信息粘贴到 `cookie.txt` 文件中 - -### 2. 运行程序 - -直接双击 `run.bat` 脚本文件,程序会自动: -- 下载所需依赖 -- 编译Go程序 -- 运行爬取工具 - -## 项目结构 - -``` -backend/ -├── cmd/ -│ └── main.go # 主程序入口 -├── configs/ -│ └── config.go # 配置管理 -├── pkg/ -│ ├── utils/ # 工具函数 -│ │ └── utils.go -│ └── wechat/ # 微信相关功能实现 -│ └── access_articles.go -├── data/ # 数据存储目录 -├── cookie.txt # Cookie文件(需要手动创建) -├── go.mod # Go模块定义 -├── run.bat # Windows启动脚本 -└── README.md # 使用说明 -``` - -## 注意事项 - -1. 使用本工具前,请确保您已获得相关公众号的访问权限 -2. 请遵守相关法律法规,合理使用本工具 -3. 频繁请求可能会触发微信的反爬虫机制,请控制爬取频率 -4. 由于微信接口可能会变化,工具可能需要相应调整 - -## 常见问题 - -### Q: 获取Cookie失败怎么办? -A: 请确保您已登录微信公众平台,并且在开发者工具中正确复制了完整的Cookie信息。 - -### Q: 爬取过程中出现网络错误怎么办? -A: 工具会自动处理简单的网络错误,请确保网络连接正常。如果持续失败,可能是微信接口发生了变化。 - -### Q: 如何修改爬取的公众号? -A: 工具会自动从Cookie中获取当前登录用户可访问的公众号信息。如果需要爬取不同的公众号,请在微信公众平台中切换账号后重新获取Cookie。 - -## 许可证 - -本项目仅供学习和研究使用。 \ No newline at end of file diff --git a/backend/api/API接口文档.md b/backend/api/API接口文档.md new file mode 100644 index 0000000..46ccfb6 --- /dev/null +++ b/backend/api/API接口文档.md @@ -0,0 +1,460 @@ +# 📡 微信公众号文章爬虫 - API 接口文档 + +## 服务器信息 + +- **服务地址**: http://localhost:8080 +- **协议**: HTTP/1.1 +- **数据格式**: JSON +- **字符编码**: UTF-8 +- **CORS**: 已启用(允许所有来源) + +## 统一响应格式 + +所有API接口返回格式统一为: + +```json +{ + "success": true, // 请求是否成功 + "message": "操作成功", // 提示信息 + "data": {} // 数据内容(可选) +} +``` + +## 接口列表 + +### 1. 
提取公众号主页 + +**接口地址**: `/api/homepage/extract` +**请求方法**: POST +**功能说明**: 从文章链接中提取公众号主页链接 + +#### 请求参数 + +```json +{ + "url": "https://mp.weixin.qq.com/s?__biz=xxx&mid=xxx" +} +``` + +| 参数 | 类型 | 必填 | 说明 | +|------|------|------|------| +| url | string | 是 | 公众号文章链接 | + +#### 响应示例 + +**成功响应**: +```json +{ + "success": true, + "message": "提取成功", + "data": { + "homepage": "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=xxx&scene=124", + "output": "完整的命令行输出信息" + } +} +``` + +**失败响应**: +```json +{ + "success": false, + "message": "未能提取到主页链接" +} +``` + +#### 调用示例 + +**jQuery**: +```javascript +$.ajax({ + url: 'http://localhost:8080/api/homepage/extract', + method: 'POST', + contentType: 'application/json', + data: JSON.stringify({ + url: 'https://mp.weixin.qq.com/s?__biz=xxx&mid=xxx' + }), + success: function(response) { + if (response.success) { + console.log('主页链接:', response.data.homepage); + } + } +}); +``` + +**curl**: +```bash +curl -X POST http://localhost:8080/api/homepage/extract \ + -H "Content-Type: application/json" \ + -d '{"url":"https://mp.weixin.qq.com/s?__biz=xxx&mid=xxx"}' +``` + +--- + +### 2. 下载单篇文章 + +**接口地址**: `/api/article/download` +**请求方法**: POST +**功能说明**: 下载指定的单篇文章 + +#### 请求参数 + +```json +{ + "url": "https://mp.weixin.qq.com/s?__biz=xxx", + "save_image": true, + "save_content": true +} +``` + +| 参数 | 类型 | 必填 | 说明 | +|------|------|------|------| +| url | string | 是 | 文章链接 | +| save_image | boolean | 否 | 是否保存图片(默认false) | +| save_content | boolean | 否 | 是否保存内容(默认true) | + +#### 响应示例 + +```json +{ + "success": true, + "message": "下载任务已启动", + "data": { + "url": "https://mp.weixin.qq.com/s?__biz=xxx" + } +} +``` + +--- + +### 3. 
获取文章列表
+
+**接口地址**: `/api/article/list`
+**请求方法**: POST
+**功能说明**: 批量获取公众号的文章列表(同步执行,任务完成后才返回响应)
+
+#### 请求参数
+
+```json
+{
+  "access_token": "https://mp.weixin.qq.com/mp/profile_ext?action=xxx&appmsg_token=xxx",
+  "pages": 0
+}
+```
+
+| 参数 | 类型 | 必填 | 说明 |
+|------|------|------|------|
+| access_token | string | 是 | 包含appmsg_token的URL |
+| pages | integer | 否 | 获取页数,0表示全部(默认0) |
+
+#### 响应示例
+
+```json
+{
+  "success": true,
+  "message": "文章列表获取成功",
+  "data": {
+    "account": "公众号名称",
+    "filename": "文章列表(article_list)_直连链接.xlsx",
+    "download": "/api/download/公众号名称/文章列表(article_list)_直连链接.xlsx"
+  }
+}
+```
+
+---
+
+### 4. 批量下载文章
+
+**接口地址**: `/api/article/batch`
+**请求方法**: POST
+**功能说明**: 批量下载公众号的所有文章(同步执行,任务完成后才返回响应)
+
+#### 请求参数
+
+```json
+{
+  "official_account": "公众号名称或文章链接",
+  "save_image": true,
+  "save_content": true
+}
+```
+
+| 参数 | 类型 | 必填 | 说明 |
+|------|------|------|------|
+| official_account | string | 是 | 公众号名称或任意文章链接 |
+| save_image | boolean | 否 | 是否保存图片(默认false) |
+| save_content | boolean | 否 | 是否保存内容(默认true) |
+
+#### 响应示例
+
+```json
+{
+  "success": true,
+  "message": "批量下载完成,共下载 125 篇文章",
+  "data": {
+    "account": "公众号名称",
+    "articleCount": 125,
+    "path": "../data/公众号名称/文章详细"
+  }
+}
+```
+
+---
+
+### 5. 获取数据列表
+
+**接口地址**: `/api/data/list`
+**请求方法**: GET
+**功能说明**: 获取已下载的公众号数据列表
+
+#### 请求参数
+
+无
+
+#### 响应示例
+
+```json
+{
+  "success": true,
+  "data": [
+    {
+      "name": "研招网资讯",
+      "articleCount": 125,
+      "path": "D:\\workspace\\Access_wechat_article\\backend\\data\\研招网资讯",
+      "lastUpdate": "2025-11-27"
+    }
+  ]
+}
+```
+
+| 字段 | 类型 | 说明 |
+|------|------|------|
+| name | string | 公众号名称 |
+| articleCount | integer | 文章数量 |
+| path | string | 存储路径 |
+| lastUpdate | string | 最后更新时间 |
+
+#### 调用示例
+
+**jQuery**:
+```javascript
+$.get('http://localhost:8080/api/data/list', function(response) {
+  if (response.success) {
+    console.log('数据列表:', response.data);
+  }
+});
+```
+
+**curl**:
+```bash
+curl http://localhost:8080/api/data/list
+```
+
+---
+
+### 6. 获取任务状态
+
+**接口地址**: `/api/task/status`
+**请求方法**: GET
+**功能说明**: 获取当前任务的执行状态
+
+#### 请求参数
+
+无
+
+#### 响应示例
+
+**任务运行中**:
+```json
+{
+  "success": true,
+  "data": {
+    "running": true,
+    "progress": 45,
+    "message": "正在下载第10篇文章..."
+  }
+}
+```
+
+**无任务运行**:
+```json
+{
+  "success": true,
+  "data": {
+    "running": false,
+    "progress": 0,
+    "message": ""
+  }
+}
+```
+
+| 字段 | 类型 | 说明 |
+|------|------|------|
+| running | boolean | 是否有任务运行中 |
+| progress | integer | 任务进度(0-100) |
+| message | string | 任务状态描述 |
+| error | string | 错误信息(可选) |
+
+---
+
+## 错误码说明
+
+### HTTP状态码
+
+| 状态码 | 说明 |
+|--------|------|
+| 200 | 请求成功 |
+| 400 | 请求参数错误 |
+| 500 | 服务器内部错误 |
+
+### 业务错误码
+
+所有业务错误通过响应中的 `success` 字段和 `message` 字段返回:
+
+```json
+{
+  "success": false,
+  "message": "具体的错误信息"
+}
+```
+
+常见错误信息:
+
+| 错误信息 | 说明 | 解决方法 |
+|----------|------|----------|
+| 请求参数错误 | JSON格式不正确或缺少必填参数 | 检查请求参数格式 |
+| 执行失败 | 后端程序执行出错 | 查看详细错误信息 |
+| 未能提取到主页链接 | 文章链接格式错误或解析失败 | 使用有效的文章链接 |
+| 读取数据目录失败 | data目录不存在或无权限 | 检查目录权限 |
+
+---
+
+## 开发指南
+
+### 本地测试
+
+1. **启动API服务器**:
+```bash
+cd backend\api
+start_api.bat
+```
+
+2. **测试接口**:
+```bash
+# 测试提取主页
+curl -X POST http://localhost:8080/api/homepage/extract \
+  -H "Content-Type: application/json" \
+  -d "{\"url\":\"文章链接\"}"
+
+# 测试获取数据列表
+curl http://localhost:8080/api/data/list
+```
+
+### 跨域配置
+
+API服务器已启用CORS,允许所有来源访问:
+
+```go
+w.Header().Set("Access-Control-Allow-Origin", "*")
+w.Header().Set("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
+w.Header().Set("Access-Control-Allow-Headers", "Content-Type")
+```
+
+如需限制特定域名,修改 `server.go` 中的 `corsMiddleware` 函数。
+
+### 超时设置
+
+当前实现直接使用 `http.ListenAndServe`,未设置任何超时(Go 标准库默认不限制读写超时)。
+
+如需设置超时,在 `server.go` 中改用自定义 `http.Server`:
+
+```go
+server := &http.Server{
+    Addr:         ":8080",
+    ReadTimeout:  30 * time.Second,
+    WriteTimeout: 30 * time.Second,
+}
+server.ListenAndServe()
+```
+
+### 日志记录
+
+API服务器使用标准输出记录日志,例如:
+
+```go
+log.Printf("[%s] %s - %s", r.Method, r.URL.Path, message)
+```
+
+---
+
+## 接口更新计划
+
+### v1.1.0(计划中)
+- [ ] 添加用户认证机制
+- [ ] 支持任务队列管理
+- [ ] 增加下载进度推送(WebSocket)
+- [ ] 提供文章搜索接口
+
+### v1.2.0(计划中)
+- [ ] 数据统计分析接口
+- [ ] 导出功能(PDF/Word)
+- [ ] 批量任务管理
+- [ ] 定时任务支持
+
+---
+
+## 技术栈
+
+- **语言**: Go 1.20+
+- **Web框架**: net/http (标准库)
+- **数据格式**: JSON
+- **并发模型**: Goroutine
+
+---
+
+## 
性能说明 + +### 并发能力 +- 支持多客户端同时访问 +- 但同一时间只能执行一个爬虫任务(`currentTask`) + +### 资源占用 +- CPU: 低(主要I/O操作) +- 内存: <50MB +- 磁盘: 取决于下载的文章数量 + +### 性能优化建议 +1. 使用连接池管理HTTP请求 +2. 实现任务队列机制 +3. 添加结果缓存 +4. 启用gzip压缩 + +--- + +## 安全建议 + +### 1. 生产环境部署 +- 添加HTTPS支持 +- 实现API认证(JWT/OAuth) +- 限制跨域来源 +- 添加请求频率限制 + +### 2. 数据安全 +- 不要暴露敏感信息(Cookie) +- 定期清理临时文件 +- 备份重要数据 + +### 3. 访问控制 +- 添加IP白名单 +- 实现用户权限管理 +- 记录操作日志 + +--- + +## 常见问题 + +### Q1: 为什么任务启动后没有响应? +A: 检查后端 `wechat-crawler.exe` 是否存在并有执行权限。 + +### Q2: 如何查看详细的错误信息? +A: 查看API服务器窗口的控制台输出。 + +### Q3: 能同时执行多个下载任务吗? +A: 当前版本不支持,同时只能执行一个任务。 + +### Q4: 如何停止正在运行的任务? +A: 关闭API服务器窗口或重启服务器。 + +--- + +**文档版本**: v1.0.0 +**最后更新**: 2025-11-27 +**维护者**: AI Assistant diff --git a/backend/api/build.bat b/backend/api/build.bat new file mode 100644 index 0000000..d07a9ee --- /dev/null +++ b/backend/api/build.bat @@ -0,0 +1,26 @@ +@echo off +chcp 65001 >nul +echo =============================================== +echo 📦 编译 API 服务器 +echo =============================================== +echo. + +echo 🔨 正在编译 api_server.exe... +go build -o api_server.exe server.go + +if %errorlevel% neq 0 ( + echo. + echo ❌ 编译失败! + echo. + pause + exit /b 1 +) + +echo. +echo ✅ 编译成功! +echo 📁 输出文件: api_server.exe +echo. 
+echo =============================================== +echo 编译完成 +echo =============================================== +pause diff --git a/backend/api/server.go b/backend/api/server.go new file mode 100644 index 0000000..9bb8836 --- /dev/null +++ b/backend/api/server.go @@ -0,0 +1,543 @@ +package main + +import ( + "encoding/json" + "fmt" + "log" + "net/http" + "os" + "os/exec" + "path/filepath" + "strings" + "time" +) + +// Response 统一响应结构 +type Response struct { + Success bool `json:"success"` + Message string `json:"message"` + Data interface{} `json:"data,omitempty"` +} + +// 任务状态 +type TaskStatus struct { + Running bool `json:"running"` + Progress int `json:"progress"` + Message string `json:"message"` + Error string `json:"error,omitempty"` +} + +var currentTask = &TaskStatus{Running: false} + +func main() { + // 启用CORS + http.HandleFunc("/", corsMiddleware(handleRoot)) + http.HandleFunc("/api/homepage/extract", corsMiddleware(extractHomepageHandler)) + http.HandleFunc("/api/article/download", corsMiddleware(downloadArticleHandler)) + http.HandleFunc("/api/article/list", corsMiddleware(getArticleListHandler)) + http.HandleFunc("/api/article/batch", corsMiddleware(batchDownloadHandler)) + http.HandleFunc("/api/data/list", corsMiddleware(getDataListHandler)) + http.HandleFunc("/api/task/status", corsMiddleware(getTaskStatusHandler)) + http.HandleFunc("/api/download/", corsMiddleware(downloadFileHandler)) + + port := ":8080" + fmt.Println("===============================================") + fmt.Println(" 🚀 微信公众号文章爬虫 API 服务器") + fmt.Println("===============================================") + fmt.Printf("🌐 服务地址: http://localhost%s\n", port) + fmt.Printf("⏰ 启动时间: %s\n", time.Now().Format("2006-01-02 15:04:05")) + fmt.Println("===============================================\n") + + if err := http.ListenAndServe(port, nil); err != nil { + log.Fatal("服务器启动失败:", err) + } +} + +// CORS中间件 +func corsMiddleware(next http.HandlerFunc) http.HandlerFunc { + return func(w 
http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Access-Control-Allow-Origin", "*")
+		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
+		w.Header().Set("Access-Control-Allow-Headers", "Content-Type")
+
+		if r.Method == "OPTIONS" {
+			w.WriteHeader(http.StatusOK)
+			return
+		}
+
+		next(w, r)
+	}
+}
+
+// 首页处理
+func handleRoot(w http.ResponseWriter, r *http.Request) {
+	w.Header().Set("Content-Type", "text/html; charset=utf-8")
+	html := `<!DOCTYPE html>
+<html>
+<head>
+	<meta charset="utf-8">
+	<title>微信公众号文章爬虫 API</title>
+</head>
+<body>
+	<h1>🚀 微信公众号文章爬虫 API 服务器</h1>
+	<p>当前时间: ` + time.Now().Format("2006-01-02 15:04:05") + `</p>
+	<h2>可用接口:</h2>
+	<div>POST /api/homepage/extract - 提取公众号主页</div>
+	<div>POST /api/article/download - 下载单篇文章</div>
+	<div>POST /api/article/list - 获取文章列表</div>
+	<div>POST /api/article/batch - 批量下载文章</div>
+	<div>GET /api/data/list - 获取数据列表</div>
+	<div>GET /api/task/status - 获取任务状态</div>
+</body>
+</html>`
+	w.Write([]byte(html))
+}
+
+// 提取公众号主页
+func extractHomepageHandler(w http.ResponseWriter, r *http.Request) {
+	var req struct {
+		URL string `json:"url"`
+	}
+
+	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+		writeJSON(w, Response{Success: false, Message: "请求参数错误"})
+		return
+	}
+
+	// 执行命令(使用绝对路径)
+	exePath := filepath.Join("..", "wechat-crawler.exe")
+	absPath, _ := filepath.Abs(exePath)
+	log.Printf("尝试执行: %s", absPath)
+
+	cmd := exec.Command(absPath, req.URL)
+	workDir, _ := filepath.Abs("..")
+	cmd.Dir = workDir
+	output, err := cmd.CombinedOutput()
+
+	if err != nil {
+		log.Printf("执行失败: %v, 输出: %s", err, string(output))
+		writeJSON(w, Response{Success: false, Message: "执行失败: " + string(output)})
+		return
+	}
+
+	// 从输出中提取公众号主页链接
+	outputStr := string(output)
+	lines := strings.Split(outputStr, "\n")
+	var homepageURL string
+
+	for _, line := range lines {
+		if strings.Contains(line, "公众号主页链接") || strings.Contains(line, "https://mp.weixin.qq.com/mp/profile_ext") {
+			// 提取URL
+			if idx := strings.Index(line, "https://"); idx != -1 {
+				homepageURL = strings.TrimSpace(line[idx:])
+				break
+			}
+		}
+	}
+
+	if homepageURL == "" {
+		writeJSON(w, Response{Success: false, Message: "未能提取到主页链接"})
+		return
+	}
+
+	writeJSON(w, Response{
+		Success: true,
+		Message: "提取成功",
+		Data: map[string]string{
+			"homepage": homepageURL,
+			"output":   outputStr,
+		},
+	})
+}
+
+// 下载单篇文章(这里需要实现具体逻辑)
+func downloadArticleHandler(w http.ResponseWriter, r *http.Request) {
+	var req struct {
+		URL         string `json:"url"`
+		SaveImage   bool   `json:"save_image"`
+		SaveContent bool   `json:"save_content"`
+	}
+
+	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+		writeJSON(w, Response{Success: false, Message: "请求参数错误"})
+		return
+	}
+
+	currentTask.Running = true
+	currentTask.Progress = 0
+	currentTask.Message = "正在下载文章..."
+
+	// 注意:这里需要实际调用爬虫的下载功能
+	// 由于当前后端程序没有单独的下载单篇文章的命令行接口
+	// 需要后续实现或使用其他方式
+	// 在下载逻辑接入前先复位任务状态,避免 /api/task/status 一直显示"运行中"
+	currentTask.Running = false
+
+	writeJSON(w, Response{
+		Success: true,
+		Message: "下载任务已启动",
+		Data: map[string]interface{}{
+			"url": req.URL,
+		},
+	})
+}
+
+// 获取文章列表
+func getArticleListHandler(w http.ResponseWriter, r *http.Request) {
+	var req struct {
+		AccessToken string `json:"access_token"`
+		Pages       int    `json:"pages"`
+	}
+
+	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+		writeJSON(w, Response{Success: false, Message: "请求参数错误"})
+		return
+	}
+
+	currentTask.Running = true
+	currentTask.Progress = 0
+	currentTask.Message = "正在获取文章列表..."
+
+	// 同步执行爬虫程序(功能3)
+	exePath := filepath.Join("..", "wechat-crawler.exe")
+	absPath, _ := filepath.Abs(exePath)
+	workDir, _ := filepath.Abs("..")
+
+	log.Printf("启动功能3: %s, 工作目录: %s", absPath, workDir)
+	cmd := exec.Command(absPath)
+	cmd.Dir = workDir
+
+	// 创建输入管道
+	stdin, err := cmd.StdinPipe()
+	if err != nil {
+		log.Printf("创建输入管道失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "创建输入管道失败: " + err.Error()})
+		return
+	}
+
+	// 启动命令
+	if err := cmd.Start(); err != nil {
+		log.Printf("启动命令失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "启动命令失败: " + err.Error()})
+		return
+	}
+
+	// 发送选项"3"(功能3:通过access_token获取文章列表)
+	fmt.Fprintln(stdin, "3")
+	fmt.Fprintln(stdin, req.AccessToken)
+	if req.Pages > 0 {
+		fmt.Fprintf(stdin, "%d\n", req.Pages)
+	} else {
+		fmt.Fprintln(stdin, "0")
+	}
+	stdin.Close()
+
+	// 等待命令完成
+	if err := cmd.Wait(); err != nil {
+		log.Printf("命令执行失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "命令执行失败: " + err.Error()})
+		return
+	}
+
+	currentTask.Running = false
+	currentTask.Progress = 100
+	currentTask.Message = "文章列表获取完成"
+
+	// 查找生成的文件并返回下载链接
+	dataDir := "../data"
+	entries, err := os.ReadDir(dataDir)
+	if err != nil {
+		writeJSON(w, Response{Success: false, Message: "读取数据目录失败: " + err.Error()})
return + } + + // 查找最新创建的公众号目录 + var latestDir string + var latestTime time.Time + for _, entry := range entries { + if entry.IsDir() && entry.Name() != "." && entry.Name() != ".." { + info, _ := entry.Info() + if info.ModTime().After(latestTime) { + latestTime = info.ModTime() + latestDir = entry.Name() + } + } + } + + if latestDir == "" { + writeJSON(w, Response{Success: false, Message: "未找到生成的数据目录"}) + return + } + + log.Printf("找到最新目录: %s", latestDir) + + // 查找文章列表文件(优先查找直连链接文件) + accountPath := filepath.Join(dataDir, latestDir) + files, err := os.ReadDir(accountPath) + if err != nil { + writeJSON(w, Response{Success: false, Message: "读取公众号目录失败: " + err.Error()}) + return + } + + var excelFile string + // 优先查找直连链接文件(.xlsx或.txt) + for _, file := range files { + if !file.IsDir() && strings.Contains(file.Name(), "直连链接") { + if strings.HasSuffix(file.Name(), ".xlsx") || strings.HasSuffix(file.Name(), ".txt") { + excelFile = file.Name() + log.Printf("找到直连链接文件: %s", excelFile) + break + } + } + } + + // 如果没有直连链接文件,查找原始链接文件 + if excelFile == "" { + for _, file := range files { + if !file.IsDir() && strings.Contains(file.Name(), "原始链接") { + if strings.HasSuffix(file.Name(), ".xlsx") || strings.HasSuffix(file.Name(), ".txt") { + excelFile = file.Name() + log.Printf("找到原始链接文件: %s", excelFile) + break + } + } + } + } + + // 如果还是没有,查找任何文章列表文件 + if excelFile == "" { + for _, file := range files { + if !file.IsDir() && strings.Contains(file.Name(), "文章列表") { + if strings.HasSuffix(file.Name(), ".xlsx") || strings.HasSuffix(file.Name(), ".txt") { + excelFile = file.Name() + log.Printf("找到文章列表文件: %s", excelFile) + break + } + } + } + } + + if excelFile == "" { + // 列出所有文件用于调试 + var fileList []string + for _, file := range files { + fileList = append(fileList, file.Name()) + } + log.Printf("目录 %s 中的文件: %v", latestDir, fileList) + writeJSON(w, Response{Success: false, Message: "未找到Excel文件,目录中的文件: " + strings.Join(fileList, ", ")}) + return + } + + writeJSON(w, Response{ + 
Success: true,
+		Message: "文章列表获取成功",
+		Data: map[string]interface{}{
+			"account":  latestDir,
+			"filename": excelFile,
+			// 下载路由注册在 /api/download/,此处链接需带 /api 前缀
+			"download": fmt.Sprintf("/api/download/%s/%s", latestDir, excelFile),
+		},
+	})
+}
+
+// 批量下载文章
+func batchDownloadHandler(w http.ResponseWriter, r *http.Request) {
+	var req struct {
+		OfficialAccount string `json:"official_account"`
+		SaveImage       bool   `json:"save_image"`
+		SaveContent     bool   `json:"save_content"`
+	}
+
+	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
+		writeJSON(w, Response{Success: false, Message: "请求参数错误"})
+		return
+	}
+
+	currentTask.Running = true
+	currentTask.Progress = 0
+	currentTask.Message = "正在批量下载文章..."
+
+	// 同步执行爬虫程序(功能5)
+	exePath := filepath.Join("..", "wechat-crawler.exe")
+	absPath, _ := filepath.Abs(exePath)
+	workDir, _ := filepath.Abs("..")
+
+	log.Printf("启动功能5: %s, 工作目录: %s", absPath, workDir)
+	cmd := exec.Command(absPath)
+	cmd.Dir = workDir
+
+	// 创建输入管道
+	stdin, err := cmd.StdinPipe()
+	if err != nil {
+		log.Printf("创建输入管道失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "创建输入管道失败: " + err.Error()})
+		return
+	}
+
+	// 启动命令
+	if err := cmd.Start(); err != nil {
+		log.Printf("启动命令失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "启动命令失败: " + err.Error()})
+		return
+	}
+
+	// 发送选项"5"(功能5:批量下载)
+	fmt.Fprintln(stdin, "5")
+	fmt.Fprintln(stdin, req.OfficialAccount)
+
+	// 是否保存图片
+	if req.SaveImage {
+		fmt.Fprintln(stdin, "y")
+	} else {
+		fmt.Fprintln(stdin, "n")
+	}
+	stdin.Close()
+
+	// 等待命令完成
+	if err := cmd.Wait(); err != nil {
+		log.Printf("命令执行失败: %v", err)
+		currentTask.Running = false
+		writeJSON(w, Response{Success: false, Message: "命令执行失败: " + err.Error()})
+		return
+	}
+
+	currentTask.Running = false
+	currentTask.Progress = 100
+	currentTask.Message = "批量下载完成"
+
+	// 统计下载的文章数量
+	accountPath := filepath.Join("../data", req.OfficialAccount, "文章详细")
+	var articleCount int
+	if entries, err :=
os.ReadDir(accountPath); err == nil { + articleCount = len(entries) + } + + writeJSON(w, Response{ + Success: true, + Message: fmt.Sprintf("批量下载完成,共下载 %d 篇文章", articleCount), + Data: map[string]interface{}{ + "account": req.OfficialAccount, + "articleCount": articleCount, + "path": accountPath, + }, + }) +} + +// 获取数据列表 +func getDataListHandler(w http.ResponseWriter, r *http.Request) { + dataDir := "../data" + var accounts []map[string]interface{} + + entries, err := os.ReadDir(dataDir) + if err != nil { + // 如果目录不存在,返回空列表而不是错误 + writeJSON(w, Response{ + Success: true, + Data: accounts, + }) + return + } + + for _, entry := range entries { + if entry.IsDir() { + accountPath := filepath.Join(dataDir, entry.Name()) + + // 统计文章数量 + detailPath := filepath.Join(accountPath, "文章详细") + var articleCount int + if detailEntries, err := os.ReadDir(detailPath); err == nil { + articleCount = len(detailEntries) + } + + // 获取最后更新时间 + info, _ := entry.Info() + lastUpdate := info.ModTime().Format("2006-01-02") + + accounts = append(accounts, map[string]interface{}{ + "name": entry.Name(), + "articleCount": articleCount, + "path": accountPath, + "lastUpdate": lastUpdate, + }) + } + } + + writeJSON(w, Response{ + Success: true, + Data: accounts, + }) +} + +// 获取任务状态 +func getTaskStatusHandler(w http.ResponseWriter, r *http.Request) { + writeJSON(w, Response{ + Success: true, + Data: currentTask, + }) +} + +// 下载文件处理 +func downloadFileHandler(w http.ResponseWriter, r *http.Request) { + // 从 URL 中提取路径 /api/download/公众号名称/文件名 + path := strings.TrimPrefix(r.URL.Path, "/api/download/") + parts := strings.SplitN(path, "/", 2) + + if len(parts) != 2 { + http.Error(w, "路径错误", http.StatusBadRequest) + return + } + + accountName := parts[0] + filename := parts[1] + + // 构建完整文件路径 + filePath := filepath.Join("..", "data", accountName, filename) + absPath, _ := filepath.Abs(filePath) + + // 检查文件是否存在 + if _, err := os.Stat(absPath); os.IsNotExist(err) { + http.Error(w, "文件不存在", 
http.StatusNotFound) + return + } + + log.Printf("下载文件: %s", absPath) + + // 设置响应头 + contentType := "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" + if strings.HasSuffix(filename, ".txt") { + contentType = "text/plain; charset=utf-8" + } + w.Header().Set("Content-Type", contentType) + w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename*=UTF-8''%s", filename)) + + // 发送文件 + http.ServeFile(w, r, absPath) +} + +// 写入JSON响应 +func writeJSON(w http.ResponseWriter, data interface{}) { + w.Header().Set("Content-Type", "application/json; charset=utf-8") + json.NewEncoder(w).Encode(data) +} diff --git a/backend/api/start_api.bat b/backend/api/start_api.bat new file mode 100644 index 0000000..4ad0871 --- /dev/null +++ b/backend/api/start_api.bat @@ -0,0 +1,23 @@ +@echo off +chcp 65001 >nul +title 微信公众号文章爬虫 - API服务器 + +:: 检查api_server.exe是否存在 +if not exist "api_server.exe" ( + echo =============================================== + echo ⚠️ API服务器未编译 + echo =============================================== + echo. + echo 正在编译 API 服务器... + echo. 
+ call build.bat + if %errorlevel% neq 0 ( + echo 编译失败,无法启动服务器 + pause + exit /b 1 + ) +) + +:: 启动API服务器 +cls +api_server.exe diff --git a/backend/cmd/data/研招网资讯/文章列表(article_list)_直连链接.txt b/backend/cmd/data/研招网资讯/文章列表(article_list)_直连链接.txt new file mode 100644 index 0000000..95d7ab7 --- /dev/null +++ b/backend/cmd/data/研招网资讯/文章列表(article_list)_直连链接.txt @@ -0,0 +1,11 @@ +序号,创建时间,标题,链接 +1,0,专家分析2026年考研报名人数,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500657&idx=1&sn=81eae7df4bfa2fdfc8bca69389489c52&chksm=ea981e22044aca6bbe5633849bfcd4903cb6f491646cd2ccf9321f4d9c852c64fe036f033c14&scene=27#wechat_redirect +2,0,教育部:2026年全国硕士研究生报名人数为343万,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500650&idx=1&sn=9f230bbfefb24d98c18e42bd3651ad53&chksm=eac72972d56ff9b66f3658f0c3b1e6e363e56ddf879d56aba9c9c8f587b53ef00bcabe7992ff&scene=27#wechat_redirect +3,0,【小研来了】“务必再坚持坚持”,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500645&idx=1&sn=8e1d5921861dc4e3647f7bf8adaada81&chksm=ea26b17ce2f7255aacd9d1d6358c9aeb8d4e043c692efb8b4d8183cfc8363b3068be79d585c2&scene=27#wechat_redirect +4,0,学累了不?点进来看看这4个“续航”方法,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500631&idx=1&sn=b640b0e43378e368166e50a7f46735f2&chksm=ea71f10a83b7811e1896cd9704eac5d064b763f3e020b5b37c72727c55bb1b0862a92e9c4cf0&scene=27#wechat_redirect +5,0,教育部:在“双一流”建设高校开展科技教育硕士培养,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500589&idx=1&sn=539d1229c9475ba5a2371698a362e9a7&chksm=ea4f97d3831139a276e50050f2f3307868b9c6ec7eb115bb9e288312f08572c47128a8016dce&scene=27#wechat_redirect +6,0,“研味儿”正浓,冲刺在即!请你一定别放弃,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500584&idx=1&sn=294b6ba8d12f0948913abf04af8cb188&chksm=ea4cfb5b16684bdd12634b6e46d8d8f3ab72ca9108be0d4d7f83dfded09c6ecb9f31b1531e31&scene=27#wechat_redirect 
+7,0,4个思维升级,让我找回了读研的掌控感,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500579&idx=1&sn=fa00084c8711e3009ff7e31fe0b3bc51&chksm=eaff1ec212ddbb738d20542a965bbd1b79ae3a9d2e5af5704ddcf41de3a8b8d658e562771f0c&scene=27#wechat_redirect +8,0,研考网上确认成功后,需重点关注四件事,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500569&idx=1&sn=7707b698932ff6847de39d7351d3ac98&chksm=ea402eec6a96125a5bb02600aff24c3c1211eb5aaf5347080bbfe1f5861e9ca97fe9c400df21&scene=27#wechat_redirect +9,0,, +10,0,【小研来了】“小研,没有准考证照片怎么办?”,http://mp.weixin.qq.com/s?__biz=MzI3NzQzODQ5OA==&mid=2247500553&idx=1&sn=4fc6fd69684f02222e72d457c1004a81&chksm=eafc91ea346080790f9b641495fc3d9e31302ee5c2c9957eb4fa2bc9a139eda78163899b9219&scene=27#wechat_redirect diff --git a/backend/cmd/main.go b/backend/cmd/main.go index 91fda65..c6a88a4 100644 --- a/backend/cmd/main.go +++ b/backend/cmd/main.go @@ -4,6 +4,7 @@ import ( "fmt" "io/ioutil" "log" + "net/url" "os" "path/filepath" "strings" @@ -600,21 +601,48 @@ func parseAccessTokenParams(accessToken string) (string, string, string, string, if err != nil { return "", "", "", "", fmt.Errorf("未找到__biz参数") } + // URL解码biz参数 + biz, err = url.QueryUnescape(biz) + if err != nil { + fmt.Printf("警告: URL解码__biz失败: %v,使用原始值\n", err) + } uin, err := utils.ExtractFromRegex(accessToken, "uin=([^&]*)") if err != nil { return "", "", "", "", fmt.Errorf("未找到uin参数") } + // URL解码uin参数 + uin, err = url.QueryUnescape(uin) + if err != nil { + fmt.Printf("警告: URL解码uin失败: %v,使用原始值\n", err) + } key, err := utils.ExtractFromRegex(accessToken, "key=([^&]*)") if err != nil { return "", "", "", "", fmt.Errorf("未找到key参数") } + // URL解码key参数 + key, err = url.QueryUnescape(key) + if err != nil { + fmt.Printf("警告: URL解码key失败: %v,使用原始值\n", err) + } passTicket, err := utils.ExtractFromRegex(accessToken, "pass_ticket=([^&]*)") if err != nil { return "", "", "", "", fmt.Errorf("未找到pass_ticket参数") } + // URL解码pass_ticket参数 + passTicket, err = url.QueryUnescape(passTicket) + if err != nil { + 
fmt.Printf("警告: URL解码pass_ticket失败: %v,使用原始值\n", err) + } + + // 打印解码后的参数用于调试 + fmt.Printf("\n提取到的参数(已解码):\n") + fmt.Printf(" __biz: %s\n", biz) + fmt.Printf(" uin: %s\n", uin) + fmt.Printf(" key长度: %d 字符\n", len(key)) + fmt.Printf(" pass_ticket长度: %d 字符\n", len(passTicket)) return biz, uin, key, passTicket, nil } diff --git a/backend/cmd/test_content_extraction.go b/backend/cmd/test_content_extraction.go deleted file mode 100644 index 9b07dca..0000000 --- a/backend/cmd/test_content_extraction.go +++ /dev/null @@ -1,27 +0,0 @@ -package main - -import ( - "fmt" - "os" - - "github.com/wechat-crawler/pkg/wechat" -) - -func main() { - fmt.Println("开始测试文章内容提取功能...") - - // 创建一个简单的爬虫实例 - crawler := wechat.NewSimpleCrawler() - - // 设置公众号名称(根据实际情况修改) - officialAccountName := "验证" - - // 调用GetListArticleFromFile函数测试 - err := crawler.GetListArticleFromFile(officialAccountName, false, true) - if err != nil { - fmt.Printf("测试失败: %v\n", err) - os.Exit(1) - } - - fmt.Println("测试完成!请检查文章内容是否已正确提取。") -} diff --git a/backend/config/config.go b/backend/configs/config.go similarity index 100% rename from backend/config/config.go rename to backend/configs/config.go diff --git a/backend/cookie.txt b/backend/cookie.txt deleted file mode 100644 index 70da82f..0000000 --- a/backend/cookie.txt +++ /dev/null @@ -1 +0,0 @@ -__biz=MzUxMjA4MTI0MjA1; uin=MTIzNDU2Nzg5; key=abcdef1234567890abcdef1234567890; pass_ticket=abcdefghijklmnopqrstuvwxyz1234567890; version=63090b13; wxtype=1; pass_ticket=abcdefghijklmnopqrstuvwxyz1234567890; diff --git a/backend/cookie.txt.example b/backend/cookie.txt.example deleted file mode 100644 index c49f5df..0000000 --- a/backend/cookie.txt.example +++ /dev/null @@ -1,12 +0,0 @@ -请将此文件重命名为cookie.txt,并填入微信公众平台的cookie信息 - -如何获取cookie: -1. 打开浏览器,登录微信公众平台 -2. 按F12打开开发者工具 -3. 切换到Network标签 -4. 刷新页面或访问任意页面 -5. 选择一个请求,查看Headers中的Cookie -6. 
复制完整的Cookie到本文件中 - -Cookie格式示例: -__biz=MzUxMjA4MTI0MjA1; uin=MTIzNDU2Nzg5; key=abcdef1234567890abcdef1234567890; pass_ticket=abcdefghijklmnopqrstuvwxyz1234567890; version=63090b13; wxtype=1; pass_ticket=abcdefghijklmnopqrstuvwxyz1234567890; \ No newline at end of file diff --git a/backend/debug_article_raw.html b/backend/debug_article_raw.html new file mode 100644 index 0000000..0af70ab --- /dev/null +++ b/backend/debug_article_raw.html @@ -0,0 +1,36624 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
[此处省略 debug_article_raw.html 的约 3.6 万行页面快照:内容为一篇已保存的微信文章页面(标题「进博会 | 帝人展台技术演讲会大咖云集!深度解读前沿产品与创新方案」,公众号「材料科学前沿」),其余均为封面图、分享按钮、「阅读原文」「继续滑动看下一个」等页面交互控件的残留,无可恢复的正文信息。]
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/backend/examples/database_example.go b/backend/examples/database_example.go new file mode 100644 index 0000000..7187562 --- /dev/null +++ b/backend/examples/database_example.go @@ -0,0 +1,246 @@ +package main + +import ( + "encoding/json" + "fmt" + "log" + + "github.com/wechat-crawler/pkg/database" +) + +func main() { + fmt.Println("==============================================") + fmt.Println(" 微信公众号文章数据库管理系统示例") + fmt.Println("==============================================\n") + + // 1. 初始化数据库 + db, err := database.InitDB("../data/wechat_articles.db") + if err != nil { + log.Fatal("数据库初始化失败:", err) + } + defer db.Close() + + // 2. 创建仓库实例 + officialRepo := database.NewOfficialAccountRepository(db) + articleRepo := database.NewArticleRepository(db) + contentRepo := database.NewArticleContentRepository(db) + + // 3. 示例:添加公众号 + fmt.Println("📝 示例1: 添加公众号信息") + official := &database.OfficialAccount{ + Biz: "MzI1NjEwMTM4OA==", + Nickname: "研招网资讯", + Homepage: "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI1NjEwMTM4OA==&scene=124", + Description: "中国研究生招生信息网官方公众号", + } + + // 检查是否已存在 + existing, err := officialRepo.GetByBiz(official.Biz) + if err != nil { + log.Fatal("查询公众号失败:", err) + } + + var officialID int64 + if existing == nil { + // 不存在,创建新记录 + officialID, err = officialRepo.Create(official) + if err != nil { + log.Fatal("创建公众号失败:", err) + } + fmt.Printf("✅ 成功创建公众号: %s (ID: %d)\n\n", official.Nickname, officialID) + } else { + // 已存在 + officialID = existing.ID + fmt.Printf("ℹ️ 公众号已存在: %s (ID: %d)\n\n", existing.Nickname, officialID) + } + + // 4. 
示例:添加文章 + fmt.Println("📝 示例2: 添加文章信息") + article := &database.Article{ + OfficialID: officialID, + Title: "专家分析2026年考研报名人数", + Author: "研招网资讯", + Link: "https://mp.weixin.qq.com/s?__biz=MzI1NjEwMTM4OA==&mid=2651232405&idx=1", + PublishTime: "2024-11-27 10:00:00", + CreateTime: "2024-11-27 15:30:00", + CommentID: "2247491372", + ReadNum: 15234, + LikeNum: 456, + ShareNum: 123, + ContentPreview: "根据最新统计数据显示,2026年全国硕士研究生报名人数预计将达到新高...", + ParagraphCount: 15, + } + + // 检查文章是否已存在 + existingArticle, err := articleRepo.GetByLink(article.Link) + if err != nil { + log.Fatal("查询文章失败:", err) + } + + var articleID int64 + if existingArticle == nil { + articleID, err = articleRepo.Create(article) + if err != nil { + log.Fatal("创建文章失败:", err) + } + fmt.Printf("✅ 成功创建文章: %s (ID: %d)\n\n", article.Title, articleID) + } else { + articleID = existingArticle.ID + fmt.Printf("ℹ️ 文章已存在: %s (ID: %d)\n\n", existingArticle.Title, articleID) + } + + // 5. 示例:添加文章内容 + fmt.Println("📝 示例3: 添加文章详细内容") + + paragraphs := []string{ + "根据最新统计数据显示,2026年全国硕士研究生报名人数预计将达到新高。", + "教育部相关负责人表示,随着社会对高层次人才需求的增加,考研热度持续上升。", + "专家建议考生理性选择,注重提升自身综合素质。", + } + + images := []string{ + "https://mmbiz.qpic.cn/mmbiz_jpg/xxx1.jpg", + "https://mmbiz.qpic.cn/mmbiz_jpg/xxx2.jpg", + } + + content := &database.ArticleContent{ + ArticleID: articleID, + HtmlContent: "
文章HTML内容
", + TextContent: "文章纯文本内容...", + Paragraphs: database.StringsToJSON(paragraphs), + Images: database.StringsToJSON(images), + } + + // 检查内容是否已存在 + existingContent, err := contentRepo.GetByArticleID(articleID) + if err != nil { + log.Fatal("查询文章内容失败:", err) + } + + if existingContent == nil { + contentID, err := contentRepo.Create(content) + if err != nil { + log.Fatal("创建文章内容失败:", err) + } + fmt.Printf("✅ 成功添加文章内容 (ID: %d)\n\n", contentID) + } else { + fmt.Printf("ℹ️ 文章内容已存在 (ID: %d)\n\n", existingContent.ID) + } + + // 6. 示例:查询文章列表 + fmt.Println("📋 示例4: 查询文章列表") + articles, total, err := articleRepo.List(officialID, 1, 10) + if err != nil { + log.Fatal("查询文章列表失败:", err) + } + + fmt.Printf("共找到 %d 篇文章:\n", total) + for i, item := range articles { + fmt.Printf("%d. %s (👁️ %d | 👍 %d)\n", i+1, item.Title, item.ReadNum, item.LikeNum) + } + fmt.Println() + + // 7. 示例:获取文章详情 + fmt.Println("📖 示例5: 获取文章详情") + detail, err := contentRepo.GetArticleDetail(articleID) + if err != nil { + log.Fatal("获取文章详情失败:", err) + } + + if detail != nil { + fmt.Printf("标题: %s\n", detail.Title) + fmt.Printf("作者: %s\n", detail.Author) + fmt.Printf("公众号: %s\n", detail.OfficialName) + fmt.Printf("发布时间: %s\n", detail.PublishTime) + fmt.Printf("阅读数: %d | 点赞数: %d\n", detail.ReadNum, detail.LikeNum) + fmt.Printf("段落数: %d\n", len(detail.Paragraphs)) + fmt.Printf("图片数: %d\n", len(detail.Images)) + if len(detail.Paragraphs) > 0 { + fmt.Printf("第一段: %s\n", detail.Paragraphs[0]) + } + } + fmt.Println() + + // 8. 示例:搜索文章 + fmt.Println("🔍 示例6: 搜索文章") + searchResults, searchTotal, err := articleRepo.Search("考研", 1, 10) + if err != nil { + log.Fatal("搜索文章失败:", err) + } + + fmt.Printf("搜索\"考研\"找到 %d 篇文章:\n", searchTotal) + for i, item := range searchResults { + fmt.Printf("%d. %s\n", i+1, item.Title) + } + fmt.Println() + + // 9. 
示例:获取统计信息 + fmt.Println("📊 示例7: 获取统计信息") + stats, err := db.GetStatistics() + if err != nil { + log.Fatal("获取统计信息失败:", err) + } + + fmt.Printf("公众号总数: %d\n", stats.TotalOfficials) + fmt.Printf("文章总数: %d\n", stats.TotalArticles) + fmt.Printf("总阅读数: %d\n", stats.TotalReadNum) + fmt.Printf("总点赞数: %d\n", stats.TotalLikeNum) + fmt.Println() + + // 10. 示例:批量插入文章 + fmt.Println("📦 示例8: 批量插入文章") + batchArticles := []*database.Article{ + { + OfficialID: officialID, + Title: "教育部:2026年全国硕士研究生报名人数为343万", + Author: "研招网资讯", + Link: "https://mp.weixin.qq.com/s?__biz=MzI1NjEwMTM4OA==&mid=2651232406", + PublishTime: "2024-11-26 09:00:00", + ReadNum: 8965, + LikeNum: 234, + ContentPreview: "教育部公布2026年研究生招生数据...", + ParagraphCount: 12, + }, + { + OfficialID: officialID, + Title: "研考网上确认成功后,需重点关注四件事", + Author: "研招网资讯", + Link: "https://mp.weixin.qq.com/s?__biz=MzI1NjEwMTM4OA==&mid=2651232407", + PublishTime: "2024-11-25 15:30:00", + ReadNum: 6543, + LikeNum: 189, + ContentPreview: "网上确认通过后,考生还需要注意以下事项...", + ParagraphCount: 8, + }, + } + + err = articleRepo.BatchInsertArticles(batchArticles) + if err != nil { + log.Fatal("批量插入文章失败:", err) + } + fmt.Printf("✅ 成功批量插入 %d 篇文章\n\n", len(batchArticles)) + + // 11. 
示例:导出JSON数据 + fmt.Println("💾 示例9: 导出文章列表为JSON") + allArticles, _, err := articleRepo.List(0, 1, 100) + if err != nil { + log.Fatal("查询文章列表失败:", err) + } + + jsonData, err := json.MarshalIndent(allArticles, "", " ") + if err != nil { + log.Fatal("JSON序列化失败:", err) + } + + fmt.Println("文章列表JSON (前200字符):") + if len(jsonData) > 200 { + fmt.Println(string(jsonData[:200]) + "...") + } else { + fmt.Println(string(jsonData)) + } + fmt.Println() + + fmt.Println("==============================================") + fmt.Println(" 数据库操作示例演示完成!") + fmt.Println("==============================================") +} diff --git a/backend/go.mod b/backend/go.mod deleted file mode 100644 index fa3be3b..0000000 --- a/backend/go.mod +++ /dev/null @@ -1,7 +0,0 @@ -module github.com/wechat-crawler - -go 1.20 - -require github.com/go-resty/resty/v2 v2.10.0 - -require golang.org/x/net v0.17.0 // indirect diff --git a/backend/go.sum b/backend/go.sum deleted file mode 100644 index 843de5e..0000000 --- a/backend/go.sum +++ /dev/null @@ -1,44 +0,0 @@ -github.com/go-resty/resty/v2 v2.10.0 h1:Qla4W/+TMmv0fOeeRqzEpXPLfTUnR5HZ1+lGs+CkiCo= -github.com/go-resty/resty/v2 v2.10.0/go.mod h1:iiP/OpA0CkcL3IGt1O0+/SIItFUbkkyw5BGXiVdTu+A= -github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY= -golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= -golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc= -golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4= -golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4= -golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs= -golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= -golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod 
h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= -golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= -golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs= -golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg= -golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM= -golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE= -golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= -golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.13.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= -golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8= -golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k= -golang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo= -golang.org/x/term v0.13.0/go.mod 
h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U= -golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= -golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= -golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ= -golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8= -golang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8= -golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE= -golang.org/x/time v0.3.0 h1:rg5rLMjNzMS1RkNLzCG38eapWhnYLFYXDXj2gOlr8j4= -golang.org/x/time v0.3.0/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ= -golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= -golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= -golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= -golang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU= -golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= diff --git a/backend/main.exe b/backend/main.exe deleted file mode 100644 index 6df4060..0000000 Binary files a/backend/main.exe and /dev/null differ diff --git a/backend/main.exe~ b/backend/main.exe~ deleted file mode 100644 index c5d73f8..0000000 Binary files a/backend/main.exe~ and /dev/null differ diff --git a/backend/pkg/database/db.go b/backend/pkg/database/db.go new file mode 100644 index 0000000..fdfc341 --- /dev/null +++ b/backend/pkg/database/db.go @@ -0,0 +1,117 @@ +package database + +import ( + "database/sql" + "fmt" + "os" + "path/filepath" + + _ "modernc.org/sqlite" +) + +// DB 数据库实例 +type DB struct { + *sql.DB +} + +// InitDB 初始化数据库 +func InitDB(dbPath string) (*DB, error) { + // 确保数据库目录存在 + dbDir := filepath.Dir(dbPath) + if err := os.MkdirAll(dbDir, 0755); err != nil { + return nil, 
fmt.Errorf("创建数据库目录失败: %v", err) + } + + // 打开数据库连接(使用modernc.org/sqlite驱动) + db, err := sql.Open("sqlite", dbPath) + if err != nil { + return nil, fmt.Errorf("打开数据库失败: %v", err) + } + + // 测试连接 + if err := db.Ping(); err != nil { + return nil, fmt.Errorf("数据库连接测试失败: %v", err) + } + + // 创建表 + if err := createTables(db); err != nil { + return nil, fmt.Errorf("创建数据表失败: %v", err) + } + + fmt.Println("✅ 数据库初始化成功:", dbPath) + return &DB{db}, nil +} + +// createTables 创建数据表 +func createTables(db *sql.DB) error { + // 公众号表 + officialAccountTable := ` + CREATE TABLE IF NOT EXISTS official_accounts ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + biz TEXT NOT NULL UNIQUE, + nickname TEXT NOT NULL, + homepage TEXT, + description TEXT, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP, + updated_at DATETIME DEFAULT CURRENT_TIMESTAMP + ); + CREATE INDEX IF NOT EXISTS idx_biz ON official_accounts(biz); + CREATE INDEX IF NOT EXISTS idx_nickname ON official_accounts(nickname); + ` + + // 文章表 + articleTable := ` + CREATE TABLE IF NOT EXISTS articles ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + official_id INTEGER NOT NULL, + title TEXT NOT NULL, + author TEXT, + link TEXT UNIQUE, + publish_time TEXT, + create_time TEXT, + comment_id TEXT, + read_num INTEGER DEFAULT 0, + like_num INTEGER DEFAULT 0, + share_num INTEGER DEFAULT 0, + content_preview TEXT, + paragraph_count INTEGER DEFAULT 0, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP, + updated_at DATETIME DEFAULT CURRENT_TIMESTAMP, + FOREIGN KEY (official_id) REFERENCES official_accounts(id) + ); + CREATE INDEX IF NOT EXISTS idx_official_id ON articles(official_id); + CREATE INDEX IF NOT EXISTS idx_title ON articles(title); + CREATE INDEX IF NOT EXISTS idx_publish_time ON articles(publish_time); + CREATE INDEX IF NOT EXISTS idx_link ON articles(link); + ` + + // 文章内容表 + articleContentTable := ` + CREATE TABLE IF NOT EXISTS article_contents ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + article_id INTEGER NOT NULL UNIQUE, + html_content 
TEXT, + text_content TEXT, + paragraphs TEXT, + images TEXT, + created_at DATETIME DEFAULT CURRENT_TIMESTAMP, + FOREIGN KEY (article_id) REFERENCES articles(id) ON DELETE CASCADE + ); + CREATE INDEX IF NOT EXISTS idx_article_id ON article_contents(article_id); + ` + + // 执行创建表语句 + tables := []string{officialAccountTable, articleTable, articleContentTable} + for _, table := range tables { + if _, err := db.Exec(table); err != nil { + return err + } + } + + return nil +} + +// Close 关闭数据库连接 +func (db *DB) Close() error { + return db.DB.Close() +} diff --git a/backend/pkg/database/models.go b/backend/pkg/database/models.go new file mode 100644 index 0000000..3fcd556 --- /dev/null +++ b/backend/pkg/database/models.go @@ -0,0 +1,76 @@ +package database + +import ( + "time" +) + +// OfficialAccount 公众号信息 +type OfficialAccount struct { + ID int64 `json:"id"` + Biz string `json:"biz"` // 公众号唯一标识 + Nickname string `json:"nickname"` // 公众号名称 + Homepage string `json:"homepage"` // 公众号主页链接 + Description string `json:"description"` // 公众号描述 + CreatedAt time.Time `json:"created_at"` // 创建时间 + UpdatedAt time.Time `json:"updated_at"` // 更新时间 +} + +// Article 文章信息 +type Article struct { + ID int64 `json:"id"` + OfficialID int64 `json:"official_id"` // 关联的公众号ID + Title string `json:"title"` // 文章标题 + Author string `json:"author"` // 作者 + Link string `json:"link"` // 文章链接 + PublishTime string `json:"publish_time"` // 发布时间 + CreateTime string `json:"create_time"` // 创建时间(抓取时间) + CommentID string `json:"comment_id"` // 评论ID + ReadNum int `json:"read_num"` // 阅读数 + LikeNum int `json:"like_num"` // 点赞数 + ShareNum int `json:"share_num"` // 分享数 + ContentPreview string `json:"content_preview"` // 内容预览(前200字) + ParagraphCount int `json:"paragraph_count"` // 段落数 + CreatedAt time.Time `json:"created_at"` // 数据库创建时间 + UpdatedAt time.Time `json:"updated_at"` // 数据库更新时间 +} + +// ArticleContent 文章详细内容 +type ArticleContent struct { + ID int64 `json:"id"` + ArticleID int64 `json:"article_id"` // 
关联的文章ID + HtmlContent string `json:"html_content"` // HTML原始内容 + TextContent string `json:"text_content"` // 纯文本内容 + Paragraphs string `json:"paragraphs"` // 段落内容(JSON数组) + Images string `json:"images"` // 图片链接(JSON数组) + CreatedAt time.Time `json:"created_at"` // 创建时间 +} + +// ArticleListItem 文章列表项(用于API返回) +type ArticleListItem struct { + ID int64 `json:"id"` + Title string `json:"title"` + Author string `json:"author"` + PublishTime string `json:"publish_time"` + ReadNum int `json:"read_num"` + LikeNum int `json:"like_num"` + OfficialName string `json:"official_name"` + ContentPreview string `json:"content_preview"` +} + +// ArticleDetail 文章详情(用于API返回) +type ArticleDetail struct { + Article + OfficialName string `json:"official_name"` + HtmlContent string `json:"html_content"` + TextContent string `json:"text_content"` + Paragraphs []string `json:"paragraphs"` + Images []string `json:"images"` +} + +// Statistics 统计信息 +type Statistics struct { + TotalOfficials int `json:"total_officials"` // 公众号总数 + TotalArticles int `json:"total_articles"` // 文章总数 + TotalReadNum int `json:"total_read_num"` // 总阅读数 + TotalLikeNum int `json:"total_like_num"` // 总点赞数 +} diff --git a/backend/pkg/database/repository.go b/backend/pkg/database/repository.go new file mode 100644 index 0000000..332f858 --- /dev/null +++ b/backend/pkg/database/repository.go @@ -0,0 +1,455 @@ +package database + +import ( + "database/sql" + "encoding/json" + "fmt" + "strings" +) + +// OfficialAccountRepository 公众号数据仓库 +type OfficialAccountRepository struct { + db *DB +} + +// NewOfficialAccountRepository 创建公众号仓库 +func NewOfficialAccountRepository(db *DB) *OfficialAccountRepository { + return &OfficialAccountRepository{db: db} +} + +// Create 创建公众号 +func (r *OfficialAccountRepository) Create(account *OfficialAccount) (int64, error) { + result, err := r.db.Exec(` + INSERT INTO official_accounts (biz, nickname, homepage, description) + VALUES (?, ?, ?, ?) 
+ `, account.Biz, account.Nickname, account.Homepage, account.Description) + + if err != nil { + return 0, err + } + + return result.LastInsertId() +} + +// GetByBiz 根据Biz获取公众号 +func (r *OfficialAccountRepository) GetByBiz(biz string) (*OfficialAccount, error) { + account := &OfficialAccount{} + err := r.db.QueryRow(` + SELECT id, biz, nickname, homepage, description, created_at, updated_at + FROM official_accounts WHERE biz = ? + `, biz).Scan(&account.ID, &account.Biz, &account.Nickname, &account.Homepage, + &account.Description, &account.CreatedAt, &account.UpdatedAt) + + if err == sql.ErrNoRows { + return nil, nil + } + if err != nil { + return nil, err + } + + return account, nil +} + +// GetByID 根据ID获取公众号 +func (r *OfficialAccountRepository) GetByID(id int64) (*OfficialAccount, error) { + account := &OfficialAccount{} + err := r.db.QueryRow(` + SELECT id, biz, nickname, homepage, description, created_at, updated_at + FROM official_accounts WHERE id = ? + `, id).Scan(&account.ID, &account.Biz, &account.Nickname, &account.Homepage, + &account.Description, &account.CreatedAt, &account.UpdatedAt) + + if err == sql.ErrNoRows { + return nil, nil + } + if err != nil { + return nil, err + } + + return account, nil +} + +// List 获取所有公众号列表 +func (r *OfficialAccountRepository) List() ([]*OfficialAccount, error) { + rows, err := r.db.Query(` + SELECT id, biz, nickname, homepage, description, created_at, updated_at + FROM official_accounts ORDER BY created_at DESC + `) + if err != nil { + return nil, err + } + defer rows.Close() + + var accounts []*OfficialAccount + for rows.Next() { + account := &OfficialAccount{} + err := rows.Scan(&account.ID, &account.Biz, &account.Nickname, &account.Homepage, + &account.Description, &account.CreatedAt, &account.UpdatedAt) + if err != nil { + return nil, err + } + accounts = append(accounts, account) + } + + return accounts, nil +} + +// Update 更新公众号信息 +func (r *OfficialAccountRepository) Update(account *OfficialAccount) error { + _, 
err := r.db.Exec(` + UPDATE official_accounts + SET nickname = ?, homepage = ?, description = ?, updated_at = CURRENT_TIMESTAMP + WHERE id = ? + `, account.Nickname, account.Homepage, account.Description, account.ID) + + return err +} + +// ArticleRepository 文章数据仓库 +type ArticleRepository struct { + db *DB +} + +// NewArticleRepository 创建文章仓库 +func NewArticleRepository(db *DB) *ArticleRepository { + return &ArticleRepository{db: db} +} + +// Create 创建文章 +func (r *ArticleRepository) Create(article *Article) (int64, error) { + result, err := r.db.Exec(` + INSERT INTO articles ( + official_id, title, author, link, publish_time, create_time, + comment_id, read_num, like_num, share_num, content_preview, paragraph_count + ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + `, article.OfficialID, article.Title, article.Author, article.Link, + article.PublishTime, article.CreateTime, article.CommentID, + article.ReadNum, article.LikeNum, article.ShareNum, + article.ContentPreview, article.ParagraphCount) + + if err != nil { + return 0, err + } + + return result.LastInsertId() +} + +// GetByID 根据ID获取文章 +func (r *ArticleRepository) GetByID(id int64) (*Article, error) { + article := &Article{} + err := r.db.QueryRow(` + SELECT id, official_id, title, author, link, publish_time, create_time, + comment_id, read_num, like_num, share_num, content_preview, + paragraph_count, created_at, updated_at + FROM articles WHERE id = ? 
+ `, id).Scan(&article.ID, &article.OfficialID, &article.Title, &article.Author, + &article.Link, &article.PublishTime, &article.CreateTime, &article.CommentID, + &article.ReadNum, &article.LikeNum, &article.ShareNum, &article.ContentPreview, + &article.ParagraphCount, &article.CreatedAt, &article.UpdatedAt) + + if err == sql.ErrNoRows { + return nil, nil + } + if err != nil { + return nil, err + } + + return article, nil +} + +// GetByLink 根据链接获取文章 +func (r *ArticleRepository) GetByLink(link string) (*Article, error) { + article := &Article{} + err := r.db.QueryRow(` + SELECT id, official_id, title, author, link, publish_time, create_time, + comment_id, read_num, like_num, share_num, content_preview, + paragraph_count, created_at, updated_at + FROM articles WHERE link = ? + `, link).Scan(&article.ID, &article.OfficialID, &article.Title, &article.Author, + &article.Link, &article.PublishTime, &article.CreateTime, &article.CommentID, + &article.ReadNum, &article.LikeNum, &article.ShareNum, &article.ContentPreview, + &article.ParagraphCount, &article.CreatedAt, &article.UpdatedAt) + + if err == sql.ErrNoRows { + return nil, nil + } + if err != nil { + return nil, err + } + + return article, nil +} + +// List 获取文章列表(分页) +func (r *ArticleRepository) List(officialID int64, page, pageSize int) ([]*ArticleListItem, int, error) { + // 构建查询条件 + whereClause := "" + args := []interface{}{} + + if officialID > 0 { + whereClause = "WHERE a.official_id = ?" 
+ args = append(args, officialID) + } + + // 获取总数 + countQuery := fmt.Sprintf("SELECT COUNT(*) FROM articles a %s", whereClause) + var total int + err := r.db.QueryRow(countQuery, args...).Scan(&total) + if err != nil { + return nil, 0, err + } + + // 获取列表 + offset := (page - 1) * pageSize + listQuery := fmt.Sprintf(` + SELECT a.id, a.title, a.author, a.publish_time, a.read_num, a.like_num, + a.content_preview, o.nickname + FROM articles a + LEFT JOIN official_accounts o ON a.official_id = o.id + %s + ORDER BY a.publish_time DESC + LIMIT ? OFFSET ? + `, whereClause) + + args = append(args, pageSize, offset) + rows, err := r.db.Query(listQuery, args...) + if err != nil { + return nil, 0, err + } + defer rows.Close() + + var items []*ArticleListItem + for rows.Next() { + item := &ArticleListItem{} + err := rows.Scan(&item.ID, &item.Title, &item.Author, &item.PublishTime, + &item.ReadNum, &item.LikeNum, &item.ContentPreview, &item.OfficialName) + if err != nil { + return nil, 0, err + } + items = append(items, item) + } + + return items, total, nil +} + +// Search 搜索文章 +func (r *ArticleRepository) Search(keyword string, page, pageSize int) ([]*ArticleListItem, int, error) { + keyword = "%" + keyword + "%" + + // 获取总数 + var total int + err := r.db.QueryRow(` + SELECT COUNT(*) FROM articles WHERE title LIKE ? OR author LIKE ? + `, keyword, keyword).Scan(&total) + if err != nil { + return nil, 0, err + } + + // 获取列表 + offset := (page - 1) * pageSize + rows, err := r.db.Query(` + SELECT a.id, a.title, a.author, a.publish_time, a.read_num, a.like_num, + a.content_preview, o.nickname + FROM articles a + LEFT JOIN official_accounts o ON a.official_id = o.id + WHERE a.title LIKE ? OR a.author LIKE ? + ORDER BY a.publish_time DESC + LIMIT ? OFFSET ? 
+ `, keyword, keyword, pageSize, offset) + if err != nil { + return nil, 0, err + } + defer rows.Close() + + var items []*ArticleListItem + for rows.Next() { + item := &ArticleListItem{} + err := rows.Scan(&item.ID, &item.Title, &item.Author, &item.PublishTime, + &item.ReadNum, &item.LikeNum, &item.ContentPreview, &item.OfficialName) + if err != nil { + return nil, 0, err + } + items = append(items, item) + } + + return items, total, nil +} + +// Update 更新文章信息 +func (r *ArticleRepository) Update(article *Article) error { + _, err := r.db.Exec(` + UPDATE articles + SET read_num = ?, like_num = ?, share_num = ?, updated_at = CURRENT_TIMESTAMP + WHERE id = ? + `, article.ReadNum, article.LikeNum, article.ShareNum, article.ID) + + return err +} + +// ArticleContentRepository 文章内容数据仓库 +type ArticleContentRepository struct { + db *DB +} + +// NewArticleContentRepository 创建文章内容仓库 +func NewArticleContentRepository(db *DB) *ArticleContentRepository { + return &ArticleContentRepository{db: db} +} + +// Create 创建文章内容 +func (r *ArticleContentRepository) Create(content *ArticleContent) (int64, error) { + result, err := r.db.Exec(` + INSERT INTO article_contents (article_id, html_content, text_content, paragraphs, images) + VALUES (?, ?, ?, ?, ?) + `, content.ArticleID, content.HtmlContent, content.TextContent, + content.Paragraphs, content.Images) + + if err != nil { + return 0, err + } + + return result.LastInsertId() +} + +// GetByArticleID 根据文章ID获取内容 +func (r *ArticleContentRepository) GetByArticleID(articleID int64) (*ArticleContent, error) { + content := &ArticleContent{} + err := r.db.QueryRow(` + SELECT id, article_id, html_content, text_content, paragraphs, images, created_at + FROM article_contents WHERE article_id = ? 
+	`, articleID).Scan(&content.ID, &content.ArticleID, &content.HtmlContent,
+		&content.TextContent, &content.Paragraphs, &content.Images, &content.CreatedAt)
+
+	if err == sql.ErrNoRows {
+		return nil, nil
+	}
+	if err != nil {
+		return nil, err
+	}
+
+	return content, nil
+}
+
+// GetArticleDetail 获取文章详情(包含内容)
+func (r *ArticleContentRepository) GetArticleDetail(articleID int64) (*ArticleDetail, error) {
+	detail := &ArticleDetail{}
+	// LEFT JOIN 出来的列在没有匹配行时为 NULL,用 sql.NullString 接收,避免 Scan 到 string 时报错
+	var officialName, htmlContent, textContent, paragraphsJSON, imagesJSON sql.NullString
+
+	err := r.db.QueryRow(`
+		SELECT a.id, a.official_id, a.title, a.author, a.link, a.publish_time,
+			a.create_time, a.comment_id, a.read_num, a.like_num, a.share_num,
+			a.content_preview, a.paragraph_count, a.created_at, a.updated_at,
+			o.nickname, c.html_content, c.text_content, c.paragraphs, c.images
+		FROM articles a
+		LEFT JOIN official_accounts o ON a.official_id = o.id
+		LEFT JOIN article_contents c ON a.id = c.article_id
+		WHERE a.id = ?
+	`, articleID).Scan(
+		&detail.ID, &detail.OfficialID, &detail.Title, &detail.Author,
+		&detail.Link, &detail.PublishTime, &detail.CreateTime, &detail.CommentID,
+		&detail.ReadNum, &detail.LikeNum, &detail.ShareNum, &detail.ContentPreview,
+		&detail.ParagraphCount, &detail.CreatedAt, &detail.UpdatedAt,
+		&officialName, &htmlContent, &textContent,
+		&paragraphsJSON, &imagesJSON,
+	)
+
+	if err == sql.ErrNoRows {
+		return nil, nil
+	}
+	if err != nil {
+		return nil, err
+	}
+
+	detail.OfficialName = officialName.String
+	detail.HtmlContent = htmlContent.String
+	detail.TextContent = textContent.String
+
+	// 解析JSON数组
+	if paragraphsJSON.String != "" {
+		json.Unmarshal([]byte(paragraphsJSON.String), &detail.Paragraphs)
+	}
+	if imagesJSON.String != "" {
+		json.Unmarshal([]byte(imagesJSON.String), &detail.Images)
+	}
+
+	return detail, nil
+}
+
+// GetStatistics 获取统计信息
+func (db *DB) GetStatistics() (*Statistics, error) {
+	stats := &Statistics{}
+
+	err := db.QueryRow(`
+		SELECT
+			(SELECT COUNT(*) FROM official_accounts) as total_officials,
+			(SELECT COUNT(*) FROM articles) as total_articles,
+			(SELECT COALESCE(SUM(read_num), 0) FROM articles) as total_read_num,
+			(SELECT COALESCE(SUM(like_num),
0) FROM articles) as total_like_num
+	`).Scan(&stats.TotalOfficials, &stats.TotalArticles, &stats.TotalReadNum, &stats.TotalLikeNum)
+
+	if err != nil {
+		return nil, err
+	}
+
+	return stats, nil
+}
+
+// BatchInsertArticles 批量插入文章
+func (r *ArticleRepository) BatchInsertArticles(articles []*Article) error {
+	if len(articles) == 0 {
+		return nil
+	}
+
+	// 开始事务
+	tx, err := r.db.Begin()
+	if err != nil {
+		return err
+	}
+	defer tx.Rollback()
+
+	stmt, err := tx.Prepare(`
+		INSERT OR IGNORE INTO articles (
+			official_id, title, author, link, publish_time, create_time,
+			comment_id, read_num, like_num, share_num, content_preview, paragraph_count
+		) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+	`)
+	if err != nil {
+		return err
+	}
+	defer stmt.Close()
+
+	for _, article := range articles {
+		_, err = stmt.Exec(
+			article.OfficialID, article.Title, article.Author, article.Link,
+			article.PublishTime, article.CreateTime, article.CommentID,
+			article.ReadNum, article.LikeNum, article.ShareNum,
+			article.ContentPreview, article.ParagraphCount,
+		)
+		if err != nil {
+			return err
+		}
+	}
+
+	return tx.Commit()
+}
+
+// Helper function: 将字符串数组转换为JSON字符串
+func StringsToJSON(strs []string) string {
+	if len(strs) == 0 {
+		return "[]"
+	}
+	data, _ := json.Marshal(strs)
+	return string(data)
+}
+
+// Helper function: 生成内容预览(按字符数截断,使用 []rune 避免把多字节UTF-8字符切成两半)
+func GeneratePreview(content string, maxLen int) string {
+	if len([]rune(content)) <= maxLen {
+		return content
+	}
+	// 移除换行符和多余空格
+	content = strings.ReplaceAll(content, "\n", " ")
+	content = strings.ReplaceAll(content, "\r", "")
+	content = strings.Join(strings.Fields(content), " ")
+
+	runes := []rune(content)
+	if len(runes) <= maxLen {
+		return content
+	}
+	return string(runes[:maxLen]) + "..."
+} diff --git a/backend/pkg/wechat/access_articles.go b/backend/pkg/wechat/access_articles.go index 9838bcc..dedba0f 100644 --- a/backend/pkg/wechat/access_articles.go +++ b/backend/pkg/wechat/access_articles.go @@ -88,52 +88,86 @@ func NewSimpleCrawler() *WechatCrawler { // GetOfficialAccountName 获取公众号名称 func (w *WechatCrawler) GetOfficialAccountName() (string, error) { - url := fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=%s&scene=124", w.biz) - resp, err := w.client.R().Get(url) + // 如果有登录凭证,使用带认证的请求(更可靠) + var url string + if w.uin != "" && w.key != "" && w.passTicket != "" { + // 带登录信息的请求,可以绕过验证页面 + url = fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=%s&scene=124&uin=%s&key=%s&pass_ticket=%s", + w.biz, w.uin, w.key, w.passTicket) + } else { + // 不带登录信息的请求(可能会遇到验证页面) + url = fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=%s&scene=124", w.biz) + } + + // 设置更完整的请求头,模拟真实浏览器 + resp, err := w.client.R(). + SetHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"). + SetHeader("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8"). + SetHeader("Cache-Control", "max-age=0"). + SetHeader("Upgrade-Insecure-Requests", "1"). + SetHeader("Referer", "https://mp.weixin.qq.com/"). 
+		Get(url)
+
 	if err != nil {
 		return "", fmt.Errorf("获取公众号信息失败: %v", err)
 	}

+	// 检查 HTTP 状态码
+	if resp.StatusCode() != 200 {
+		return "", fmt.Errorf("获取公众号信息失败: HTTP状态码 %d", resp.StatusCode())
+	}
+
 	content := resp.String()

+	// 调试:检查响应内容的前500字符
+	if len(content) < 100 {
+		return "", fmt.Errorf("响应内容过短,可能是请求失败: %s", content)
+	}
+
 	// 尝试多种正则表达式模式来提取公众号名称
-	// 模式1: 匹配格式: var nickname = "公众号名称".html(false) || "";
-	nicknameRegex := regexp.MustCompile(`var nickname = "([^"]+)"\.html\(false\)\s*\|\|\s*""`)
-	match := nicknameRegex.FindStringSubmatch(content)
-	if len(match) >= 2 {
-		return match[1], nil
-	}
-	// 模式2: 原始模式
-	nicknameRegex2 := regexp.MustCompile(`var nickname = "(.*?)";`)
-	match = nicknameRegex2.FindStringSubmatch(content)
-	if len(match) >= 2 {
-		return match[1], nil
+	// 优先级顺序:var nickname > JSON格式 > HTML title
+	patterns := []struct {
+		pattern string
+		desc    string
+	}{
+		{`var nickname\s*=\s*['"](.+?)['"]`, "var nickname变量"},
+		{`var nickname = "([^"]+)"\.html\(false\)\s*\|\|\s*""`, "var nickname(带html方法)"},
+		{`var nickname = "(.*?)";`, "var nickname原始模式"},
+		{`nickname\s*:\s*"([^"]+)"`, "JSON格式nickname"},
+		{`"nickname":"([^"]+)"`, "字符串格式nickname"},
+		{`<title>([^<]+)<\/title>`, "HTML标题"},
 	}
-	// 模式3: JSON格式
-	nicknameRegex3 := regexp.MustCompile(`nickname\s*:\s*"([^"]+)"`)
-	match = nicknameRegex3.FindStringSubmatch(content)
-	if len(match) >= 2 {
-		return match[1], nil
-	}
-
-	// 模式4: 字符串格式
-	nicknameRegex4 := regexp.MustCompile(`"nickname":"([^"]+)"`)
-	match = nicknameRegex4.FindStringSubmatch(content)
-	if len(match) >= 2 {
-		return match[1], nil
-	}
-
-	// 模式5: HTML标题
-	nicknameRegex5 := regexp.MustCompile(`<title>([^<]+)<\/title>`)
-	match = nicknameRegex5.FindStringSubmatch(content)
-	if len(match) >= 2 {
-		// 清理标题,移除"- 微信公众号"等后缀
-		title := match[1]
-		if idx := strings.Index(title, "-"); idx > 0 {
-			title = strings.TrimSpace(title[:idx])
+	for _, p := range patterns {
+		re := regexp.MustCompile(p.pattern)
+		match := re.FindStringSubmatch(content)
+ if len(match) >= 2 { + nickname := match[1] + // 如果是从 HTML title 提取的,需要清理 + if p.desc == "HTML标题" { + // 清理标题,移除"- 微信公众号"等后缀 + if idx := strings.Index(nickname, "-"); idx > 0 { + nickname = strings.TrimSpace(nickname[:idx]) + } + // 如果提取到的是"验证",说明遇到了验证页面 + // 返回更详细的错误信息,包括可能的解决方案 + if nickname == "验证" { + return "", fmt.Errorf("遇到验证页面,Cookie可能已过期\n" + + "解决方案:\n" + + "1. 在浏览器中重新登录微信公众号平台\n" + + "2. 访问目标公众号主页\n" + + "3. 向下滚动加载文章列表\n" + + "4. 在Fiddler中重新抓取包含所有参数的URL") + } + } + // 成功提取,返回结果 + return nickname, nil } - return title, nil + } + + // 如果所有模式都失败,检查是否是验证页面 + if strings.Contains(content, "当前环境异常") || strings.Contains(content, "完成验证后即可继续访问") { + return "", fmt.Errorf("遇到人机验证页面,请在浏览器中完成验证后重新获取Cookie") } // 如果所有模式都失败,尝试从biz生成一个有意义的名称 @@ -151,7 +185,7 @@ func (w *WechatCrawler) GetNextList(offset int) (map[string]interface{}, error) return nil, fmt.Errorf("no session: 需要提供微信登录状态的cookies\n请在浏览器中登录微信公众号平台后,从URL中获取uin、key和pass_ticket参数") } - url := fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=%s&offset=%d&count=10&f=json&uin=%s&key=%s&pass_ticket=%s&appmsg_token=999999999&x5=0&f=json", + url := fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=%s&f=json&offset=%d&count=10&is_ok=1&scene=124&uin=%s&key=%s&pass_ticket=%s&wxtoken=&appmsg_token=&x5=0&f=json", w.biz, offset*10, w.uin, w.key, w.passTicket) resp, err := w.client.R().SetHeader("Referer", fmt.Sprintf("https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=%s&scene=124", w.biz)).Get(url) @@ -208,12 +242,22 @@ func (w *WechatCrawler) GetNextList(offset int) (map[string]interface{}, error) return nil, fmt.Errorf("解析文章列表格式错误") } + // 调试:打印原始 JSON 的前 500 字符 + if len(generalMsgList) > 0 { + preview := generalMsgList + if len(preview) > 500 { + preview = preview[:500] + } + fmt.Printf("\n调试 - general_msg_list 前500字符:\n%s...\n\n", preview) + } + var msgList struct { List []struct { CommMsgInfo struct { ID int64 `json:"id"` Type int `json:"type"` - CreateTime 
int64 `json:"create_time"` + DateTime int64 `json:"datetime"` // 微信使用datetime字段,不是create_time + CreateTime int64 `json:"create_time"` // 保留兼容性 SourceMsgID int64 `json:"source_msg_id"` } `json:"comm_msg_info"` AppMsgExtInfo struct { @@ -241,6 +285,28 @@ func (w *WechatCrawler) GetNextList(offset int) (map[string]interface{}, error) return nil, fmt.Errorf("解析文章列表内容失败: %v", err) } + // 调试:打印第一篇文章的原始数据 + if len(msgList.List) > 0 { + fmt.Printf("\n调试 - 第一篇文章的原始JSON数据:\n") + firstItem := msgList.List[0] + fmt.Printf(" Type: %d\n", firstItem.CommMsgInfo.Type) + fmt.Printf(" DateTime: %d\n", firstItem.CommMsgInfo.DateTime) + fmt.Printf(" CreateTime: %d\n", firstItem.CommMsgInfo.CreateTime) + fmt.Printf(" ID: %d\n", firstItem.CommMsgInfo.ID) + fmt.Printf(" Title: %s\n", firstItem.AppMsgExtInfo.Title) + fmt.Printf(" Author: %s\n", firstItem.AppMsgExtInfo.Author) + + // 显示实际使用的时间戳 + timestamp := firstItem.CommMsgInfo.DateTime + if timestamp == 0 { + timestamp = firstItem.CommMsgInfo.CreateTime + } + if timestamp > 0 { + fmt.Printf(" 实际使用的时间戳: %d (%s)\n", timestamp, time.Unix(timestamp, 0).Format("2006-01-02 15:04:05")) + } + fmt.Println() + } + // 构建返回数据 response := make(map[string]interface{}) response["m_flag"] = 1 @@ -248,8 +314,12 @@ func (w *WechatCrawler) GetNextList(offset int) (map[string]interface{}, error) var passageList [][]string for _, item := range msgList.List { if item.CommMsgInfo.Type == 49 { - // 单图文消息 - createTime := fmt.Sprintf("%d", item.CommMsgInfo.CreateTime) + // 获取时间戳,优先使用DateTime,如果为0则使用CreateTime + timestamp := item.CommMsgInfo.DateTime + if timestamp == 0 { + timestamp = item.CommMsgInfo.CreateTime + } + createTime := fmt.Sprintf("%d", timestamp) title := item.AppMsgExtInfo.Title link := item.AppMsgExtInfo.ContentURL passageList = append(passageList, []string{"", createTime, title, link}) @@ -283,13 +353,14 @@ func (w *WechatCrawler) GetOneArticle(link string) (string, error) { // ExtractOfficialAccountName 从文章内容中提取公众号名称 func (w *WechatCrawler) 
ExtractOfficialAccountName(content string) string {
	accountName := ""
-	// 优先从微信文章特定的字段提取公众号名称
+	// 参考 Python 版本,优先从 var nickname 提取公众号名称
	patterns := []string{
-		`window\.appmsg\s*=\s*\{[^}]*"author"\s*:\s*['"](.*?)['"]`, // window.appmsg.author
-		`var nickname\s*=\s*['"](.*?)['"]`, // nickname变量
-		`"nickname"\s*:\s*['"](.*?)['"]`, // JSON中的nickname字段
-		`var ct\s*=\s*['"](.*?)['"]`, // ct变量(有时用于存储公众号名称)
-		`<meta[^>]*name=["']?author["']?[^>]*content=["'](.*?)["']`, // meta标签中的作者信息
+		`var nickname\s*=\s*['"](.+?)['"]`, // nickname变量(Python版本的主要模式)
+		`var nickname.*"(.*?)"`, // nickname变量备用模式
+		`"nickname"\s*:\s*['"](.+?)['"]`, // JSON中的nickname字段
+		`window\.appmsg\s*=\s*\{[^}]*"author"\s*:\s*['"](.+?)['"]`, // window.appmsg.author
+		`var ct\s*=\s*['"](.+?)['"]`, // ct变量(有时用于存储公众号名称)
+		`<meta[^>]*name=["']?author["']?[^>]*content=["'](.+?)["']`, // meta标签中的作者信息
	}

	for _, pattern := range patterns {
@@ -309,7 +380,11 @@ func (w *WechatCrawler) ExtractOfficialAccountName(content string) string {
				break
			}
		}
-		break
+		// 去除可能存在的空格和特殊字符
+		accountName = strings.TrimSpace(accountName)
+		if accountName != "" {
+			break
+		}
	}
}

@@ -318,38 +393,54 @@
// ExtractArticleInfo 从文章内容中提取关键信息
func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, string, string, string, []string) {
-	// 提取创建时间 - 增强版,增加对ori_create_time的支持
+	// 首先提取公众号名称,用于后续标题验证
+	accountName := w.ExtractOfficialAccountName(content)
+
+	// 提取创建时间 - 参考 Python 版本
	createTime := ""
-	// 模式1: 标准createTime变量
-	createTimeRegex := regexp.MustCompile(`var createTime\s*=\s*['"](\d+)['"]`)
+	// 模式1: 标准createTime变量(Python版本的主要模式)
+	createTimeRegex := regexp.MustCompile(`var createTime = '(.+?)'`)
	if match := createTimeRegex.FindStringSubmatch(content); len(match) > 1 {
		createTime = match[1]
	} else {
-		// 模式2: ori_create_time变量(在之前的文件中发现)
-		oriCreateTimeRegex := regexp.MustCompile(`ori_create_time\s*:\s*['"](\d+)['"]`)
-		if match :=
oriCreateTimeRegex.FindStringSubmatch(content); len(match) > 1 { - createTime = match[1] - } - // 模式3: JSON对象中的create_time字段 - jsonCreateTimeRegex := regexp.MustCompile(`"create_time"\s*:\s*(\d+)`) - if match := jsonCreateTimeRegex.FindStringSubmatch(content); len(match) > 1 { + // 模式2: 双引号格式 + createTimeRegex2 := regexp.MustCompile(`var createTime\s*=\s*['"](.+?)['"]`) + if match := createTimeRegex2.FindStringSubmatch(content); len(match) > 1 { createTime = match[1] + } else { + // 模式3: ori_create_time变量(在之前的文件中发现) + oriCreateTimeRegex := regexp.MustCompile(`ori_create_time\s*:\s*['"](.+?)['"]`) + if match := oriCreateTimeRegex.FindStringSubmatch(content); len(match) > 1 { + createTime = match[1] + } else { + // 模式4: JSON对象中的create_time字段 + jsonCreateTimeRegex := regexp.MustCompile(`"create_time"\s*:\s*(.+?)(?:,|\})`) + if match := jsonCreateTimeRegex.FindStringSubmatch(content); len(match) > 1 { + createTime = match[1] + // 去除引号 + createTime = strings.Trim(createTime, `"'`) + } + } } } - // 提取标题 - 增强版,优化标题提取逻辑,确保正确区分公众号名称和文章标题 + // 提取标题 - 参考 Python 版本,支持单引号和双引号 title := "" - // 优先从微信文章特有的结构提取标题(window.appmsg.title优先级最高) + // 优先级顺序: + // 1. var msg_title - 微信文章真正的标题字段(最高优先级) + // 2. meta 标签中的 og:title 或 twitter:title + // 3. var title - 可能是公众号名称 titlePatterns := []string{ - `window\.appmsg\s*=\s*\{[^}]*"title"\s*:\s*['"](.*?)['"]`, // window.appmsg对象中的title(微信文章标准标题位置) - `var title\s*=\s*['"](.*?)['"]`, // 直接变量赋值 - `"title"\s*:\s*['"](.*?)['"]`, // JSON对象中的title字段 - `window\.title\s*=\s*['"](.*?)['"]`, // window.title赋值 - // 增加JsDecode函数支持(在文件中发现) - `title\s*=\s*JsDecode\(['"](.*?)['"]\)`, // title变量的JsDecode赋值 - `JsDecode\(['"]([^'"]*?title[^'"]*)['"]\)`, // 包含title的JsDecode调用 - // HTML title标签优先级降低,因为可能包含公众号名称 - `<title[^>]*>(.*?)`, + `var msg_title\s*=\s*['"](.+?)['"]`, // msg_title是真正的文章标题! 
+		`<title[^>]*>(.+?)</title>`, // HTML title标签(最低优先级)
	}

	for _, pattern := range titlePatterns {
@@ -369,7 +460,12 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
				break
			}
		}
-		break
+
+		// 验证:如果提取的标题与公众号名称相同,继续尝试下一个模式
+		// 这是因为HTML title标签通常包含公众号名称
+		if title != accountName && title != "" {
+			break
+		}
	}
}

@@ -448,17 +544,23 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
		}
	}

-	// 方法2: 从HTML DOM结构中直接提取(次优先级)
+	// 方法2: 从 HTML DOM 结构中直接提取(次优先级)
	if rawContent == "" {
-		// 2.1 优先查找rich_media_content类的div(微信文章核心内容容器)
-		richMediaClassRegex := regexp.MustCompile(`(?s)<div class="rich_media_content"[^>]*>([\s\S]*?)<\/div>`)
-		if match := richMediaClassRegex.FindStringSubmatch(content); len(match) > 1 {
+		// 2.1 优先查找 id=img-content 的div(微信新版本文章容器)
+		imgContentIdRegex := regexp.MustCompile(`(?s)<div id="img-content"[^>]*>([\s\S]*?)</div>`)
+		if match := imgContentIdRegex.FindStringSubmatch(content); len(match) > 1 {
			rawContent = match[1]
		} else if rawContent == "" {
-			// 2.2 尝试查找id为js_content的元素
-			jsContentIdRegex := regexp.MustCompile(`(?s)<div id="js_content"[^>]*>([\s\S]*?)<\/div>`)
-			if match := jsContentIdRegex.FindStringSubmatch(content); len(match) > 1 {
+			// 2.2 查找rich_media_content类的div(微信文章核心内容容器)
+			richMediaClassRegex := regexp.MustCompile(`(?s)<div class="rich_media_content"[^>]*>([\s\S]*?)</div>`)
+			if match := richMediaClassRegex.FindStringSubmatch(content); len(match) > 1 {
				rawContent = match[1]
+			} else if rawContent == "" {
+				// 2.3 尝试查找id为js_content的元素
+				jsContentIdRegex := regexp.MustCompile(`(?s)<div id="js_content"[^>]*>([\s\S]*?)</div>`)
+				if match := jsContentIdRegex.FindStringSubmatch(content); len(match) > 1 {
+					rawContent = match[1]
+				}
			}
		}
	}
@@ -509,9 +611,11 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri

	// 方法5: 尝试从微信文章特有的段落结构提取
	if rawContent == "" {
-		// 查找带有rich_media_p类的p标签(微信文章特有的段落样式)
-		pTagsRegex := regexp.MustCompile(`(?s)<p class="rich_media_p">([\s\S]*?)<\/p>`)
-		if matches := pTagsRegex.FindAllStringSubmatch(content, -1); len(matches) > 0 {
+		// Python版本使用 BeautifulSoup 的 getText() 方法提取所有文本
+		// 这里我们直接提取所有段落,然后过滤JavaScript
+		// 查找带有data-pm-slice或js_darkmode类的p标签(微信文章特有样式)
+		specialPTagsRegex := regexp.MustCompile(`(?s)<p[^>]*(?:data-pm-slice|js_darkmode)[^>]*>([\s\S]*?)</p>`)
+		if matches := specialPTagsRegex.FindAllStringSubmatch(content, -1); len(matches) > 0 {
			// 如果找到多个p标签,合并它们的内容
			var combinedContent strings.Builder
			for _, match := range matches {
@@ -521,10 +625,11 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
				}
			}
			rawContent = combinedContent.String()
-		} else {
-			// 尝试一般的p标签,这是微信文章的备用段落格式
-			generalPTagsRegex := regexp.MustCompile(`(?s)<p[^>]*>([\s\S]*?)<\/p>`)
-			if matches := generalPTagsRegex.FindAllStringSubmatch(content, -1); len(matches) > 10 { // 至少10个p标签才可能是文章内容
+		} else if rawContent == "" {
+			// 查找带有rich_media_p类的p标签(微信文章特有的段落样式)
+			pTagsRegex := regexp.MustCompile(`(?s)<p class="rich_media_p">([\s\S]*?)</p>`)
+			if matches := pTagsRegex.FindAllStringSubmatch(content, -1); len(matches) > 0 {
+				// 如果找到多个p标签,合并它们的内容
				var combinedContent strings.Builder
				for _, match := range matches {
					if len(match) > 1 {
@@ -533,6 +638,29 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
					}
				}
				rawContent = combinedContent.String()
+			} else {
+				// 尝试一般的p标签,这是微信文章的备用段落格式
+				generalPTagsRegex := regexp.MustCompile(`(?s)<p[^>]*>([\s\S]*?)</p>`)
+				if matches := generalPTagsRegex.FindAllStringSubmatch(content, -1); len(matches) > 10 { // 至少10个p标签才可能是文章内容
+					var combinedContent strings.Builder
+					for _, match := range matches {
+						if len(match) > 1 {
+							// 过滤JavaScript代码:如果段落包含function、var、window等关键词,跳过
+							paragraph := match[1]
+							// 简单过滤:如果段落中包含大量的JavaScript关键词,跳过
+							if !strings.Contains(paragraph, "function") &&
+								!strings.Contains(paragraph, "var ") &&
+								!strings.Contains(paragraph, "window.") &&
+								!strings.Contains(paragraph, ".length") {
+								combinedContent.WriteString(paragraph)
+								combinedContent.WriteString("\n")
+							}
+						}
+					}
+					if combinedContent.Len() > 100 { // 只有当合并后的内容超过100字符才认为有效
+						rawContent = combinedContent.String()
+					}
+				}
			}
		}
	}
@@ -759,7 +887,8 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
	cleanText = regexp.MustCompile(`(?s)\s*document\.writeln\s*\([^)]*\);`).ReplaceAllString(cleanText, "")

	// 如果JavaScript关键词较少且中文密度较高,可能是有效的文章内容
-	if (jsCount < 5 || chineseDensity > 0.3) && len(cleanText) > 50 {
+	// 降低要求:JavaScript关键词 < 10 或 中文密度 > 5% 即认为有效
+	if (jsCount < 10 || chineseDensity > 0.05) && len(cleanText) > 50 {
		// 按句子或段落分割,避免一行过长
		if len(cleanText) > 0 {
			// 首先尝试按段落分割
@@ -775,25 +904,17 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
				}
				// 只添加非空且长度合理的段落(避免添加JavaScript片段)
				paragraph := strings.TrimSpace(paragraphs[i])
-				// 增强过滤条件,避免JavaScript片段,同时考虑中文密度
+				// 降低过滤条件,增强中文密度考虑
				paraDensity := w.calculateChineseDensity(paragraph)
				paraJsCount := w.jsKeywordCount(paragraph)
-				if len(paragraph) > 15 &&
-					!strings.Contains(paragraph, "{") &&
-					!strings.Contains(paragraph, "}") &&
-					!strings.Contains(paragraph, "function") &&
-					!strings.Contains(paragraph, "var") &&
-					!strings.Contains(paragraph, "window.") &&
-					!strings.Contains(paragraph, "WX_BJ_REPORT") &&
-					!strings.Contains(paragraph, "BadJs") &&
-					(paraJsCount < 2 || paraDensity > 0.4) { // 根据中文密度调整JavaScript关键词容忍度
+				if len(paragraph) > 10 && (paraJsCount < 3 || paraDensity > 0.1) {
					textContent = append(textContent, paragraph)
				}
			}
		}
		// 如果没有成功分割成段落,直接添加整个文本
-		if len(textContent) == 0 && len(cleanText) > 50 && (w.jsKeywordCount(cleanText) < 3 || chineseDensity > 0.5) {
+		if len(textContent) == 0 && len(cleanText) > 50 && (w.jsKeywordCount(cleanText) < 5 || chineseDensity > 0.1) {
			textContent = append(textContent, cleanText)
		}
	}
@@ -801,56 +922,97 @@ func (w *WechatCrawler) ExtractArticleInfo(content string) (string, string, stri
	}

	// 最后的备选方案:尝试从整个页面中提取非JavaScript的文本内容
-	if len(textContent) == 0 {
-		// 移除所有HTML标签
-		allText := regexp.MustCompile(`<[^>]*>`).ReplaceAllString(content, "")
+	// 【修改】参考Python版本,直接提取所有文本,然后过滤
+	if len(textContent) < 5 { // 如果提取的段落很少,说明前面的方法都失败了
+		fmt.Printf("  [调试] 前面提取方法只得到%d个段落,尝试简单提取方法\n", len(textContent))

-		// 应用增强的JavaScript代码块过滤
-		allText = w.filterJavaScriptBlocks(allText)
+		// 方法1:优先尝试从 id="js_content" 容器中提取
+		contentRegex := regexp.MustCompile(`(?s)<div[^>]*id=["']js_content["'][^>]*>(.*?)\s*</div>`)
+		if match := contentRegex.FindStringSubmatch(content); len(match) > 1 {
+			fmt.Printf("  [调试] 找到 js_content 容器\n")
+			contentHTML := match[1]

-		// 进一步清理特定模式
-		allText = regexp.MustCompile(`(?s)\s*WX_BJ_REPORT\s*\([^)]*\);`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*BadJs\s*\([^)]*\);`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*window\.logs\s*=\s*\[.*?\];`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*__moon_initcallback\s*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*try\s*{[^}]*}\s*catch\s*\([^)]*\)\s*{[^}]*}\s*`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*function\s+[^(]*\([^)]*\)\s*{[^}]*}\s*`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*var\s+[^=]*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*`).ReplaceAllString(allText, "")
-		allText = regexp.MustCompile(`(?s)\s*\(function\s*\([^)]*\)\s*{[^}]*}\)\s*\(\);`).ReplaceAllString(allText, "")
+			// 移除HTML标签,提取文本
+			tagRegex := regexp.MustCompile(`<[^>]*>`)
+			plainText := tagRegex.ReplaceAllString(contentHTML, "\n")

-		// 使用中文文本提取作为最后手段
-		allText = w.extractChineseText(allText)
+			// 移除HTML实体
+			plainText = strings.ReplaceAll(plainText, "&lt;", "<")
+			plainText = strings.ReplaceAll(plainText, "&gt;", ">")
+			plainText = strings.ReplaceAll(plainText, "&quot;", "\"")
+			plainText = strings.ReplaceAll(plainText, "&amp;", "&")
+			plainText = strings.ReplaceAll(plainText, "&nbsp;", " ")

-		// 清理空白字符
-		spaceRegex := regexp.MustCompile(`\s+`)
-		allText = spaceRegex.ReplaceAllString(allText, " ")
-		allText = strings.TrimSpace(allText)
-
-		// 尝试按句子分割
-		if allText != "" && len(allText) > 100 {
-			sentences := regexp.MustCompile(`[。!?.!?]\s*`).Split(allText, -1)
-			punctuations := regexp.MustCompile(`[。!?.!?]\s*`).FindAllString(allText, -1)
-
-			for i := 0; i < len(sentences); i++ {
-				if sentences[i] != "" {
-					if i < len(punctuations) {
-						sentences[i] += punctuations[i]
-					}
-					paragraph := strings.TrimSpace(sentences[i])
-					// 过滤掉JavaScript代码和过短的内容,同时考虑中文密度
-					if len(paragraph) > 20 && (w.jsKeywordCount(paragraph) < 3 || w.calculateChineseDensity(paragraph) > 0.4) {
-						textContent = append(textContent, paragraph)
-					}
+			// 按行分割,过滤空行
+			lines := strings.Split(plainText, "\n")
+			for _, line := range lines {
+				line = strings.TrimSpace(line)
+				if len(line) > 0 {
+					textContent = append(textContent, line)
				}
			}
+			fmt.Printf("  [调试] 从 js_content 提取到 %d 个段落\n", len(textContent))
+		}
+
+		// 方法2:如果仍然很少,尝试提取所有可见文本
+		if len(textContent) < 10 {
+			fmt.Printf("  [调试] js_content 提取不足,尝试全局提取\n")
+			// 移除script和style标签
+			scriptRegex := regexp.MustCompile(`(?s)<script[^>]*>.*?</script>`)
+			styleRegex := regexp.MustCompile(`(?s)<style[^>]*>.*?</style>`)
+			allText := scriptRegex.ReplaceAllString(content, "")
+			allText = styleRegex.ReplaceAllString(allText, "")
+
+			// 移除所有HTML标签
+			tagRegex := regexp.MustCompile(`<[^>]*>`)
+			allText = tagRegex.ReplaceAllString(allText, "\n")
+
+			// 移除HTML实体
+			allText = strings.ReplaceAll(allText, "&lt;", "<")
+			allText = strings.ReplaceAll(allText, "&gt;", ">")
+			allText = strings.ReplaceAll(allText, "&quot;", "\"")
+			allText = strings.ReplaceAll(allText, "&amp;", "&")
+			allText = strings.ReplaceAll(allText, "&nbsp;", " ")
+
+			// 按行分割,过滤空行和JS代码
+			textContent = []string{} // 重置
+			lines := strings.Split(allText, "\n")
+			for _, line := range lines {
+				line = strings.TrimSpace(line)
+				// 基础过滤:只保留有中文的行,且不是明显JS代码
+				if len(line) > 0 &&
+					!strings.HasPrefix(line, "var ") &&
+					!strings.HasPrefix(line, "function") &&
+					!strings.Contains(line, "window.") &&
+					w.calculateChineseDensity(line) > 0.1 {
+					textContent = append(textContent, line)
+				}
+			}
+			fmt.Printf("  [调试] 全局提取到 %d 个段落\n", len(textContent))
		}
	}

	// 对提取的内容应用最终过滤,确保只保留真正的文章正文
	filteredContent := w.finalContentFilter(textContent)
+
+	// 【调试】输出过滤前后的对比
+	fmt.Printf("  [调试] 过滤前段落数: %d, 过滤后段落数: %d\n", len(textContent), len(filteredContent))
+	if len(filteredContent) == 0 && len(textContent) > 0 {
+		fmt.Printf("  [调试] ⚠️ finalContentFilter 过滤掉了所有内容!\n")
+		fmt.Printf("  [调试] 过滤前第一段示例: %s\n", textContent[0][:min(len(textContent[0]), 200)])
+	}
+
	return createTime, title, commentID, reqID, w.extractAuthor(content), filteredContent
}

+// min 返回两个整数中的最小值(Go 1.21之前需要手动实现)
+func min(a, b int) int {
+	if a < b {
+		return a
+	}
+	return b
+}
+
// calculateChineseDensity 计算文本中中文字符的密度
func (w *WechatCrawler) calculateChineseDensity(text string) float64 {
	if len(text) == 0 {
@@ -921,91 +1083,77 @@ func (w *WechatCrawler) extractChineseText(text string) string {
}

// finalContentFilter 最终内容过滤,确保只保留真正的文章正文
-func (w *WechatCrawler) finalContentFilter(text string) string {
-	// 1.
移除明显的JavaScript代码块 - // 移除WX_BJ_REPORT相关代码 - wxCodeRegex := regexp.MustCompile(`(?s)\s*WX_BJ_REPORT\s*\([^)]*\);|\s*var\s+WX_BJ_REPORT\s*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*|\s*if\s*\(WX_BJ_REPORT\)[^;]*;`) - text = wxCodeRegex.ReplaceAllString(text, "") - - // 移除BadJs相关代码 - badJsRegex := regexp.MustCompile(`(?s)\s*BadJs\s*\([^)]*\);|\s*var\s+BadJs\s*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*|\s*if\s*\(BadJs\)[^;]*;`) - text = badJsRegex.ReplaceAllString(text, "") - - // 移除window.logs相关代码 - logsRegex := regexp.MustCompile(`(?s)\s*window\.logs\s*=\s*\[.*?\];|\s*window\.logs\s*\..*?;`) - text = logsRegex.ReplaceAllString(text, "") - - // 移除函数定义 - funcRegex := regexp.MustCompile(`(?s)\s*function\s+[^(]*\([^)]*\)\s*{[^}]*}\s*|\s*var\s+[^=]*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*|\s*[a-zA-Z_$][a-zA-Z0-9_$]*\s*=\s*function\s*\([^)]*\)\s*{[^}]*}\s*`) - text = funcRegex.ReplaceAllString(text, "") - - // 移除变量声明 - varRegex := regexp.MustCompile(`(?s)\s*var\s+[a-zA-Z_$][a-zA-Z0-9_$]*\s*=\s*{[^}]*}\s*;?|\s*let\s+[a-zA-Z_$][a-zA-Z0-9_$]*\s*=\s*[^;]*;|\s*const\s+[a-zA-Z_$][a-zA-Z0-9_$]*\s*=\s*[^;]*;|\s*window\.[a-zA-Z_$][a-zA-Z0-9_$]*\s*=\s*[^;]*;`) - text = varRegex.ReplaceAllString(text, "") - - // 移除控制流语句 - flowRegex := regexp.MustCompile(`(?s)\s*if\s*\([^)]*\)\s*{[^}]*}\s*|\s*for\s*\([^)]*\)\s*{[^}]*}\s*|\s*while\s*\([^)]*\)\s*{[^}]*}\s*`) - text = flowRegex.ReplaceAllString(text, "") - - // 2. 
提取真正的文章段落 - paragraphs := regexp.MustCompile(`[。!?.!?]\s*`).Split(text, -1) - punctuations := regexp.MustCompile(`[。!?.!?]\s*`).FindAllString(text, -1) - +// 修改:大幅降低过滤门槛,参考Python版本的简单逻辑 +func (w *WechatCrawler) finalContentFilter(textContent []string) []string { var validParagraphs []string - for i := 0; i < len(paragraphs); i++ { - if paragraphs[i] != "" { - paragraph := paragraphs[i] - if i < len(punctuations) { - paragraph += punctuations[i] - } - paragraph = strings.TrimSpace(paragraph) - // 计算段落特征 - paraDensity := w.calculateChineseDensity(paragraph) - paraJsCount := w.jsKeywordCount(paragraph) - chineseCount := 0 - for _, char := range paragraph { - if char >= 0x4e00 && char <= 0x9fa5 { - chineseCount++ - } - } - - // 严格的过滤规则 - if len(paragraph) > 25 && // 足够长的段落 - !strings.Contains(paragraph, "{") && - !strings.Contains(paragraph, "}") && - !strings.Contains(paragraph, "function") && - !strings.Contains(paragraph, "var") && - !strings.Contains(paragraph, "window.") && - !strings.Contains(paragraph, "WX_BJ_REPORT") && - !strings.Contains(paragraph, "BadJs") && - chineseCount > 15 && // 至少15个中文字符 - paraDensity > 0.4 && // 中文密度大于40% - paraJsCount < 3 { // JavaScript关键词少于3个 - validParagraphs = append(validParagraphs, paragraph) + // 【修改】如果提取的段落很少,说明可能是提取阶段的问题,直接返回 + if len(textContent) <= 3 { + fmt.Printf(" [调试] 提取的段落太少(%d个),可能提取逻辑有问题,跳过过滤\n", len(textContent)) + // 简单过滤:只去掉纯标题行和过短的内容 + for _, text := range textContent { + text = strings.TrimSpace(text) + // 去掉明显的JavaScript关键词行 + if len(text) > 5 && + !strings.Contains(text, "function(") && + !strings.Contains(text, "window.") && + !strings.Contains(text, "var ") { + validParagraphs = append(validParagraphs, text) } } + return validParagraphs } - // 3. 
如果没有找到有效的段落,尝试使用更宽松的规则 - if len(validParagraphs) == 0 { - // 直接检查整个文本 - overallDensity := w.calculateChineseDensity(text) - overallJsCount := w.jsKeywordCount(text) - overallChineseCount := 0 + // 【修改】降低过滤标准,参考Python版本 + for _, text := range textContent { + // 基础清理 + text = strings.TrimSpace(text) + + // 计算中文字符数 + chineseCount := 0 for _, char := range text { if char >= 0x4e00 && char <= 0x9fa5 { - overallChineseCount++ + chineseCount++ } } - // 宽松条件:如果中文密度很高且JavaScript关键词较少 - if overallDensity > 0.6 && overallJsCount < 5 && overallChineseCount > 100 { + // 计算中文密度 + paraDensity := w.calculateChineseDensity(text) + paraJsCount := w.jsKeywordCount(text) + + // 【大幅降低门槛】: + // - 长度 > 10(原来25) + // - 中文字符 > 3(原来15) + // - 中文密度 > 0.15(原来0.4) + // - JavaScript关键词 < 5(原来3) + if len(text) > 10 && + !strings.Contains(text, "function(") && + !strings.Contains(text, "window.") && + !strings.Contains(text, "WX_BJ_REPORT") && + !strings.Contains(text, "BadJs") && + chineseCount > 3 && + paraDensity > 0.15 && + paraJsCount < 5 { validParagraphs = append(validParagraphs, text) } } - return strings.Join(validParagraphs, "\n\n") + // 【新增】如果过滤后还是空的,使用最宽松的规则 + if len(validParagraphs) == 0 && len(textContent) > 0 { + fmt.Printf(" [调试] 标准过滤后仍为空,使用最宽松规则\n") + for _, text := range textContent { + text = strings.TrimSpace(text) + // 只要有中文字符且不是明显的JS代码就保留 + overallDensity := w.calculateChineseDensity(text) + overallJsCount := w.jsKeywordCount(text) + + if len(text) > 5 && overallDensity > 0.1 && overallJsCount < 10 { + validParagraphs = append(validParagraphs, text) + } + } + } + + return validParagraphs } // jsKeywordCount 计算文本中JavaScript关键词的数量 - 增强版 @@ -1270,8 +1418,16 @@ func (w *WechatCrawler) GetArticleList() ([][]string, error) { } // 检查是否还有更多文章 - mFlag, ok := result["m_flag"].(float64) - if !ok || mFlag == 0 { + mFlag, ok := result["m_flag"].(int) + if !ok { + // 尝试转换为float64(JSON反序列化可能将数字解析为float64) + if mFlagFloat, ok := result["m_flag"].(float64); ok { + mFlag = int(mFlagFloat) + } 
else { + mFlag = 0 + } + } + if mFlag == 0 { break } @@ -1309,12 +1465,72 @@ func (w *WechatCrawler) SaveArticleListToExcel(officialPath string, articleList filePath := fmt.Sprintf("%s/文章列表(article_list)_直连链接.txt", officialPath) var content strings.Builder + // 添加 UTF-8 BOM 头,确保 Excel 正确识别编码 + content.WriteString("\xEF\xBB\xBF") + // 写入标题行 content.WriteString("序号,创建时间,标题,链接\n") // 写入文章列表 for i, article := range articleList { - content.WriteString(fmt.Sprintf("%d,%s,%s,%s\n", i+1, article[1], article[2], article[3])) + if len(article) < 4 { + continue // 跳过不完整的数据 + } + + // 转换时间戳为可读格式(如果是时间戳) + createTime := article[1] + + // 调试输出:查看原始时间戳 + if i == 0 { // 只打印第一篇文章,避免输出过多 + fmt.Printf("调试信息 - 第1篇文章\n") + fmt.Printf(" article[0]: '%s'\n", article[0]) + fmt.Printf(" article[1] (时间戳): '%s'\n", article[1]) + fmt.Printf(" article[2] (标题): '%s'\n", article[2]) + fmt.Printf(" 时间戳长度: %d\n", len(article[1])) + } + + if createTime != "" && createTime != "0" { + // 尝试将字符串转换为时间戳 + var ts int64 + n, err := fmt.Sscanf(createTime, "%d", &ts) + if i == 0 { + fmt.Printf(" Sscanf 结果: n=%d, err=%v, ts=%d\n", n, err, ts) + } + if err == nil && n == 1 && ts > 0 { + // 转换为可读的日期时间格式 + createTime = time.Unix(ts, 0).Format("2006-01-02 15:04:05") + if i == 0 { + fmt.Printf(" 转换后的时间: %s\n", createTime) + } + } else { + // 如果转换失败,保留原始值 + if i == 0 { + fmt.Printf(" 转换失败,保留原始值: %s\n", createTime) + } + } + } else { + if i == 0 { + fmt.Printf(" 时间戳为空或为0,设置为'未知时间'\n") + } + createTime = "未知时间" + } + + // 清理和转义标题(移除换行符、制表符等) + title := strings.TrimSpace(article[2]) + title = strings.ReplaceAll(title, "\n", " ") + title = strings.ReplaceAll(title, "\r", " ") + title = strings.ReplaceAll(title, "\t", " ") + + // 如果标题包含逗号或引号,需要用双引号包裹并转义内部引号 + if strings.Contains(title, ",") || strings.Contains(title, "\"") || strings.Contains(title, "\n") { + title = "\"" + strings.ReplaceAll(title, "\"", "\"\"") + "\"" + } + + // 清理链接 + link := strings.TrimSpace(article[3]) + + // 写入CSV行 + 
content.WriteString(fmt.Sprintf("%d,%s,%s,%s\n", i+1, createTime, title, link)) } // 写入文件 @@ -1324,6 +1540,7 @@ func (w *WechatCrawler) SaveArticleListToExcel(officialPath string, articleList } fmt.Printf("文章列表已保存到: %s\n", filePath) + fmt.Printf("共保存 %d 篇文章\n", len(articleList)) return nil } @@ -1357,9 +1574,27 @@ func (w *WechatCrawler) GetArticleDetail(link string) (*ArticleDetail, error) { return nil, err } + // 【调试】保存原始HTML到文件,用于分析内容提取问题 + debugPath := "./debug_article_raw.html" + if err := os.WriteFile(debugPath, []byte(content), 0644); err == nil { + fmt.Printf(" [调试] 原始HTML已保存: %s (长度: %d 字节)\n", debugPath, len(content)) + } + // 提取文章信息 createTime, title, commentID, reqID, _, textContent := w.ExtractArticleInfo(content) + // 【调试】输出内容提取详情 + fmt.Printf(" [调试] 提取结果 - 标题: %s, 段落数: %d\n", title, len(textContent)) + if len(textContent) > 0 { + firstPara := textContent[0] + if len(firstPara) > 100 { + firstPara = firstPara[:100] + "..." + } + fmt.Printf(" [调试] 第一段: %s\n", firstPara) + } else { + fmt.Printf(" [调试] ⚠️ ExtractArticleInfo 未提取到任何内容!\n") + } + // 提取公众号名称 accountName := w.ExtractOfficialAccountName(content) @@ -1428,8 +1663,19 @@ func (w *WechatCrawler) GetDetailList(articleList [][]string, officialPath strin continue } - // 保存文章详情 - 确保使用文章标题作为文件名 - filePath := fmt.Sprintf("%s/%s_文章详情.txt", officialPath, detail.Title) + // 保存文章详情 - 确保使用文章标题作为文件名,并清理非法字符 + // 清理标题中的非法字符 + cleanTitle := detail.Title + invalidChars := []string{"\\", "/", ":", "*", "?", "\"", "<", ">", "|"} + for _, char := range invalidChars { + cleanTitle = strings.ReplaceAll(cleanTitle, char, "_") + } + // 限制文件名长度,避免路径过长 + if len(cleanTitle) > 100 { + cleanTitle = cleanTitle[:100] + } + + filePath := fmt.Sprintf("%s/%s_文章详情.txt", officialPath, cleanTitle) if err := w.SaveArticleDetailToExcel(detail, filePath); err != nil { fmt.Printf("保存文章详情失败: %v\n", err) errorCount++ @@ -1465,9 +1711,18 @@ func (w *WechatCrawler) GetDetailList(articleList [][]string, officialPath strin // 
SaveArticleDetailToExcel 保存文章详情到Excel func (c *WechatCrawler) SaveArticleDetailToExcel(article *ArticleDetail, filePath string) error { - // 简化实现,保存为文本文件 + // 【修复】不要清理整个路径!只需要确保目录存在即可 + // filePath 已经在调用处清理过了文件名部分 + // 这里直接使用即可 + var content strings.Builder + // 添加 UTF-8 BOM 头,确保正确显示中文 + content.WriteString("\xEF\xBB\xBF") + + content.WriteString("=") + content.WriteString(strings.Repeat("=", 80)) + content.WriteString("\n") content.WriteString(fmt.Sprintf("本地创建时间: %s\n", article.LocalTime)) content.WriteString(fmt.Sprintf("文章发布时间: %s\n", article.CreateTime)) content.WriteString(fmt.Sprintf("公众号名称: %s\n", article.OfficialName)) @@ -1477,15 +1732,57 @@ func (c *WechatCrawler) SaveArticleDetailToExcel(article *ArticleDetail, filePat content.WriteString(fmt.Sprintf("点赞数: %s\n", article.LikeCount)) content.WriteString(fmt.Sprintf("转发数: %s\n", article.ShareCount)) content.WriteString(fmt.Sprintf("在看数: %s\n", article.ShowRead)) - content.WriteString("\n文章内容:\n") + content.WriteString(strings.Repeat("=", 80)) + content.WriteString("\n\n") - for _, line := range article.Content { - content.WriteString(line) - content.WriteString("\n") + content.WriteString("文章内容:\n") + content.WriteString(strings.Repeat("-", 80)) + content.WriteString("\n") + + for i, line := range article.Content { + // 清理内容,移除多余的空白字符 + cleanLine := strings.TrimSpace(line) + if cleanLine != "" { + content.WriteString(cleanLine) + content.WriteString("\n") + + // 每个段落后添加空行,提高可读性 + if i < len(article.Content)-1 { + content.WriteString("\n") + } + } } + // 如果有评论,添加评论区 + if len(article.Comments) > 0 { + content.WriteString("\n") + content.WriteString(strings.Repeat("=", 80)) + content.WriteString("\n") + content.WriteString(fmt.Sprintf("评论区 (共 %d 条评论):\n", len(article.Comments))) + content.WriteString(strings.Repeat("-", 80)) + content.WriteString("\n\n") + + for i, comment := range article.Comments { + content.WriteString(fmt.Sprintf("%d. 
%s", i+1, comment)) + if i < len(article.CommentLikes) && article.CommentLikes[i] != "" { + content.WriteString(fmt.Sprintf(" (点赞: %s)", article.CommentLikes[i])) + } + content.WriteString("\n\n") + } + } + + content.WriteString("\n") + content.WriteString(strings.Repeat("=", 80)) + content.WriteString("\n") + content.WriteString("文件结束\n") + // 写入文件 - return os.WriteFile(filePath, []byte(content.String()), 0644) + err := os.WriteFile(filePath, []byte(content.String()), 0644) + if err != nil { + return fmt.Errorf("保存文章详情失败: %v", err) + } + + return nil } // GetListArticleFromFile 根据公众号名称或文章链接,从文件中读取文章列表并下载内容 @@ -1495,14 +1792,14 @@ func (w *WechatCrawler) GetListArticleFromFile(nameLink string, imgSaveFlag bool if strings.Contains(nameLink, "http") { fmt.Println("检测到输入为链接,开始获取公众号名称") // 从文章链接获取公众号信息 - _, err := w.GetOfficialAccountLinkFromArticle(nameLink) + content, err := w.GetOneArticle(nameLink) if err != nil { - return fmt.Errorf("获取公众号信息失败: %v", err) + return fmt.Errorf("获取文章内容失败: %v", err) } - // 获取公众号名称 - nickname, err = w.GetOfficialAccountName() - if err != nil { - return fmt.Errorf("获取公众号名称失败: %v", err) + // 从内容中提取公众号名称 + nickname = w.ExtractOfficialAccountName(content) + if nickname == "" { + return fmt.Errorf("无法从文章中提取公众号名称") } fmt.Printf("获取到公众号名称: %s\n", nickname) } else { @@ -1512,8 +1809,9 @@ func (w *WechatCrawler) GetListArticleFromFile(nameLink string, imgSaveFlag bool // 2. 构建文件路径 rootPath := "./data/" - officialNamesHead := "公众号----" - officialPath := rootPath + officialNamesHead + nickname + officialPath := rootPath + nickname + // 【新增】创建"文章详细"子目录 + articleDetailPath := officialPath + "/文章详细" articleListPath := officialPath + "/文章列表(article_list)_直连链接.txt" // 3. 
检查文件是否存在
@@ -1529,47 +1827,140 @@ func (w *WechatCrawler) GetListArticleFromFile(nameLink string, imgSaveFlag bool
	lines := strings.Split(string(fileContent), "\n")
	var articleLinks []string
+	var articleTitles []string
+	var articleTimes []string

-	// 跳过标题行,提取链接
+	// 跳过BOM头和标题行,提取链接
	for i, line := range lines {
		if i == 0 || line == "" {
			continue
		}
-		parts := strings.Split(line, ",")
+		// 移除可能的BOM头
+		line = strings.TrimPrefix(line, "\xEF\xBB\xBF")
+		line = strings.TrimSpace(line)
+		if line == "" {
+			continue
+		}
+
+		// 解析CSV行(处理带引号的字段)
+		var parts []string
+		inQuote := false
+		currentPart := ""
+		for _, char := range line {
+			if char == '"' {
+				inQuote = !inQuote
+			} else if char == ',' && !inQuote {
+				parts = append(parts, currentPart)
+				currentPart = ""
+			} else {
+				currentPart += string(char)
+			}
+		}
+		parts = append(parts, currentPart) // 添加最后一个字段
+
		if len(parts) >= 4 {
-			link := parts[3]
-			// 清理链接中的引号
-			link = strings.TrimSpace(link)
+			// 序号,创建时间,标题,链接
+			timeStr := strings.TrimSpace(parts[1]) // 命名为timeStr,避免遮蔽标准库time包
+			title := strings.TrimSpace(parts[2])
+			link := strings.TrimSpace(parts[3])
+			// 清理引号
			link = strings.Trim(link, "\"")
-			articleLinks = append(articleLinks, link)
+			title = strings.Trim(title, "\"")
+
+			if link != "" && link != "链接" { // 跳过标题行
+				articleLinks = append(articleLinks, link)
+				articleTitles = append(articleTitles, title)
+				articleTimes = append(articleTimes, timeStr)
+			}
		}
	}

-	fmt.Printf("成功读取到%d篇文章链接\n", len(articleLinks))
+	fmt.Printf("成功读取到 %d 篇文章链接\n", len(articleLinks))
+	if len(articleLinks) == 0 {
+		return fmt.Errorf("未能从文件中提取到有效的文章链接")
+	}

	// 5.
遍历下载每篇文章 successCount := 0 errorCount := 0 + errorLinks := [][]string{} // 保存失败的文章信息 + + // 【新增】确保"文章详细"目录存在 + if err := os.MkdirAll(articleDetailPath, 0755); err != nil { + return fmt.Errorf("创建文章详细目录失败: %v", err) + } + fmt.Printf("文章详细将保存到: %s\n", articleDetailPath) for i, link := range articleLinks { - fmt.Printf("正在处理第%d篇文章,链接: %s\n", i+1, link) + title := "" + if i < len(articleTitles) { + title = articleTitles[i] + } + creatTime := "" + if i < len(articleTimes) { + creatTime = articleTimes[i] + } + + fmt.Printf("\n正在处理第 %d/%d 篇文章\n", i+1, len(articleLinks)) + fmt.Printf("标题: %s\n", title) + fmt.Printf("链接: %s\n", link) // 获取文章详情 detail, err := w.GetArticleDetail(link) if err != nil { - fmt.Printf("获取文章详情失败: %v\n", err) + fmt.Printf("❌ 获取文章详情失败: %v\n", err) errorCount++ + // 记录失败的文章 + errorLinks = append(errorLinks, []string{ + fmt.Sprintf("%d", i+1), + creatTime, + title, + link, + }) continue } // 保存文章内容 if contentSaveFlag { - filePath := fmt.Sprintf("%s/%s_文章详情.txt", officialPath, detail.Title) + // 清理标题中的非法字符 + cleanTitle := detail.Title + invalidChars := []string{"\\", "/", ":", "*", "?", "\"", "<", ">", "|"} + for _, char := range invalidChars { + cleanTitle = strings.ReplaceAll(cleanTitle, char, "_") + } + // 限制文件名长度 + if len(cleanTitle) > 100 { + cleanTitle = cleanTitle[:100] + } + + // 【修改】生成文件路径,保存到"文章详细"子目录中 + filePath := fmt.Sprintf("%s/%s_文章详情.txt", articleDetailPath, cleanTitle) + + // 调试:打印文件保存路径和内容长度 + fmt.Printf(" 保存路径: %s\n", filePath) + fmt.Printf(" 内容段落数: %d\n", len(detail.Content)) + if len(detail.Content) > 0 { + previewLen := 50 + if len(detail.Content[0]) < previewLen { + previewLen = len(detail.Content[0]) + } + fmt.Printf(" 第一段内容预览: %s...\n", detail.Content[0][:previewLen]) + } else { + fmt.Printf(" ⚠️ 警告:文章内容为空!\n") + } + if err := w.SaveArticleDetailToExcel(detail, filePath); err != nil { - fmt.Printf("保存文章详情失败: %v\n", err) + fmt.Printf("❌ 保存文章详情失败: %v\n", err) errorCount++ + errorLinks = append(errorLinks, []string{ + 
fmt.Sprintf("%d", i+1), + creatTime, + title, + link, + }) continue } + fmt.Printf("✅ 文章保存成功: %s\n", detail.Title) } // TODO: 保存图片功能(如果需要) @@ -1578,12 +1969,44 @@ func (w *WechatCrawler) GetListArticleFromFile(nameLink string, imgSaveFlag bool } successCount++ - fmt.Printf("第%d篇文章处理成功: %s\n", i+1, detail.Title) // 添加延迟,避免被封 - time.Sleep(3 * time.Second) + if i < len(articleLinks)-1 { // 不是最后一篇 + delayTime := 3 + i/10 // 基础延迟3秒,每10篇增加1秒 + fmt.Printf("为预防被封禁,延时 %d 秒...\n", delayTime) + time.Sleep(time.Duration(delayTime) * time.Second) + } } - fmt.Printf("文章列表处理完成: 成功%d篇, 失败%d篇\n", successCount, errorCount) + // 6. 保存失败的文章链接 + if len(errorLinks) > 0 { + errorPath := officialPath + "/问题链接(error_links).txt" + var errorContent strings.Builder + // 添加 BOM 头 + errorContent.WriteString("\xEF\xBB\xBF") + errorContent.WriteString("序号,创建时间,标题,链接\n") + for _, errorLink := range errorLinks { + // 处理标题中的逗号和引号 + title := errorLink[2] + if strings.Contains(title, ",") || strings.Contains(title, "\"") { + title = "\"" + strings.ReplaceAll(title, "\"", "\"\"") + "\"" + } + errorContent.WriteString(fmt.Sprintf("%s,%s,%s,%s\n", + errorLink[0], errorLink[1], title, errorLink[3])) + } + err := os.WriteFile(errorPath, []byte(errorContent.String()), 0644) + if err != nil { + fmt.Printf("⚠️ 保存错误链接失败: %v\n", err) + } else { + fmt.Printf("\n已保存失败的文章链接到: %s\n", errorPath) + } + } + + fmt.Printf("\n" + strings.Repeat("=", 60) + "\n") + fmt.Printf("文章列表处理完成!\n") + fmt.Printf(" 成功: %d 篇\n", successCount) + fmt.Printf(" 失败: %d 篇\n", errorCount) + fmt.Printf(" 总计: %d 篇\n", len(articleLinks)) + fmt.Printf(strings.Repeat("=", 60) + "\n") return nil } diff --git a/backend/run.bat b/backend/run.bat deleted file mode 100644 index 6715ae3..0000000 --- a/backend/run.bat +++ /dev/null @@ -1,48 +0,0 @@ -@echo off - -echo WeChat Public Article Crawler Startup Script -echo ================================= - -REM Check if cookie.txt file exists -if not exist "cookie.txt" ( - echo Error: cookie.txt file not 
found! - echo Please create cookie.txt file in backend directory and add WeChat public platform cookie information. - echo. - echo cookie.txt format example: - echo __biz=xxx; uin=xxx; key=xxx; pass_ticket=xxx; - echo. - pause - exit /b 1 -) - -REM Set Go environment variables (if needed) -REM set GOPATH=%USERPROFILE%\go -REM set GOROOT=C:\Go -REM set PATH=%PATH%;%GOROOT%\bin;%GOPATH%\bin - -echo Downloading dependencies... -go mod tidy -if %errorlevel% neq 0 ( - echo Failed to download dependencies! - pause - exit /b 1 -) - -echo Compiling program... -go build -o output\wechat-crawler.exe cmd\main.go -if %errorlevel% neq 0 ( - echo Compilation failed! - pause - exit /b 1 -) - -echo Compilation successful! Starting program... -echo. - -REM Ensure data directory exists -if not exist "data" mkdir data - -REM Run the program -output\wechat-crawler.exe - -pause \ No newline at end of file diff --git a/backend/run_article_link.bat b/backend/run_article_link.bat deleted file mode 100644 index b356bf5..0000000 --- a/backend/run_article_link.bat +++ /dev/null @@ -1,57 +0,0 @@ -@echo off - -rem WeChat Official Account Article Crawler - Script for crawling via article link -setlocal enabledelayedexpansion - -REM 检查是否有命令行参数传入 -if "%1" neq "" ( - REM 如果有参数,直接将其作为文章链接传入程序 - echo. - echo Compiling and running... - go run "cmd/main.go" "%1" - - if errorlevel 1 ( - echo. - echo Failed to run, please check error messages above - pause - exit /b 1 - ) - - echo. - echo Crawling completed successfully! - pause - exit /b 0 -) else ( - REM 如果没有参数,运行交互式模式 - :input_loop - cls - echo ======================================== - echo WeChat Official Account Article Crawler - echo ======================================== - echo. - echo Please enter WeChat article link: - echo Example: https://mp.weixin.qq.com/s/4r_LKJu0mOeUc70ZZXK9LA - set /p ARTICLE_LINK= - - if "%ARTICLE_LINK%"=="" ( - echo. - echo Error: Article link cannot be empty! - pause - goto input_loop - ) - - echo. 
- echo Compiling and running... - go run "cmd/main.go" "%ARTICLE_LINK%" - - if errorlevel 1 ( - echo. - echo Failed to run, please check error messages above - pause - exit /b 1 - ) - - echo. - echo Crawling completed successfully! - pause -) \ No newline at end of file diff --git a/backend/tools/view_db.bat b/backend/tools/view_db.bat new file mode 100644 index 0000000..482bed2 --- /dev/null +++ b/backend/tools/view_db.bat @@ -0,0 +1,21 @@ +@echo off +chcp 65001 >nul +cls + +echo =============================================== +echo 📊 数据库内容查看工具 +echo =============================================== +echo. + +cd /d "%~dp0" + +echo 正在查询数据库... +echo. + +go run view_db.go + +echo. +echo =============================================== +echo 查询完成! +echo =============================================== +pause diff --git a/backend/tools/view_db.go b/backend/tools/view_db.go new file mode 100644 index 0000000..d0e4f19 --- /dev/null +++ b/backend/tools/view_db.go @@ -0,0 +1,231 @@ +package main + +import ( + "database/sql" + "encoding/json" + "fmt" + "log" + + _ "modernc.org/sqlite" +) + +func main() { + // 打开数据库 + db, err := sql.Open("sqlite", "../../data/wechat_articles.db") + if err != nil { + log.Fatal("打开数据库失败:", err) + } + defer db.Close() + + fmt.Println("=" + repeatStr("=", 80)) + fmt.Println("📊 微信公众号文章数据库内容查看") + fmt.Println("=" + repeatStr("=", 80)) + + // 查询公众号 + fmt.Println("\n📢 【公众号列表】") + fmt.Println(repeatStr("-", 80)) + queryOfficialAccounts(db) + + // 查询文章 + fmt.Println("\n📝 【文章列表】") + fmt.Println(repeatStr("-", 80)) + queryArticles(db) + + // 查询文章内容 + fmt.Println("\n📄 【文章详细内容】") + fmt.Println(repeatStr("-", 80)) + queryArticleContents(db) + + fmt.Println("\n" + repeatStr("=", 80)) +} + +func queryOfficialAccounts(db *sql.DB) { + rows, err := db.Query(` + SELECT id, biz, nickname, homepage, description, created_at, updated_at + FROM official_accounts + ORDER BY id + `) + if err != nil { + log.Printf("查询公众号失败: %v\n", err) + return + } + defer rows.Close() + + 
count := 0 + for rows.Next() { + var id int + var biz, nickname, homepage, description, createdAt, updatedAt string + err := rows.Scan(&id, &biz, &nickname, &homepage, &description, &createdAt, &updatedAt) + if err != nil { + log.Printf("读取数据失败: %v\n", err) + continue + } + count++ + + fmt.Printf("\n🔹 公众号 #%d\n", id) + fmt.Printf(" 名称: %s\n", nickname) + fmt.Printf(" BIZ: %s\n", biz) + fmt.Printf(" 主页: %s\n", homepage) + fmt.Printf(" 简介: %s\n", description) + fmt.Printf(" 创建时间: %s\n", createdAt) + fmt.Printf(" 更新时间: %s\n", updatedAt) + } + + if count == 0 { + fmt.Println(" 暂无数据") + } else { + fmt.Printf("\n总计: %d 个公众号\n", count) + } +} + +func queryArticles(db *sql.DB) { + rows, err := db.Query(` + SELECT a.id, a.official_id, a.title, a.author, a.link, a.publish_time, + a.read_num, a.like_num, a.share_num, a.paragraph_count, + a.content_preview, a.created_at, oa.nickname + FROM articles a + LEFT JOIN official_accounts oa ON a.official_id = oa.id + ORDER BY a.id + `) + if err != nil { + log.Printf("查询文章失败: %v\n", err) + return + } + defer rows.Close() + + count := 0 + for rows.Next() { + var id, officialID, readNum, likeNum, shareNum, paragraphCount int + var title, author, link, publishTime, contentPreview, createdAt, officialName sql.NullString + err := rows.Scan(&id, &officialID, &title, &author, &link, &publishTime, + &readNum, &likeNum, &shareNum, ¶graphCount, &contentPreview, &createdAt, &officialName) + if err != nil { + log.Printf("读取数据失败: %v\n", err) + continue + } + count++ + + fmt.Printf("\n🔹 文章 #%d\n", id) + fmt.Printf(" 标题: %s\n", getStringValue(title)) + if officialName.Valid { + fmt.Printf(" 公众号: %s\n", officialName.String) + } + fmt.Printf(" 作者: %s\n", getStringValue(author)) + fmt.Printf(" 链接: %s\n", getStringValue(link)) + fmt.Printf(" 发布时间: %s\n", getStringValue(publishTime)) + fmt.Printf(" 阅读数: %d | 点赞数: %d | 分享数: %d\n", readNum, likeNum, shareNum) + fmt.Printf(" 段落数: %d\n", paragraphCount) + if contentPreview.Valid && contentPreview.String != "" 
{ + preview := contentPreview.String + if len(preview) > 100 { + preview = preview[:100] + "..." + } + fmt.Printf(" 内容预览: %s\n", preview) + } + fmt.Printf(" 抓取时间: %s\n", getStringValue(createdAt)) + } + + if count == 0 { + fmt.Println(" 暂无数据") + } else { + fmt.Printf("\n总计: %d 篇文章\n", count) + } +} + +func queryArticleContents(db *sql.DB) { + rows, err := db.Query(` + SELECT ac.id, ac.article_id, ac.html_content, ac.text_content, + ac.paragraphs, ac.images, ac.created_at, a.title + FROM article_contents ac + LEFT JOIN articles a ON ac.article_id = a.id + ORDER BY ac.id + `) + if err != nil { + log.Printf("查询文章内容失败: %v\n", err) + return + } + defer rows.Close() + + count := 0 + for rows.Next() { + var id, articleID int + var htmlContent, textContent, paragraphs, images, createdAt, title sql.NullString + err := rows.Scan(&id, &articleID, &htmlContent, &textContent, + ¶graphs, &images, &createdAt, &title) + if err != nil { + log.Printf("读取数据失败: %v\n", err) + continue + } + count++ + + fmt.Printf("\n🔹 内容 #%d (文章ID: %d)\n", id, articleID) + if title.Valid { + fmt.Printf(" 文章标题: %s\n", title.String) + } + + // HTML内容长度 + htmlLen := 0 + if htmlContent.Valid { + htmlLen = len(htmlContent.String) + } + fmt.Printf(" HTML内容长度: %d 字符\n", htmlLen) + + // 文本内容 + if textContent.Valid && textContent.String != "" { + text := textContent.String + if len(text) > 200 { + text = text[:200] + "..." + } + fmt.Printf(" 文本内容: %s\n", text) + } + + // 段落信息 + if paragraphs.Valid && paragraphs.String != "" { + var paragraphList []interface{} + if err := json.Unmarshal([]byte(paragraphs.String), ¶graphList); err == nil { + fmt.Printf(" 段落数量: %d\n", len(paragraphList)) + } + } + + // 图片信息 + if images.Valid && images.String != "" { + var imageList []interface{} + if err := json.Unmarshal([]byte(images.String), &imageList); err == nil { + fmt.Printf(" 图片数量: %d\n", len(imageList)) + if len(imageList) > 0 { + fmt.Printf(" 图片URL:\n") + for i, img := range imageList { + if i >= 3 { + fmt.Printf(" ... 
还有 %d 张图片\n", len(imageList)-3) + break + } + fmt.Printf(" %d. %v\n", i+1, img) + } + } + } + } + + fmt.Printf(" 存储时间: %s\n", getStringValue(createdAt)) + } + + if count == 0 { + fmt.Println(" 暂无数据") + } else { + fmt.Printf("\n总计: %d 条详细内容\n", count) + } +} + +func getStringValue(s sql.NullString) string { + if s.Valid { + return s.String + } + return "" +} + +func repeatStr(s string, n int) string { + result := "" + for i := 0; i < n; i++ { + result += s + } + return result +} diff --git a/启动Web系统.bat b/启动Web系统.bat new file mode 100644 index 0000000..f4b7eec --- /dev/null +++ b/启动Web系统.bat @@ -0,0 +1,49 @@ +@echo off +chcp 65001 >nul +cls + +echo =============================================== +echo 🚀 微信公众号文章爬虫 - Web系统启动器 +echo =============================================== +echo. +echo 正在启动系统,请稍候... +echo. + +:: 启动API服务器(后台运行) +echo [1/2] 启动 API 服务器... +cd backend\api +start "微信爬虫-API服务器" cmd /c "start_api.bat" +cd ..\.. +timeout /t 2 /nobreak >nul + +:: 启动前端服务器 +echo [2/2] 启动 前端服务器... +cd frontend +start "微信爬虫-前端服务器" cmd /c "start_web.bat" +cd .. + +echo. +echo =============================================== +echo ✅ 系统启动完成! +echo =============================================== +echo. +echo 📝 重要提示: +echo. +echo 1️⃣ API服务器: http://localhost:8080 +echo - 提供后端接口服务 +echo - 窗口标题: "微信爬虫-API服务器" +echo. +echo 2️⃣ 前端界面: http://localhost:8000 +echo - Web操作界面 +echo - 窗口标题: "微信爬虫-前端服务器" +echo. +echo ⚠️ 请不要关闭这两个窗口! +echo. +echo 💡 使用说明: +echo - 浏览器会自动打开前端界面 +echo - 如未自动打开,请手动访问 http://localhost:8000 +echo - 使用完毕后,关闭两个服务器窗口即可 +echo. +echo =============================================== + +pause
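
A side note on the hand-rolled CSV parsing added to `GetListArticleFromFile` in the diff above: Go's standard `encoding/csv` reader already handles quoted fields, embedded commas, and doubled-quote escapes (`""`), which the character-by-character loop does not. Below is a minimal sketch under the same 序号,创建时间,标题,链接 column layout; the `Article` struct and the `parseArticleCSV` name are illustrative, not part of the codebase:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Article mirrors the fields extracted in GetListArticleFromFile:
// 创建时间 (create time), 标题 (title), 链接 (link).
type Article struct {
	CreateTime string
	Title      string
	Link       string
}

// parseArticleCSV parses the exported article list with encoding/csv,
// which handles quoted fields, embedded commas, and "" escapes.
func parseArticleCSV(data string) ([]Article, error) {
	// Strip a UTF-8 BOM if present, as the diff above does by hand.
	data = strings.TrimPrefix(data, "\xEF\xBB\xBF")
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = -1 // tolerate rows with a varying number of fields
	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	var articles []Article
	for i, rec := range records {
		if i == 0 || len(rec) < 4 { // skip the header row and short rows
			continue
		}
		link := strings.TrimSpace(rec[3])
		if link == "" || link == "链接" { // skip a repeated header row
			continue
		}
		articles = append(articles, Article{
			CreateTime: strings.TrimSpace(rec[1]),
			Title:      strings.TrimSpace(rec[2]),
			Link:       link,
		})
	}
	return articles, nil
}

func main() {
	csvData := "\xEF\xBB\xBF序号,创建时间,标题,链接\n" +
		"1,2024-01-01,\"标题, 含逗号\",https://mp.weixin.qq.com/s/abc\n"
	articles, err := parseArticleCSV(csvData)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(articles), articles[0].Title) // prints: 1 标题, 含逗号
}
```

Compared with the manual loop, the only extra care needed is stripping the UTF-8 BOM before handing the data to `csv.NewReader`, exactly as the diff does with `strings.TrimPrefix`.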