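Reading a sitemap with Python and pushing its URLs through the Baidu API
SEO matters a great deal when promoting a website. Most search engines offer APIs that let site owners submit URLs proactively, so that pages get indexed sooner.
Baidu provides an API for fast indexing. The Python script below reads a sitemap.xml file from the local disk and calls that API to submit the URLs to Baidu.
Only the following parameters need to be changed:
- lastUpdateTimeStr - the time of the previous push. It is compared with the timestamps in sitemap.xml, and only URLs updated at or after this time are pushed
- siteMapPath - the path of sitemap.xml on the local disk
- siteUrl - the site address
- baiduApiToken - the token for the Baidu API
- tmpFile - where the temporary file is saved
- ignorePathPrefixes - prefixes of URLs that should be skipped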
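To make the roles of lastUpdateTimeStr and ignorePathPrefixes concrete, here is a minimal sketch of the selection rule on an in-memory sitemap. It is not the script itself: the function name `select_urls`, the sample document, and the `/a/` and `/b/` paths are made up for illustration, and it looks elements up through the standard sitemap namespace rather than stripping namespaces. It assumes Python 3.7 or later, where `%z` accepts offsets written as `+08:00`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
SITEMAP_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# Hypothetical sample data, only for demonstration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.zengxi.net/a/</loc><lastmod>2021-02-01T10:00:00+08:00</lastmod></url>
  <url><loc>https://www.zengxi.net/b/</loc><lastmod>2021-01-01T10:00:00+08:00</lastmod></url>
  <url><loc>https://www.zengxi.net/tags/python/</loc><lastmod>2021-02-01T10:00:00+08:00</lastmod></url>
</urlset>"""


def select_urls(sitemap_xml, last_update_time_str, ignore_path_prefixes):
    """Return the <loc> values that are new enough and not under an ignored prefix."""
    cutoff = datetime.strptime(last_update_time_str, SITEMAP_DATETIME_FORMAT)
    prefixes = tuple(ignore_path_prefixes)
    selected = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", SITEMAP_NS):
        location = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod_text = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # As in the script, an entry without <lastmod> counts as updated.
        lastmod = (
            datetime.strptime(lastmod_text.strip(), SITEMAP_DATETIME_FORMAT)
            if lastmod_text
            else cutoff
        )
        if not location or location.startswith(prefixes):
            continue
        if lastmod >= cutoff:
            selected.append(location)
    return selected


print(select_urls(SAMPLE, "2021-01-26T00:00:00+08:00", ["https://www.zengxi.net/tags/"]))
```

With this sample, only the `/a/` entry survives: `/b/` is older than the cutoff and `/tags/python/` matches an ignored prefix. The full script follows.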
#!/usr/bin/env python3
# coding: utf-8

import xml.etree.ElementTree as ET
from datetime import datetime
import os


### Methods #########
def stripNs(el):
    # Recursively walk this element tree, removing namespaces.
    if el.tag.startswith("{"):
        el.tag = el.tag.split('}', 1)[1]  # strip namespace
    # Iterate over a copy of the keys, because the dict is modified in the loop.
    for k in list(el.attrib.keys()):
        if k.startswith("{"):
            k2 = k.split('}', 1)[1]
            el.attrib[k2] = el.attrib[k]
            del el.attrib[k]
    for child in el:
        stripNs(child)


### Arguments to change ####
lastUpdateTimeStr = '2021-01-26T00:00:00+08:00'
siteMapPath = 'public/sitemap.xml'
siteUrl = 'https://www.zengxi.net'
baiduApiToken = 'faketoken'
tmpFile = "/tmp/submitSiteMap"
ignorePathPrefixes = [
    'https://www.zengxi.net/archives/',
    'https://www.zengxi.net/categories/',
    'https://www.zengxi.net/links/',
    'https://www.zengxi.net/posts/',
    'https://www.zengxi.net/series/',
    'https://www.zengxi.net/tags/'
]

### CONSTANTS ###
# %z accepts offsets written as +08:00 on Python 3.7 and later.
SITEMAP_DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S%z'


lastUpdateTime = datetime.strptime(lastUpdateTimeStr, SITEMAP_DATETIME_FORMAT)


tree = ET.parse(siteMapPath)
urlset = tree.getroot()

# Write every URL that should be pushed to the temporary file, one per line.
with open(tmpFile, 'w') as f:
    for url in urlset:
        location = ''
        # Entries without <lastmod> are treated as updated and get pushed.
        lastmod = lastUpdateTime

        for urlChild in url:
            stripNs(urlChild)

            if urlChild.tag == 'loc':
                location = (urlChild.text or '').strip()
            elif urlChild.tag == 'lastmod':
                lastmod = datetime.strptime(urlChild.text.strip(), SITEMAP_DATETIME_FORMAT)

        # Skip entries without a location or under an ignored prefix.
        if not location or location.startswith(tuple(ignorePathPrefixes)):
            continue

        if lastmod >= lastUpdateTime:
            f.write(location + '\n')


# Submit the collected URLs to Baidu with curl.
command = """
curl -H 'Content-Type:text/plain' --data-binary @{filePath} "http://data.zz.baidu.com/urls?site={siteUrl}&token={token}"
"""

commandToExecute = command.format(filePath=tmpFile, siteUrl=siteUrl, token=baiduApiToken)
tmpres = os.popen(commandToExecute).readlines()

print(commandToExecute)
print(tmpres)
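The script shells out to curl, which needs curl on the machine and a temporary file. As a possible alternative, the same POST can be sketched with the standard library's `urllib`. The endpoint, the `site` and `token` query parameters, and the `text/plain` content type are taken from the curl command in the script; the function names `build_push_request` and `push_urls` are hypothetical, and the sketch has not been checked against Baidu's current API behaviour.

```python
import urllib.parse
import urllib.request

# Endpoint as it appears in the script's curl command.
BAIDU_PUSH_ENDPOINT = "http://data.zz.baidu.com/urls"


def build_push_request(site_url, token, urls):
    """Build (but do not send) the POST request carrying one URL per line."""
    # safe=":/" keeps the site URL unescaped, matching the curl command.
    query = urllib.parse.urlencode({"site": site_url, "token": token}, safe=":/")
    body = "\n".join(urls).encode("utf-8")
    return urllib.request.Request(
        BAIDU_PUSH_ENDPOINT + "?" + query,
        data=body,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def push_urls(site_url, token, urls, timeout=10):
    """Send the request and return the raw response text."""
    request = build_push_request(site_url, token, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `push_urls(siteUrl, baiduApiToken, locations)` with a list of selected locations would take the place of the curl step. Splitting request construction from sending makes it possible to inspect the URL and body without touching the network.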