Reading a Sitemap with Python and Pushing URLs via the Baidu API


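SEO matters for promoting a website. Most search engines offer APIs that let webmasters submit URLs proactively, which speeds up the indexing of new pages.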

Baidu provides an API for fast indexing. The Python script below reads a sitemap.xml file from the local disk and calls that API to submit URLs to Baidu.

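Only the following parameters need to be changed:

  • lastUpdateTimeStr - the time of the last push. It is compared with the timestamps in sitemap.xml, and only URLs updated at or after this time are pushed
  • siteMapPath - the path to sitemap.xml on the local disk
  • siteUrl - the site address
  • baiduApiToken - the Baidu API token
  • tmpFile - where the temporary file is saved
  • ignorePathPrefixes - URL prefixes to ignore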
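To make the filtering rule concrete, here is a minimal sketch of how a single sitemap entry could be checked against lastUpdateTimeStr and ignorePathPrefixes. The helper name `should_push` and the sample URLs in the usage note are assumptions made for illustration and do not appear in the script. The sketch also assumes every lastmod value carries a UTC offset, matching the format `%Y-%m-%dT%H:%M:%S%z`, and that it runs on Python 3.7 or later, where `%z` accepts offsets written as `+08:00`.

```python
from datetime import datetime

SITEMAP_DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S%z'


def should_push(location, lastmod_str, last_update_time_str, ignore_prefixes):
    """Return True if the URL should be submitted (hypothetical helper)."""
    # Skip URLs under any ignored prefix.
    if any(location.startswith(prefix) for prefix in ignore_prefixes):
        return False
    # An entry without <lastmod> is treated as updated at the cutoff, so it is pushed.
    if lastmod_str is None:
        return True
    last_update = datetime.strptime(last_update_time_str, SITEMAP_DATETIME_FORMAT)
    lastmod = datetime.strptime(lastmod_str.strip(), SITEMAP_DATETIME_FORMAT)
    # Both datetimes are timezone-aware, so the comparison accounts for offsets.
    return lastmod >= last_update
```

For example, with a cutoff of `2021-01-26T00:00:00+08:00`, an entry last modified at `2021-01-25T17:00:00+00:00` would still be pushed, because that instant is one hour after the cutoff once the offsets are taken into account. The full script follows.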
#!/usr/bin/env python3
# coding: utf-8

import subprocess
import xml.etree.ElementTree as ET
from datetime import datetime


### Methods #########
def stripNs(el):
    # Recursively walk this element tree, removing namespaces.
    if el.tag.startswith("{"):
        el.tag = el.tag.split('}', 1)[1]  # strip namespace
    # Iterate over a copy of the keys, because the dict is modified in the loop.
    for k in list(el.attrib.keys()):
        if k.startswith("{"):
            k2 = k.split('}', 1)[1]
            el.attrib[k2] = el.attrib[k]
            del el.attrib[k]
    for child in el:
        stripNs(child)


### Arguments to change ####
lastUpdateTimeStr = '2021-01-26T00:00:00+08:00'
siteMapPath = 'public/sitemap.xml'
siteUrl = 'https://www.zengxi.net'
baiduApiToken = 'faketoken'
tmpFile = '/tmp/submitSiteMap'
ignorePathPrefixes = [
    'https://www.zengxi.net/archives/',
    'https://www.zengxi.net/categories/',
    'https://www.zengxi.net/links/',
    'https://www.zengxi.net/posts/',
    'https://www.zengxi.net/series/',
    'https://www.zengxi.net/tags/'
]

### CONSTANTS ###
SITEMAP_DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S%z'


lastUpdateTime = datetime.strptime(lastUpdateTimeStr, SITEMAP_DATETIME_FORMAT)

tree = ET.parse(siteMapPath)
urlset = tree.getroot()
stripNs(urlset)  # strip namespaces once so tags can be compared by plain name

with open(tmpFile, 'w') as f:
    for url in urlset:
        location = ''
        # Entries without <lastmod> fall back to the cutoff time, so they are pushed.
        lastmod = lastUpdateTime

        for urlChild in url:
            text = (urlChild.text or '').strip()
            if urlChild.tag == 'loc':
                location = text
            elif urlChild.tag == 'lastmod' and text:
                lastmod = datetime.strptime(text, SITEMAP_DATETIME_FORMAT)

        # Skip entries without a location or under an ignored prefix.
        if not location or any(location.startswith(p) for p in ignorePathPrefixes):
            continue

        if lastmod >= lastUpdateTime:
            f.write(location + '\n')


# Pass curl its arguments as a list so no shell quoting is involved.
commandToExecute = [
    'curl', '-H', 'Content-Type:text/plain',
    '--data-binary', '@' + tmpFile,
    'http://data.zz.baidu.com/urls?site={siteUrl}&token={token}'.format(
        siteUrl=siteUrl, token=baiduApiToken),
]
result = subprocess.run(commandToExecute, capture_output=True, text=True)

print(' '.join(commandToExecute))
print(result.stdout)
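
The script relies on curl being installed on the machine. As an alternative, the submission step could be done with the Python standard library alone. The following is only a sketch under stated assumptions, not the method used above: the function names `build_push_request` and `push_urls` are hypothetical, the site address is percent-encoded in the query string (the curl command sends it unencoded), and the response body is assumed to be JSON, which the script itself does not show.

```python
import json
import urllib.parse
import urllib.request


def build_push_request(site_url, token, urls):
    """Build the POST request for the push endpoint (hypothetical helper)."""
    query = urllib.parse.urlencode({'site': site_url, 'token': token})
    endpoint = 'http://data.zz.baidu.com/urls?' + query
    # The body is plain text with one URL per line, like the temporary file.
    body = '\n'.join(urls).encode('utf-8')
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )


def push_urls(site_url, token, urls, timeout=10):
    """Send the request and return the decoded response (assumed to be JSON)."""
    request = build_push_request(site_url, token, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Separating request construction from sending makes it possible to inspect the URL and body without touching the network. Note that `urlopen` raises `urllib.error.HTTPError` for error status codes, so a caller would likely want to catch that around `push_urls`.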