This article is for learning and exchange only. Please bear with any errors or omissions, and everyone is welcome to study and discuss together!
Contents:
Reference materials (thanks!)
Crawling preparation
Crawling ideas
Module 1: Web table data crawling
Module 2: Appending extra data
Source code (may be revised in the near future...)
Crawl historical transaction data for the past month
Crawl historical transaction data for the past year
Reference materials (thanks!)
How to grab the table in a web page (Zhihu article):
https://zhuanlan.zhihu.com/p/33986020
Crawling preparation
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time
import random

Crawling ideas
Find the web page where the data lives and use the browser's developer tools to inspect the page URL, request status, source code, and so on, then locate the elements that hold the data. After that, write code that simulates access to the page, collects the data, processes it, and saves it locally. (The details are thin here; the blogger will find time to summarize them separately.)
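To make the idea concrete, here is a minimal sketch (mine, not the blogger's) that fetches one quarter page using the same Sina Finance URL pattern as the modules below, checks the request status, and locates the data table; the stock code and quarter are placeholders:

import requests
from bs4 import BeautifulSoup

# Placeholder stock code and quarter (jidu); any valid values work
url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory'
       '/stockid/601006.phtml?year=2019&jidu=2')
res = requests.get(url)
print(res.status_code)   # 200 means the request went through
res.encoding = 'gbk'     # the page is GBK-encoded
soup = BeautifulSoup(res.text, 'lxml')
table = soup.find('table', {'id': 'FundHoldSharesTable'})  # the history table
print(table is not None)  # True if the table element was found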
Module 1: Web table data crawling
def get_stock_table(stockcode, i):
    # Build the Sina Finance history URL for the given stock and quarter (jidu)
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '.phtml?year=2019&jidu=' + str(i))
    print(url)
    res = requests.get(url)
    res.encoding = 'gbk'  # the page is GBK-encoded
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_list = []
    for table in tables:
        df_list.append(pd.concat(pd.read_html(table.prettify())))
    df = pd.concat(df_list)
    df.columns = df.iloc[0]
    headers = df.iloc[0]
    df = pd.DataFrame(df.values[1:], columns=headers)
    # print(len(df) - 1)  # how many data rows df holds
    if (len(df) - 1) < 22:
        # Fewer than 22 rows on this page: top it up from the previous quarter
        c = len(df) - 1
        df = add_stock_table(stockcode, i, c, df)
    else:
        df = pd.DataFrame(df.values[1:22], columns=headers)
    df = df.reset_index(drop=True)
    df.to_excel('...\\' + str(stockcode) + '.xlsx')
    sleeptime = random.randint(1, 10)
    # print(sleeptime)
    time.sleep(sleeptime)  # random pause between requests

However, the function above sometimes does not return a full month of data, so we need a second function to top it up. (To fetch a whole year, you also need a loop, as sketched below.)
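For the year case, a minimal sketch of such a loop might look like the following (my illustration only; the actual year script in the source code section instead walks backwards quarter by quarter):

# Hypothetical illustration: loop over the four quarters (jidu 1-4) of 2019.
# Note that get_stock_table as written overwrites the same .xlsx on each call,
# so in practice you would accumulate the DataFrames instead.
for jidu in range(1, 5):
    get_stock_table('601006', jidu)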
Module 2: Appending extra data
def add_stock_table(stockcode, i, c, df):
    i = i - 1  # step back one quarter
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '.phtml?year=2019&jidu=' + str(i))
    # print(url)
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_addlist = []
    for table in tables:
        df_addlist.append(pd.concat(pd.read_html(table.prettify())))
    df_add = pd.concat(df_addlist)
    headers = df_add.iloc[0]
    # Take just enough rows from the previous quarter to reach ~21 trading days
    df_add = pd.DataFrame(df_add.values[1:random.randint(20, 22) - c], columns=headers)
    # print(df_add)
    df_sum = pd.concat([df, df_add])  # DataFrame.append was removed in pandas 2.0
    # print(df_sum)
    # print(len(df_sum) - 1)
    return df_sum
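As a quick usage sketch (the stock code is only an example), calling Module 1 for the second quarter of 2019 triggers Module 2 automatically whenever the quarter page holds fewer than 22 rows:

# Fetch roughly one month (about 21 trading days) of rows for one stock.
# If the Q2 page has too few rows, get_stock_table itself calls
# add_stock_table to top the table up from the previous quarter.
get_stock_table('601006', 2)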
Remember! This article is for learning and exchange only. If there are mistakes or omissions, please bear with me, and suggestions are welcome! The blogger is pretty Buddha-like (read: lazy), so feel free to modify the code however you wish!
Source code (may be revised in the near future...)
Note:
The source code cannot be run as-is!
The path of the .xlsx output file must be changed!!
Modify everything else to suit your needs; one way to handle the output path is sketched below.
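For instance, one way to replace the truncated '...' placeholders is to build every output path from a single directory variable; OUT_DIR below is a hypothetical name of mine, not something from the original code:

import os

OUT_DIR = r'D:\data\stocks'          # hypothetical output directory; change as needed
os.makedirs(OUT_DIR, exist_ok=True)  # create it if it does not exist yet

# Inside get_stock_table, the save line would then become:
# df.to_excel(os.path.join(OUT_DIR, str(stockcode) + '.xlsx'))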
Crawl historical transaction data for the past month
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os
import time
import random


def get_stock_table(stockcode, i):
    # Build the Sina Finance history URL for the given stock and quarter (jidu)
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '.phtml?year=2019&jidu=' + str(i))
    print(url)
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_list = []
    for table in tables:
        df_list.append(pd.concat(pd.read_html(table.prettify())))
    df = pd.concat(df_list)
    df.columns = df.iloc[0]
    headers = df.iloc[0]
    df = pd.DataFrame(df.values[1:], columns=headers)
    # print(len(df) - 1)  # how many data rows df holds
    if (len(df) - 1) < 22:
        # Fewer than 22 rows on this page: top it up from the previous quarter
        c = len(df) - 1
        df = add_stock_table(stockcode, i, c, df)
    else:
        df = pd.DataFrame(df.values[1:22], columns=headers)
    df = df.reset_index(drop=True)
    df.to_excel('...\\' + str(stockcode) + '.xlsx')
    sleeptime = random.randint(1, 10)
    # print(sleeptime)
    time.sleep(sleeptime)  # random pause between requests


def add_stock_table(stockcode, i, c, df):
    i = i - 1  # step back one quarter
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '.phtml?year=2019&jidu=' + str(i))
    # print(url)
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_addlist = []
    for table in tables:
        df_addlist.append(pd.concat(pd.read_html(table.prettify())))
    df_add = pd.concat(df_addlist)
    headers = df_add.iloc[0]
    # Take just enough rows from the previous quarter to reach ~21 trading days
    df_add = pd.DataFrame(df_add.values[1:random.randint(20, 22) - c], columns=headers)
    # print(df_add)
    df_sum = pd.concat([df, df_add])  # DataFrame.append was removed in pandas 2.0
    # print(df_sum)
    # print(len(df_sum) - 1)
    return df_sum


if __name__ == "__main__":
    if os.path.exists("...\\601006.xlsx"):
        os.remove("...\\601006.xlsx")
    stockcode = ['601006', '000046', '601398', '000069', '601939', '000402',
                 '000001', '000089', '000027', '399001', '000002', '000800',
                 '601111', '600050', '601600', '600028', '601857', '601988',
                 '000951', '601919']
    i = 2
    index = 1
    print("Crawling month_stock information...\n")
    print("---------------\n")
    print("Please wait patiently...\n")
    for x in stockcode:
        print(index)
        get_stock_table(x, i)
        index += 1
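Once the month script finishes, a quick sanity check (my suggestion, assuming the hypothetical OUT_DIR layout from the note above) is to read one saved workbook back with pandas:

import pandas as pd

# Read back one saved workbook and eyeball the first rows
# (reading .xlsx files requires the openpyxl package)
df = pd.read_excel(r'D:\data\stocks\601006.xlsx', index_col=0)
print(df.shape)   # expect roughly 21 rows, one per trading day
print(df.head())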
Crawl historical transaction data for the past year

from bs4 import BeautifulSoup
import requests
import pandas as pd
import os
import time
import random


def get_stock_yeartable(stockcode, s, y):
    # Build the Sina Finance history URL for the given stock, quarter (jidu) and year
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '/type/S.phtml?year=' + str(y) + '&jidu=' + str(s))
    print(url)
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_list = []
    for table in tables:
        df_list.append(pd.concat(pd.read_html(table.prettify())))
    df = pd.concat(df_list)
    df.columns = df.iloc[0]
    headers = df.iloc[0]
    df = pd.DataFrame(df.values[1:], columns=headers)
    # print(len(df) - 1)  # how many data rows df holds
    # Keep pulling earlier quarters until we have about one trading year of rows
    while len(df) < 250:
        s -= 1
        if s == 0:  # wrapped past Q1: continue from Q4 of the previous year
            s = 4
            y -= 1
        df = add_stock_table(stockcode, s, y, df)
    df = df.reset_index(drop=True)
    df = pd.DataFrame(df.values[1:250], columns=headers)
    df.to_excel('D:\\Workplace\\PyCharm\\MySpider\\sh' + str(stockcode) + '.xlsx')
    sleeptime = random.randint(1, 10)
    # print(sleeptime)
    time.sleep(sleeptime)  # random pause between requests


def add_stock_table(stockcode, s, y, df):
    print(y, "-", s)
    url = ('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/'
           + str(stockcode) + '/type/S.phtml?year=' + str(y) + '&jidu=' + str(s))
    print(url)
    res = requests.get(url)
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text, 'lxml')
    tables = soup.find_all('table', {'id': 'FundHoldSharesTable'})
    df_addlist = []
    for table in tables:
        df_addlist.append(pd.concat(pd.read_html(table.prettify())))
    df_add = pd.concat(df_addlist)
    headers = df_add.iloc[0]
    df_add = pd.DataFrame(df_add.values[1:], columns=headers)
    # print(df_add)
    df_sum = pd.concat([df, df_add])  # DataFrame.append was removed in pandas 2.0
    # print(df_sum)
    # print(len(df_sum) - 1)
    return df_sum


if __name__ == "__main__":
    if os.path.exists("D:\\Workplace\\PyCharm\\MySpider\\sh000001.xlsx"):
        os.remove("D:\\Workplace\\PyCharm\\MySpider\\sh000001.xlsx")
    stockcode = ['000001']
    s = 2
    y = 2019
    index = 1
    print("Crawling year_sh_stock information...\n")
    print("---------------\n")
    print("Please wait patiently...\n")
    for x in stockcode:
        print(index)
        get_stock_yeartable(x, s, y)
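Both scripts fire many requests in a row at Sina's server. If requests start getting rejected, one common tweak (my addition, not part of the original code) is to send a browser-like User-Agent and a timeout through a shared requests.Session:

import requests

session = requests.Session()
session.headers.update({
    # A browser-like User-Agent; some servers reject the default python-requests one
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
})

# Drop-in replacement for the requests.get(url) calls in both scripts:
# res = session.get(url, timeout=10)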