Crawling Part of the Stock Information on Oriental Fortune.com (1)
Goal: crawl the name, code, price change, financing balance, margin financing and securities lending balance, and number of shareholders for some stocks on Oriental Fortune.com.
Programming language: Python 3.7
Development IDE: Visual Studio 2019 (I'm used to the little black console window, and I like the interface design of VS2019, so I didn't use a mainstream Python IDE)
The home page of Oriental Fortune.com does not directly list every stock, but after searching I found the place where all the stock names and codes are displayed. Let's first take a look at that page's source code.
You can see that each stock name and code sits inside a "li" tag, so we just need the text in those "li" tags. Press Ctrl+F to count the "li" tags: there are 10326. We only want the ones containing stock information, so grabbing every "li" tag directly won't work. Looking at the parent of these "li" tags, we find it is a "ul" tag; searching shows there are 34 "ul" tags, and since we only want the one holding the stock list, grabbing "ul" tags directly won't work either. Looking one level higher, the parent of that "ul" is a "div" tag with a particular id, and there is only one "div" tag with that id in the entire page source. So we grab that "div" tag directly, then find all the "li" tags inside it to crawl the stock names and codes.
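The narrowing-down logic above can be sketched with BeautifulSoup on a toy page (the markup, the id, and the names here are illustrative stand-ins, not the site's real source):

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the list page; the real page has thousands
# of li tags, most of them unrelated to stocks.
html = """
<div id="stocklist">
  <ul><li>StockA(000001)</li><li>StockB(000002)</li></ul>
</div>
<ul><li>Home</li><li>News</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("li")))      # 4: grabbing li directly is too broad
li = soup.find("li")
print(li.parent.name)                # ul: the parent tag of the li tags
print(li.parent.parent.get("id"))    # stocklist: the unique div's id

# So: grab the unique div first, then the li tags inside it.
div = soup.find("div", id="stocklist")
print([t.get_text() for t in div.find_all("li")])
```

Counting tags and walking up `.parent` this way is how you confirm which ancestor is unique enough to anchor the crawl.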
(I use the get method from the requests library to fetch the data here. This approach only suits crawls with a relatively small amount of data, so all the stocks whose codes start with "6" could not be crawled. If you want all of Oriental Fortune.com's stock information, the Scrapy framework, which is designed for large crawls, is recommended.)
The crawling code is roughly as follows (only a ".find_all" fragment of the original listing survives, so this is a reconstruction; the list-page url and the div's id are placeholders because the real values were lost):

    import requests
    from bs4 import BeautifulSoup

    list_url = '...'                    # placeholder: the stock-list page url
    html = requests.get(list_url)
    html.encoding = 'gbk'               # the site's pages are typically GBK-encoded
    soup = BeautifulSoup(html.text, 'html.parser')
    div = soup.find('div', id='...')    # placeholder id: the single div wrapping the stock list
    stock_list = div.find_all('li')     # each li holds one "name(code)" entry
Let's print it out and take a look:
It's not over yet: we only want the stock name and code, without the extra HTML, so we use a for loop to pull the content out.
Finally a sense of accomplishment. Now let's take the next step and process the stock information we crawled.
Click into the page of any specific stock and you'll see a "margin trading" hyperlink. Click it and you can see the stock's historical rises and falls, financing balance, margin financing and securities lending balance, and so on, but not the number of shareholder accounts. So let's set the shareholder-account count aside for now and first crawl what we can: the price changes, the financing balance, and the margin financing and securities lending balance.
Check the page source and you'll find that the data in the table is not written directly into the HTML, which means we can't capture it with a plain get request the way we did for the stock names and codes. The data is loaded asynchronously by JavaScript through a "script" tag, a dynamic-display technique most websites now use to update data (older sites that didn't use it forced you to refresh the whole page to see new data). So when you click on the page, the browser sends a request to the server, and the response carries the data. Press F12 to enter debugging mode, find the Network tab, and select All; you will see many request items.
Now click to the next page so the browser issues a new request to the server, and a new request item appears in the Network panel. That item contains the next page of data, and it is the request item we need.
Then open that request's url (shown on the right) in the browser, as below.
Then we can fetch this link directly with the get method to obtain the json data.
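As a sketch of that step, with a canned response string standing in for the server's reply (the field names here are illustrative, not the site's actual schema; in practice you would call requests.get on the url and decode its body):

```python
import json

# In practice: data = requests.get(data_page_url).json()
# Here a canned reply stands in for the server response.
response_text = '{"pages": 4, "data": ["row1", "row2"]}'
data = json.loads(response_text)

print(data["pages"])     # 4
print(data["data"][0])   # row1
```

Once the body is decoded, the table rows are ordinary Python lists and dicts, ready to be looped over.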
The problem now is generating these json urls: every page of data for every stock is a new url. Comparing the urls of different data pages for the same stock, only the number after "p=" changes; that number is the page number.
Take this stock I selected as an example:
——p=2——&rt=53149371 is the second page of data
——p=3——&rt=53149371 is the third page of data
——p=4——&rt=53149371 is the fourth page of data
Use a loop to change the number after "p=" to generate the urls for the different pages.
What about a different stock? Observing the url again, we find it contains the stock code: the code between the double quotes after "scode=". Changing that code gives the data url for a new stock, so another loop over the stock codes generates the urls for different stocks' information.
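The two observations above (p= selects the page, scode= selects the stock) can be sketched as a small nested loop; the base url below is a stand-in, not the real request address:

```python
# Placeholder base url; the real request url's prefix is much longer.
BASE = "http://example.com/api?p={page}&scode=%22{code}%22&rt=53149371"

codes = ["000001", "000002"]   # illustrative stock codes
urls = [BASE.format(page=p, code=c) for c in codes for p in range(1, 3)]
for u in urls:
    print(u)
```

Each stock contributes one url per page, so two stocks and two pages yield four request urls.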
This is where the stock names and codes we crawled earlier come in handy, except that what we crawled is all in the format "stock name(stock code)". Since we only need the stock code, we just have to take out the content inside the brackets.
Also use a loop to extract the content inside the brackets of each crawled string. Only fragments of the original listing survive, so this is a reconstruction:

    import re

    Stock_Name = []
    Stock_Number = []
    first_number = []
    for li in stock_list:
        text = li.get_text()
        if text:                                    # skip the extra empty string
            code = re.findall('[(](.*?)[)]', text)  # first regex: content inside the brackets
            name = re.sub('[(].*?[)]', '', text)    # second regex: strip the brackets and content
            Stock_Number.append(code[0])
            Stock_Name.append(name)
            first_number.append(code[0][0])         # first digit of the code (used below)
Here an if statement checks whether the string is empty; without it the code errors, because the for loop runs one extra time and tries to process an "empty" string. Two regular expressions match the required content: the first finds the content inside the brackets, and the second removes the brackets together with their content. With that, we have successfully separated the stock name from the stock code, and the codes can be used to generate the data urls. (re.findall returns a list; since there is only one match, we take Stock_Number[0].) Next is generating the url; the code is as follows:
Only fragments of this listing survive as well, so this is a reconstruction; part1 is a placeholder for the start of the request url (everything up to and including "p="), which was lost:

    part1 = '...'   # placeholder: start of the request url, up to "p="
    part3 = '&ps=50&st=date&sr=-1&filter=(scode=%22'
    part4 = '%22)&rt=53137855'
    for i in range(len(Stock_Number)):
        if first_number[i]:            # only "if first_number" survives of the original condition
            for p in range(1, 5):      # pages 1 to 4, as in the examples above
                data_page = part1 + str(p) + part3 + Stock_Number[i] + part4
                print(data_page)