Python Crawler in Practice | Using Multithreading to Crawl LOL HD Wallpapers


The official website interface is shown in the figure. Each small picture represents a hero. Our goal is to crawl all of each hero's skin images, download them, and save them locally.

Secondary page

The page above is called the main page; the secondary page is the page for each individual hero. Take the Dark Lady as an example. Her secondary page looks like this:

We can see many small pictures; each thumbnail corresponds to a skin. Viewing the skin data interface through the network panel gives the following:

We know that the skin information is transmitted as a string in JSON format, so we only need to find the id corresponding to each hero, locate the corresponding JSON file, and extract the required data to get the HD skin wallpapers.

Then the JSON file address of the Dark Lady is:

hero_one = ''

The rule here is actually very simple; the address of each hero's skin data looks like this:

url = '{}.js'.format(id)

Then the question is: what is the rule for the id? The hero ids can be viewed on the homepage, as shown below:


We can see two lists, [0, 99] and [100, 156], i.e. 156 heroes, but heroId goes all the way up to 240. So the ids follow some pattern rather than simply increasing by one; to crawl all hero skin images, we first need to collect all the heroIds.

3. Crawling approach

Why use multithreading? Let me explain. When we crawl data such as pictures and videos, we need to save them locally, which involves a large number of file read and write operations, i.e. IO operations. Imagine that we perform the requests synchronously:

The second request is only made after the first request completes and the file is saved locally, which is very inefficient. If multithreading is used for asynchronous operation, efficiency improves greatly.

So it is necessary to use multithreading or multiprocessing, and hand the queue of urls over to a thread pool or process pool for processing; a rough comparison is sketched below.
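
To make the efficiency point concrete, here is a rough, self-contained sketch (not part of the original article; the urls list is just a placeholder) comparing a synchronous loop with a thread pool for IO-bound downloads:

import time
import requests
from multiprocessing.dummy import Pool as ThreadPool

def fetch(url):
    # one IO-bound request; the thread spends most of its time waiting on the network
    return requests.get(url, timeout=10).content

urls = ['https://www.example.com/'] * 10  # placeholder links

start = time.time()
for u in urls:  # synchronous: each request waits for the previous one to finish
    fetch(u)
print('sequential:', time.time() - start)

start = time.time()
pool = ThreadPool(6)  # six worker threads overlap the waiting time
pool.map(fetch, urls)
pool.close()
pool.join()
print('threaded:', time.time() - start)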

In Python, the multiprocessing Pool (process pool) and the multiprocessing.dummy module are very useful for this.

multiprocessing.dummy module: the dummy module provides a thread pool (multithreading);

multiprocessing module: multiprocessing provides a process pool (multiprocessing);

The multiprocessing.dummy and multiprocessing modules share the same API, so switching the code between the two is flexible, as the sketch below shows;
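
As a minimal illustration (a toy worker function, not the article's code), swapping between the thread pool and the process pool is just a matter of changing one import, because both expose the same Pool API:

from multiprocessing.dummy import Pool   # thread pool (threads)
# from multiprocessing import Pool       # process pool (processes); same API, just swap the import

def work(n):
    return n * n

if __name__ == '__main__':  # the guard is required on Windows when using the real process pool
    pool = Pool(4)
    print(pool.map(work, range(8)))  # -> [0, 1, 4, 9, 16, 25, 36, 49]
    pool.close()
    pool.join()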

We first test grabbing the hero ids in a demo.py file. I have already written the code here to produce a list of hero ids, which can be used directly in the main file;

demo.py

import requests  # HTTP requests
import json      # parse the JSON response

headers = {'User-Agent': 'Mozilla/5.0'}  # minimal request headers (assumed; not shown in the original)
url = ''  # hero-list data address (left blank in the original)

res = requests.get(url, headers=headers)
res = res.content.decode('utf-8')
res_dict = json.loads(res)
heroes = res_dict["hero"]  # information for the 156 heroes
idList = []
for hero in heroes:
    hero_id = hero["heroId"]
    idList.append(hero_id)
print(idList)

Get the idList as follows:

idList = [1, 2, 3, …, 875, 876, 877]  # the hero ids in the middle are not shown here

Constructed url:

page = '{}.html'.format(i)

The i here represents the id, and the url is constructed dynamically; a sketch of building the full url queue from the idList follows.
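
Below is a small sketch (mine, with the base address left as a placeholder because the article omits it) of building the full url queue from the idList collected by demo.py:

BASE = ''  # placeholder: prefix of the hero skin data address, omitted in the original

idList = [1, 2, 3]  # example ids; in practice use the full idList from demo.py
page = [BASE + '{}.js'.format(i) for i in idList]
print(page)  # -> ['1.js', '2.js', '3.js'] while BASE is empty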

Then we define two functions: one for crawling and parsing the page (spider), the other for downloading the data (download). We open the thread pool, use a for loop to build the urls of the hero skin JSON data and store them in a list as a url queue, and use the pool.map() method to execute the spider function;

def map(self, fn, *iterables, timeout=None, chunksize=1):
    """Returns an iterator equivalent to map(fn, iter)."""

# Here our usage is: pool.map(spider, page)  # spider: the crawler function; page: the url queue

Function: takes each element of the list in turn as an argument to the function, creates a worker for it, and puts it into the pool;

Parameter 1: function to execute;

Parameter 2: an iterable; its elements are passed one by one as arguments to the function (a minimal example follows);
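
A minimal, self-contained example (a toy square function, not the article's spider) of how pool.map() feeds each element of the iterable to the function and collects the results in order:

from multiprocessing.dummy import Pool as ThreadPool

def square(n):
    return n * n

pool = ThreadPool(4)
print(pool.map(square, [1, 2, 3, 4, 5]))  # -> [1, 4, 9, 16, 25]
pool.close()
pool.join()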

JSON data parsing

Here we parse the JSON file of the Dark Lady's skins as a demonstration. The content we need to get is: 1. name, 2. skin_name, 3. mainImg. Because we found that heroName is the same across a hero's skins, we use the hero name as the name of that hero's skin folder, which makes it easy to view and save;

item = {}
item['name'] = hero["heroName"]
item['skin_name'] = hero["name"]
if hero["mainImg"] == '':
    continue
item['imgLink'] = hero["mainImg"]

One thing to note:

Some mainImg fields are empty, so we need to skip them; otherwise requesting an empty link will raise an error;
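
A slightly more defensive variant (my suggestion, not the original code) uses dict.get() so that a missing "mainImg" key is treated the same way as an empty one:

def extract_item(hero):
    # returns None for skins without a usable main image link
    if not hero.get("mainImg"):
        return None
    return {
        'name': hero["heroName"],
        'skin_name': hero["name"],
        'imgLink': hero["mainImg"],
    }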

4. Data collection

Import the related third-party libraries

import requests  # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool  # concurrency
import time  # timing
import os    # file operations
import json  # JSON parsing

headers = {'User-Agent': 'Mozilla/5.0'}  # request headers (assumed; the original defines them elsewhere)

Page data analysis

def spider(url):
    res = requests.get(url, headers=headers)
    result = res.content.decode('utf-8')
    res_dict = json.loads(result)

    skins = res_dict["skins"]  # this hero's list of skins (15 entries here)
    print(len(skins))

    for index, hero in enumerate(skins):  # enumerate gives the index, used to name the image files
        item = {}  # dictionary object
        item['name'] = hero["heroName"]
        item['skin_name'] = hero["name"]

        if hero["mainImg"] == '':
            continue
        item['imgLink'] = hero["mainImg"]
        print(item)

        download(index + 1, item)

download(): download the images

def download(index, contdict):
    name = contdict['name']
    path = "skin/" + name
    if not os.path.exists(path):
        os.makedirs(path)
    content = requests.get(contdict['imgLink'], headers=headers).content
    with open(path + '/' + contdict['skin_name'] + str(index) + '.jpg', 'wb') as f:
        f.write(content)

Here we use the os module to create a folder. As mentioned earlier, the heroName value is the same for all of a hero's skins, so we use it to create and name the folder, which makes it convenient to save (categorize) the skins. Also be careful with the image file path: one missing slash will cause an error. A small alternative sketch follows.
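
A small alternative sketch (my suggestion, not the article's code): building the path with os.path.join and makedirs(..., exist_ok=True) avoids both the missing-slash problem and the separate existence check:

import os

def save_image(hero_name, filename, content):
    folder = os.path.join('skin', hero_name)  # e.g. skin/Dark Lady
    os.makedirs(folder, exist_ok=True)        # no error if the folder already exists
    with open(os.path.join(folder, filename), 'wb') as f:
        f.write(content)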

main(): the main function

def main():
    start = time.time()  # start timer (assumed; the original snippet only shows `end`)
    pool = ThreadPool(6)
    page = []
    for i in range(1, 21):
        newpage = '{}.js'.format(i)
        print(newpage)
        page.append(newpage)
    result = pool.map(spider, page)
    pool.close()
    pool.join()
    end = time.time()
    print('time used:', end - start)

Description:

In the main function, we first create a thread pool with six threads;

We dynamically construct 20 urls with a for loop as a trial run (20 heroes' skins); to crawl all of them, traverse the idList obtained earlier and construct the urls from it;

Use the map() function to run the parsing and saving operations on the urls in the thread pool;

Calling pool.close() does not shut the pool down immediately; it changes its state so that no new tasks can be submitted;

5. Running the program

if __name__ == '__main__':
    main()

The result is as follows:

Of course, the screenshot only shows some of the images; in total 200+ images were crawled, which is a decent result.

