Authors :
Yisong Wang; Dongmei Zhang
Volume/Issue :
Volume 6 - 2021, Issue 11 - November
Google Scholar :
http://bitly.ws/gu88
Scribd :
https://bit.ly/3G8qCIb
Abstract :
With the advent of the data age, extracting and making use
of data has become a major challenge. Crawler algorithms are
designed to collect website information in batches. However,
some malicious crawlers interfere with the normal business
and operation of websites, for example through automated
ticket grabbing. Anti-crawler techniques have therefore
emerged as a new research topic. Building on early front-end
anti-crawler measures, anti-crawler systems based on big data
have appeared and greatly improved anti-crawler efficiency.
The purpose of this work is to develop such an anti-crawler
system. After studying existing anti-crawler strategies and
technologies, it was determined that the system's functions
should include data classification, data landing, data
processing, data access, and IP-sensitive representation.
The goal is to meet the anti-crawler needs of ticketing
websites, ensure normal business operations, and improve
user satisfaction. The system is built with Spark, Redis,
Kafka, and Nginx + Lua, and uses IntelliJ IDEA as the
development tool. After development was completed, the
system underwent functional and performance testing. Its
functions are simple and convenient to use, with good
accuracy and good scalability, and it meets the development
needs.
Keywords :
anti-crawler; Hadoop; Spark; Redis; Kafka; Nginx