gxmatmars / Scraping website content with wget and processing the data with Ruby

Published: 2023-09-06 01:34 | Editor: 胡小海 | Keywords: none
pachong.rb
 
# Base URL of the character pages; the numeric id is appended to it.
URL = 'bangumi.tv/character/'

# Ids that have already been handled: files in download/, fail/ and error/
# whose names start with the id.
READY = []
%w[download fail error].each do |dir|
  Dir.glob("#{dir}/*").each do |f|
    READY << $1.to_i if f =~ /#{dir}\/(\d+)/
  end
end

READY.uniq!
 
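# Fetch the page for id i, log its title, and download the cover image.
# Returns the log text for this id (empty if the page could not be fetched).
def download(i)
  log = ''
  fn = i.to_s
  # wget saves the page in the current directory under the id as file name
  system 'wget', "#{URL}#{fn}"

  return '' unless File.exist?(fn)

  lines = File.readlines(fn)

  found = false
  lines.each do |l|
    if l =~ /<title>(.+)<\/title>/
      name, description = $1.split('|').collect { |e| e.strip }
      log << "#{i}: #{name}, #{description}\n"
    end
    if l =~ /href="(.+)" class="cover thickbox"/
      url = 'http:' + $1
      url.slice!(/\?.+$/) # drop the query string
      log << url + "\n"
      system 'wget', url
      File.delete(fn)
      found = true
      break
    end
  end

  unless found
    # No cover link on the page: keep the page in fail/ so the id is skipped next time
    File.rename(fn, File.join('fail', fn))
    log << "\n"
  end

  log
end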
 
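# Usage: ruby pachong.rb <first id> <number of ids>
i = ARGV[0].to_i
n = ARGV[1].to_i

log = ''

n.times do
  log << download(i) unless READY.include?(i)
  i += 1
end

# Move the downloaded images into download/. Files that cannot be moved
# (still being written or already moved by another worker) are skipped.
Dir.glob('*.jpg').each do |jpg|
  begin
    File.rename(jpg, File.join('download', jpg))
  rescue SystemCallError
    next
  end
end

File.open('pachong.txt', 'a') do |f|
  f << log
end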
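The parsing step inside download can be tried without any network access. The following is only a sketch: the helper name parse_character_page and the sample lines are made up for illustration, and the two regular expressions are the ones used in pachong.rb.

```ruby
# Hypothetical helper (not part of pachong.rb): extract the title parts and
# the cover URL from the lines of a downloaded page.
def parse_character_page(lines)
  info = { name: nil, description: nil, cover_url: nil }
  lines.each do |line|
    if line =~ /<title>(.+)<\/title>/
      info[:name], info[:description] = $1.split('|').map(&:strip)
    end
    if line =~ /href="(.+)" class="cover thickbox"/
      # Prepend the scheme and drop the query string, as pachong.rb does
      info[:cover_url] = ('http:' + $1).sub(/\?.+$/, '')
      break
    end
  end
  info
end

# Made-up sample input, only to show the shape of the result
sample = [
  "<title>Name | Description</title>\n",
  "<a href=\"//example.com/pic/1_abc.jpg?r=123\" class=\"cover thickbox\">\n"
]
p parse_character_page(sample)
```

Keeping the parsing separate from the wget calls makes it possible to check the regular expressions against a saved page before starting a long run.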
readme.md
 

before running

  1. Install wget and ruby.
  2. Create the folders download and fail.
  3. Modify forloop.bat:
    • line 5: set (start, step, end), for example step = 50 and end = start + 1000 (about 20 parallel processes).
    • line 7: the second parameter for pachong.rb should be >= step.
  4. Run forloop.bat.
  5. When most of the pictures have been downloaded, run: ruby run.rb 50

tips

  1. The script may miss some pictures. Just run it again; pictures already in the folders are skipped.
  2. If a cmd window gets stuck, press Enter to skip the current wget command.
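The rule in step 3 (second parameter >= step) exists because each worker only visits as many ids as its second parameter says. A small sketch, using a hypothetical helper uncovered_ids and made-up small numbers, shows which ids would never be visited when the count is smaller than the step:

```ruby
# Hypothetical helper: ids in first..last that no worker visits when a worker
# is started every `step` ids and each worker handles `count` ids.
def uncovered_ids(first, step, last, count)
  covered = first.step(last, step).flat_map { |s| (s...(s + count)).to_a }
  (first..last).to_a - covered
end

p uncovered_ids(1, 50, 200, 50).size # 0: count == step, every id is visited
p uncovered_ids(1, 50, 200, 40).size # 40: ten ids are skipped after each block
```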
forloop.bat
 
@echo off
mkdir download
mkdir fail
mkdir error
for /l %%i in (30001,500,40000) do (
@ping 127.0.0.1 -n 1 >nul
start /min cmd /c ruby pachong.rb %%i 500
)
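For readers not used to batch files, for /l %%i in (30001,500,40000) counts from 30001 to 40000 in steps of 500, so it starts 20 workers, the last one at id 39501. The same plan can be sketched in Ruby; worker_commands is a hypothetical helper, not part of the original scripts.

```ruby
# Hypothetical helper mirroring `for /l %%i in (first,step,last)`:
# one command line per starting id, each worker handling `count` ids.
def worker_commands(first, step, last, count)
  first.step(last, step).map { |s| ['ruby', 'pachong.rb', s.to_s, count.to_s] }
end

cmds = worker_commands(30001, 500, 40000, 500)
puts cmds.size            # 20
puts cmds.first.join(' ') # ruby pachong.rb 30001 500
puts cmds.last.join(' ')  # ruby pachong.rb 39501 500
```

On a system without cmd, each of these arrays could presumably be passed to Process.spawn to start the workers; that variant has not been tried here.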
run.rb
 
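# Move leftover files whose names start with a digit into error/, then move
# the remaining images into download/. Files that cannot be moved are skipped.
{ '[0-9]*' => 'error', '*.jpg' => 'download' }.each do |pattern, dir|
  Dir.glob(pattern).each do |f|
    begin
      File.rename(f, File.join(dir, f))
    rescue SystemCallError
      next
    end
  end
end

# Gaps longer than Limit are handed to workers (default 50)
Limit = ARGV[0] ? ARGV[0].to_i : 50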
 
# Ids that have already been handled: files in download/, fail/ and error/
# whose names start with the id.
READY = []
%w[download fail error].each do |dir|
  Dir.glob("#{dir}/*").each do |f|
    READY << $1.to_i if f =~ /#{dir}\/(\d+)/
  end
end
 
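r = READY.sort
show = true # true while scanning handled ids, false while inside a gap
j = 0       # first id of the current gap

start = []  # first missing id of each gap
step = []   # length of each gap

for i in 20001..40000
  if show
    if !r.include?(i)
      start << i
      show = !show
      j = i
    end
  else
    if r.include?(i)
      step << i - j
      print "#{j} -> #{i} : #{i-j}\n"
      show = !show
    end
  end
end

# Close a gap that runs to the end of the range, so that start and step
# always have the same number of entries
step << 40001 - j unless show

print "total: #{step.sum}\n"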
 
n = 0
i = 0
while start[i]
  if step[i] > Limit
    if step[i] > 2 * Limit
      # Split a long gap: this worker takes 2 * Limit ids,
      # the remainder is queued as a new gap
      start << start[i] + 2 * Limit
      step << step[i] - 2 * Limit
      step[i] = 2 * Limit
    end
    start[i] += 1
    puts "#{start[i]} + #{step[i]}"
    system "start /min cmd /c ruby pachong.rb #{start[i]} #{step[i]}"
    sleep(1)
    n += 1
    break if n > 20 # cap the number of workers started per run
  end
  i += 1
end
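The core of run.rb is finding the runs of ids that are still missing and cutting long runs into pieces for separate workers. The sketch below is a simplified variant of that idea, not the script's exact logic: the helpers missing_runs and split_runs are hypothetical, the ids are made up, and the rules from run.rb about ignoring short gaps and starting one id later are left out.

```ruby
# Hypothetical helper: list the runs of ids in `range` that are not in `done`,
# as [first_missing_id, length] pairs.
def missing_runs(done, range)
  missing = range.to_a - done
  missing.slice_when { |a, b| b != a + 1 }.map { |run| [run.first, run.size] }
end

# Hypothetical helper: cut each run into pieces of at most `max` ids.
def split_runs(runs, max)
  runs.flat_map do |first, length|
    (0...length).step(max).map { |off| [first + off, [max, length - off].min] }
  end
end

runs = missing_runs([1, 2, 3, 7, 8], 1..12)
p runs                # [[4, 3], [9, 4]]
p split_runs(runs, 3) # [[4, 3], [9, 3], [12, 1]]
```

Each resulting pair corresponds to the two arguments that pachong.rb expects: the first id and the number of ids.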


Original article: https://www.cnblogs.com/PHPnetc/p/8228942.html
