进击的Spider - CSDN Blog

Original: Date format conversion

1. Convert <class 'datetime.date'> to <class 'str'>:
date_1 = datetime.date(year, mon+1, day2).strftime("%Y-%m-%d")
2. # Get the current time, format: 2018-12-18
now_date = time.strftime('%Y-%m-%d', time.localt...

2021-08-30 22:17:12 110
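
The two conversions in the entry above can be sketched as follows. This is a minimal example using only the standard library; the helper names `date_to_str` and `str_to_date` and the sample date are illustrative assumptions, not taken from the post.

```python
import datetime
import time

DATE_FORMAT = "%Y-%m-%d"


def date_to_str(value: datetime.date) -> str:
    """Format a datetime.date as a 'YYYY-MM-DD' string."""
    return value.strftime(DATE_FORMAT)


def str_to_date(text: str) -> datetime.date:
    """Parse a 'YYYY-MM-DD' string back into a datetime.date."""
    return datetime.datetime.strptime(text, DATE_FORMAT).date()


sample = datetime.date(2018, 12, 18)  # illustrative date
sample_str = date_to_str(sample)

# Current local date in the same format, via the time module
today_str = time.strftime(DATE_FORMAT, time.localtime())

print(sample_str, today_str)
```

Keeping the format string in one constant means formatting and parsing cannot drift apart.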

Original: Installing scrapy on Windows fails with "Microsoft Visual C++ 14.0 is required..."

Installing scrapy on Windows fails because the C++ build tools are missing. Fix for the installation error:
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualst...

2019-02-01 14:45:57 236

Original: Date and time conversion

import datetime
newsTime = 'Sun 23 Apr 2017 05:15:05'
GMT_FORMAT = '%a %d %b %Y %H:%M:%S'
newsTime = datetime.datetime.strptime(newsTime, GMT_FORMAT)
print(newsTime)  # 2017-04-23 05:15:05

2018-12-24 23:37:14 243

Original: Handling commas in extracted data when saving parsed data to a CSV file

https://blog.csdn.net/lanji1988/article/details/60139600
Specifying the header row when writing CSV: https://blog.csdn.net/zn505119020/article/details/77480969

2018-12-21 15:41:15 4647
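
The entry above only links to other posts. As a hedged illustration of the topic, the standard `csv` module quotes any field that contains a comma, and `csv.DictWriter` writes a header row from `fieldnames`. The sample rows and the helper names `write_rows` and `read_rows` are assumptions made for this sketch.

```python
import csv
import io

# Illustrative data: the first title contains a comma, the second contains quotes
rows = [
    {"title": "Spider, part 1", "views": 4647},
    {"title": 'He said "hi"', "views": 12},
]


def write_rows(records, fieldnames):
    """Write records as CSV text with a header row; commas are quoted automatically."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def read_rows(text):
    """Parse CSV text back into a list of dicts (all values come back as strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = write_rows(rows, ["title", "views"])
parsed = read_rows(csv_text)
print(csv_text)
```

When writing to a real file instead of a buffer, open it with `newline=""` so the module controls line endings.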

Original: requests (logging | connecting to a database | fetching dynamic proxies | crawling data)

import requests
import logging
import time
import json
import pymysql
import os
# Logging
log_name = 'sb_spider_log.log'
logging.basicConfig( # log output settings
    filename=log_name, filemode='a', leve...

2018-12-16 22:40:57 821

Repost: Managing Python 3 runtime environments with virtualenv

Reference: https://www.cnblogs.com/hiddenfox/p/virtualenv-python3.html

2018-12-11 00:31:14 127

Repost: Managing Python 3 runtime environments with virtualenv

Managing Python 3 runtime environments with virtualenv: https://www.cnblogs.com/hiddenfox/p/virtualenv-python3.html
Fixing the "virtualenvwrapper.sh file not found" error when configuring virtualenvwrapper on CentOS 7: https://blog.csdn.net/hpwzjz/article/details/...

2018-12-11 00:31:14 170

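Original: Differences between processes, threads, and coroutines

https://www.cnblogs.com/lei0213/p/8393323.html
### Process pool
A process occupies one CPU core and a certain amount of memory. A typical CPU has 4 cores; if too many processes are started, other programs have to wait.
###### When should multiple processes be used?
The CPU does computation, so multiprocessing is only worthwhile for CPU-bound work. How many processes to start depends on the machine's actual configuration and the production workload.
### ...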

2018-12-09 16:51:34 1330

Original: Singleton pattern

class A(object):
    instance = None
    def __new__(cls, *args, **kargs):
        if cls.instance is None:
            cls.instance = super().__new__(cls)
        return cls.instance
i1 = A()...

2018-12-08 23:03:26 75
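
A minimal, self-contained sketch of the `__new__`-based singleton from the entry above, with a check that two constructions return the same object. The class name `Singleton` and the attribute `value` are illustrative assumptions.

```python
class Singleton:
    """Class whose constructor always returns the same instance."""

    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the instance only on the first call; reuse it afterwards
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


first = Singleton()
second = Singleton()
first.value = 42  # set through one name, visible through the other

print(first is second, second.value)
```

This sketch is not thread-safe; guarding the creation step with a lock would be one way to handle concurrent first calls.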

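Original: Processes

First, let's start with multitasking: modern operating systems (Windows, Mac OS X, Linux, UNIX, etc.) all support "multitasking". What is multitasking? The operating system can run multiple tasks at the same time. Early computers had single-core CPUs, which handled tasks as follows: the operating system lets the tasks take turns, running QQ for 2 µs, switching to WeChat for 2 µs, then to Momo for 2 µs, and so on. On the surface each task simply keeps running in turn, but the CPU switches so quickly that it feels as if all tasks...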

2018-11-30 23:47:04 102
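
The turn-taking described in the entry above can be illustrated with a toy round-robin simulation. This is only a sketch of the idea, not how an operating system scheduler is implemented; the function name `round_robin` and the task durations are assumptions.

```python
from collections import deque


def round_robin(tasks, time_slice):
    """Return the sequence of (task name, time run) slices under round-robin."""
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        timeline.append((name, run))
        # A task that is not finished goes to the back of the queue
        if remaining > run:
            queue.append((name, remaining - run))
    return timeline


# Illustrative workloads, in the same abstract time unit as the slice
timeline = round_robin({"QQ": 4, "WeChat": 2, "Momo": 3}, time_slice=2)
print(timeline)
```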

Original: Thread pool

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# see https://www.cnblogs.com/zhang293/p/7954353.html
import time
from concurrent.futures import ThreadPoolExecutor
def say_hello(a):
    print("hell...

2018-11-29 23:15:06 99
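
Since the excerpt above is cut off, here is a minimal sketch of a thread pool built on `concurrent.futures.ThreadPoolExecutor`. The body of `say_hello`, the sleep duration, and the sample names are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def say_hello(name):
    """Simulate a short I/O-bound task and return a greeting."""
    time.sleep(0.01)
    return f"hello {name}"


names = ["a", "b", "c"]

# The context manager waits for all submitted work before exiting
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the order of the inputs
    greetings = list(pool.map(say_hello, names))

print(greetings)
```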

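Original: The difference between remainder and modulo

>> mod(5,2)
ans =1 % when the divisor is positive, the result is positive
>> mod(-5,2)
ans =1
>> mod(5,-2)
ans =-1 % when the divisor is negative, the result is negative
>> mod(-5,-2)
ans =-1
% With rem, no matter whether the divisor is...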

2018-11-29 13:45:34 178
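
The entry above uses MATLAB. The same distinction can be sketched in Python, where the `%` operator is a floored modulo (the result takes the sign of the divisor) and `math.fmod` is a truncated remainder (the result takes the sign of the dividend). The variable names are illustrative.

```python
import math

pairs = [(5, 2), (-5, 2), (5, -2), (-5, -2)]

# Floored modulo: sign follows the divisor
mod_results = [a % b for a, b in pairs]

# Truncated remainder: sign follows the dividend (returns floats)
rem_results = [math.fmod(a, b) for a, b in pairs]

for (a, b), m, r in zip(pairs, mod_results, rem_results):
    print(f"a={a:>2} b={b:>2}  a % b = {m:>2}  fmod(a, b) = {r:>4}")
```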

Original: Converting a list to a string, a list to a tuple, and a tuple to a list

''' Convert a list to a string '''
list1 = [str(x) for x in range(10)]
print(type(list1[0]))
str1 = ''.join(list1)
print(str1)
list1 = ['abe', 1, 3, 4, 'c']
list_str = [str(x) for x in list1]
str2 = ''.join(list_st...

2018-11-27 22:48:04 2377
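
A compact sketch of the three conversions named in the title above. The variable names are illustrative; the key point is that `str.join` accepts only strings, so non-string items are converted first.

```python
mixed = ['abe', 1, 3, 4, 'c']

# List -> string: convert every item to str before joining
joined = ''.join(str(item) for item in mixed)

# List -> tuple and tuple -> list are direct constructor calls
as_tuple = tuple(mixed)
back_to_list = list(as_tuple)

digits = ''.join(str(n) for n in range(10))

print(joined, as_tuple, back_to_list, digits)
```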

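Original: Bubble sort, quicksort, selection sort, binary search

'''
How bubble sort works: each pass moves exactly one number into its final position. To sort n numbers, only n-1 of them need to be placed, so n-1 passes are required (numbers already in place are not compared again). Each pass finds the largest number in the part of the list still being compared.
Bubble sort and its optimization: the defining feature of bubble sort is that it only swaps adjacent elements. Starting from the lower bound, one scan pushes the current maximum up to the upper bound; if no swap occurs during a pass, the array is already sorted.
'''
def bubble_sort(se...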

2018-11-27 22:20:35 317
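
The description above (at most n-1 passes, adjacent swaps only, stop early when a pass makes no swap) can be sketched as follows. The post's own function is truncated, so this body is an assumption written to match the description rather than the author's code; it returns a sorted copy instead of sorting in place.

```python
def bubble_sort(values):
    """Return a sorted copy of values using bubble sort with early exit."""
    items = list(values)
    n = len(items)
    for pass_index in range(n - 1):
        swapped = False
        # The last pass_index items are already in their final positions
        for i in range(n - 1 - pass_index):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        # No swap in a full pass means the list is already sorted
        if not swapped:
            break
    return items


result = bubble_sort([5, 1, 4, 2, 8])
print(result)
```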

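Original: Downloading short videos with selenium

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from selenium import webdriver
import urllib
import urllib.request
import time
# An ordinary crawler can use Charles to capture the url, but a Xigua Video url can only be used once
# The url is encrypted, so the data can only be obtained with selenium; the browser url...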

2018-11-24 15:18:36 2423 1

Original: A brief introduction to pytesseract

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import pytesseract
from captcha.image import ImageCaptcha
import random
imageCaptcha = ImageCaptcha()
chars = []
# A ~ Z
for i in range(65,91): ...

2018-11-24 15:16:48 1210

Original: Recognizing CAPTCHAs with pytesseract

Practicing CAPTCHA recognition with tesseract
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import pytesseract
import urllib
import urllib.request
from PIL import Image
url = 'https://so.gushiwen.org/RandCode.as...

2018-11-24 15:14:53 227

Original: Douyu crawler -- using selenium

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time
from selenium import webdriver
chrome = webdriver.Chrome()
# invalid selector: error in the xpath expression
# Titles
# titles = chrome.find_elements_by_xpath('//...

2018-11-21 22:39:35 313

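Original: global and nonlocal scope

In one sentence:
global: to reassign a module-level variable inside a function you must declare it; without the declaration you can only read it.
nonlocal: the same rule applies to a variable of an enclosing function: reading it works without a declaration, but reassigning it requires nonlocal.
Note: "reading" here means using the variable in a computation without changing the value bound to its name; "reassigning" means binding a new value to that name.
global: declaring a name with the global keyword -----> makes it a global variable
Example: global ...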

2018-11-19 18:19:08 118
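
A short sketch of the scoping rules discussed in the entry above. All function and variable names are illustrative assumptions. It shows that reading an outer variable needs no declaration, that rebinding needs `global` or `nonlocal`, and that rebinding without a declaration raises `UnboundLocalError`.

```python
counter = 0


def read_counter():
    # Reading a global needs no declaration
    return counter + 1


def bump_counter():
    global counter  # required because the name is rebound
    counter += 1
    return counter


def broken_bump():
    # Without "global", the assignment makes counter local, so the read fails
    counter += 1


def make_accumulator():
    total = 0

    def peek():
        # Reading an enclosing variable needs no declaration
        return total

    def add(amount):
        nonlocal total  # required because the name is rebound
        total += amount
        return total

    return peek, add


after_bump = bump_counter()
seen = read_counter()
peek, add = make_accumulator()
add(5)
add(2)

try:
    broken_bump()
    broken_raised = False
except UnboundLocalError:
    broken_raised = True

print(after_bump, seen, peek(), broken_raised)
```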

Original: Crawling Xici free proxies and verifying that the IPs work

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import urllib
import urllib.request
from bs4 import BeautifulSoup
from http import client
from threading import Thread
from threa...

2018-11-19 15:22:15 873

Original: Tencent recruitment crawler

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
from bs4 import BeautifulSoup
url = 'https://hr.tencent.com/position.php?&start=%d'
def parse(html,fp): ...

2018-11-16 16:12:14 209

Original: Zhaopin recruitment crawler

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import requests
from bs4 import BeautifulSoup
import urllib.parse
url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'
headers = {'User...

2018-11-16 15:53:00 2243

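Original: oh-my-zsh, "zsh: command not found: mysql"

1. About zsh, see the introduction "The Ultimate Shell: ZSH"; official site: oh-my-zsh
2. After installing zsh, running shell commands produced a series of errors such as "zsh: command not found adb:mysql" and "zsh: command not found: git"
3. Solution: since the cause is that .zshrc has none of the relevant environment variable settings, take all of bash's .bash_profile ...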

2018-11-10 09:17:40 3837 1

Repost: Python list comprehensions, generators, iterables, and iterators

https://www.cnblogs.com/yyds/p/6281453.html

2018-11-09 00:28:35 134

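Original: Python dictionaries (Dictionary)

A dictionary is a mutable container that can store objects of any type. In each key-value pair the key and value are separated by a colon (:), pairs are separated by commas (,), and the whole dictionary is enclosed in braces { }, in the following form:
d = {key1 : value1, key2 : value2 }
Keys are unique: if a key is repeated, the last key-value pair replaces the earlier ones. Values do not need to be unique.
>>>dict = {'a': 1, 'b': ...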

2018-11-08 21:01:26 125
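
The duplicate-key rule stated in the entry above can be checked directly. The variable names and values are illustrative; the sketch avoids naming a variable `dict`, since that would shadow the built-in type.

```python
# A repeated key keeps its first position but takes the last value
scores = {'a': 1, 'b': 2, 'a': 3}

# Values may repeat freely
same_values = {'x': 0, 'y': 0}

# Dictionaries are mutable: keys can be added, updated, and removed
scores['c'] = 4
scores['b'] += 10
removed = scores.pop('a')

print(scores, same_values, removed)
```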

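Repost: Python knowledge map

Sharing a few Python links so we can all improve together!
Python knowledge map: http://lib.csdn.net/base/python/structure
Generators, lambda expressions, and the map, reduce, and filter functions: http://lib.csdn.net/article/python/1297
Decorators: http://lib.csdn.net/article/python...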

2018-11-08 18:19:40 1182

Repost: A detailed introduction to Python decorators

import time
def timer(parameter):
    def outer_wrapper(func):
        def wrapper(*args, **kwargs):
            if parameter == 'task1':
                start = time.time()
                func(*...

2018-11-08 18:01:25 168
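
The excerpt above is cut off inside the wrapper, so here is a minimal sketch of a decorator that takes a parameter and times the wrapped call. The three-level structure follows the excerpt; the `timings` list, the use of `functools.wraps`, and the sample function `add` are assumptions added for illustration.

```python
import functools
import time

timings = []  # collected (label, elapsed seconds) pairs


def timer(label):
    """Decorator factory: record how long each call takes under the given label."""

    def outer_wrapper(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            timings.append((label, time.perf_counter() - start))
            return result

        return wrapper

    return outer_wrapper


@timer("task1")
def add(a, b):
    return a + b


total = add(2, 3)
print(total, timings)
```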

Original: How to use bs4

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# bs4 is a third-party library; install it from the command line first: pip install bs4
import bs4
from bs4 import BeautifulSoup
content = '''<!DOCTYPE html><html lang="en"><head...

2018-11-07 21:49:33 2127

Original: How to use os.path.splitext()

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import os
import requests
from lxml import etree
get_url = 'http://sc.chinaz.com/tupian/huangsetupian.html'
head...

2018-11-07 17:35:25 6657
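
The excerpt above does not reach the `os.path.splitext()` call itself, so here is a small sketch of its behavior: it splits a path into `(root, extension)` at the last dot of the final component. The helper name `extension_of` and the example paths, including the `example.com` URL, are hypothetical.

```python
import os.path


def extension_of(path_or_url):
    """Return the extension (including the dot), or '' if there is none."""
    return os.path.splitext(path_or_url)[1]


image_url = 'http://example.com/images/cat.jpg'  # hypothetical URL
root, ext = os.path.splitext(image_url)

examples = {
    name: os.path.splitext(name)
    for name in ['photo.tar.gz', 'README', '.bashrc']
}

print(root, ext, examples)
```

Only the last extension is split off, and a leading dot in a file name such as `.bashrc` is not treated as an extension.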

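Repost: Issues related to using selenium

I. Automatically locating browser driver files (such as phantomjs.exe/chromedriver.exe)
1. The difference between PhantomJS and chrome: chrome needs no introduction, it is simply the chrome browser. The chromedriver.exe file is what launches it, and running chromedriver.exe from a program automatically starts the chrome browser (provided the computer has already installed ch...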

2018-11-06 19:40:26 97

Original: A PyCharm find-and-replace trick for building FormData

Reference: https://www.cnblogs.com/thunderLL/p/6701374.html

2018-11-02 21:24:36 95

Original: requests (proxy | post | session) practice

import requests
import os
url = 'http://pic.gooooal.com/images/100452/100452654.jpg'
# proxies = {'http':'220.184.213.250:808'}
# Private proxy
ip = os.environ.get('proxyServer')
user = os.environ.get('p...

2018-11-02 14:29:41 787

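Original: Web client authorization with requests

import requests
# Log in to the local server
url = 'http://127.0.0.1:80/'
# Server authentication: username and password
auth = ('Mchael', '123456')
# Authentication is passed directly as an argument to get, which is much simpler than server authentication with urllib.
response = requests.get(url=url, auth=auth)
response.en...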

2018-11-02 13:42:08 474
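
Passing an `auth=(user, password)` tuple to requests selects HTTP Basic authentication. As a rough illustration of what that scheme sends, the sketch below builds the `Authorization` header by hand with the standard library. The helper `basic_auth_header` is an assumption for this example and is not part of requests; it also assumes ASCII credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header (assumes ASCII credentials)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Same credentials as in the excerpt above
headers = basic_auth_header('Mchael', '123456')
print(headers)
```

Basic authentication only encodes the credentials, it does not encrypt them, so it is normally used over HTTPS.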

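Original: requests (third-party library), built on urllib3 and more convenient to use

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# requests is a third-party library written in Python; it must be installed manually before use (pip install requests)
# Very handy, built on urllib3
import requests
'''requests request functions
# requests.Request
# requests.r...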

2018-11-02 12:48:50 1112

Original: Handling web client authorization as an anti-crawling measure

import urllib
import urllib.request
# urllib.error.HTTPError: HTTP Error 401: UNAUTHORIZED
url = 'http://127.0.0.1:80/'
# Web authorization: a dedicated Handler can be used
pwmgr = urllib.request.HTTPPasswordMgrWithDefaultR...

2018-11-02 00:18:21 189
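
The excerpt above stops at the password manager. A minimal sketch of the usual urllib pattern follows: register credentials, wrap the manager in `HTTPBasicAuthHandler`, and build an opener. The credentials `user` and `secret` are placeholders, and no request is actually sent here.

```python
import urllib.request

url = 'http://127.0.0.1:80/'

# Store credentials for the URL; None means "any realm"
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, 'user', 'secret')  # placeholder credentials

# The handler answers 401 challenges using the stored credentials
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

# Look up what the manager would supply for this URL
stored = password_mgr.find_user_password(None, url)
print(stored)

# opener.open(url) would perform the authenticated request against a live server
```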

Original: Douban crawler (CookieJar practice: crawling the response page after user login)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import urllib.parse
from http.cookiejar import CookieJar
import ssl
# Disable certificate verification globally
# ssl._create_default_https_context ...

2018-11-01 22:08:38 550
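
A hedged sketch of the CookieJar setup that the title above refers to: an opener built with `HTTPCookieProcessor` stores cookies from responses (for example a login response) and sends them back on later requests made through the same opener. The helper `cookie_names` is an illustrative assumption, and no network request is made here.

```python
import urllib.request
from http.cookiejar import CookieJar

# The jar collects cookies set by any response that goes through the opener
cookie_jar = CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_handler)


def cookie_names(jar):
    """Return the names of the cookies currently stored in the jar."""
    return sorted(cookie.name for cookie in jar)


# Empty until a request has been made through opener.open(...)
print(cookie_names(cookie_jar))
```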

Original: Using handlers to deal with IP-blocking anti-crawling measures

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import os
# Store the username, password, and ip in environment variables
proxy_server = os.environ.get('proxyServer')
user = os.environ.get('proxyuser')
pas...

2018-10-31 23:28:38 135
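
One way to turn such environment variables into a proxy-aware opener is sketched below with `urllib.request.ProxyHandler`. The variable names `proxyServer` and `proxyuser` come from the excerpt above; the password variable name `proxypassword`, the function `build_proxy_opener`, and the sample values are assumptions, since the excerpt is cut off.

```python
import os
import urllib.request


def build_proxy_opener(env=os.environ):
    """Build an opener that routes traffic through the proxy described in env."""
    server = env.get('proxyServer')
    user = env.get('proxyuser')
    password = env.get('proxypassword')  # hypothetical variable name
    if not server:
        # No proxy configured: fall back to a plain opener
        return urllib.request.build_opener(), None
    credentials = f"{user}:{password}@" if user and password else ""
    proxy_url = f"http://{credentials}{server}"
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler), proxy_url


# Sample values passed explicitly so the sketch does not depend on the real environment
opener, proxy_url = build_proxy_opener(
    {'proxyServer': '127.0.0.1:8080', 'proxyuser': 'demo', 'proxypassword': 'secret'}
)
print(proxy_url)
```

Keeping credentials in environment variables, as the post does, keeps them out of the source file.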

Original: KFC crawler (case practice: ajax, post)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import urllib.parse
# ajax post
post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=pid'
headers = ...

2018-10-31 17:53:30 781

Original: Baidu Tieba crawler (case practice: GET requests)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import ssl
get_url = 'http://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%d'
# Disable certificate verification globally
ssl._create_defaul...

2018-10-31 16:08:52 426
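
The URL template in the entry above has two placeholders, a keyword and a numeric offset. A sketch of filling it in safely is shown below. The function `build_page_url` and the page size of 50 are assumptions, because the excerpt does not show how `pn` is computed.

```python
import urllib.parse

get_url = 'http://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%d'


def build_page_url(keyword, page, page_size=50):
    """Fill the template with a percent-encoded keyword and a page offset.

    page is 1-based; page_size=50 is an assumed number of posts per page.
    """
    encoded = urllib.parse.quote(keyword)  # non-ASCII text becomes %XX escapes
    return get_url % (encoded, (page - 1) * page_size)


first_page = build_page_url('python', 1)
second_page = build_page_url('python', 2)
print(first_page)
print(second_page)
```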

Original: Baidu Translate crawler (case practice: POST requests)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
import json
import ssl
# Use a packet-capture tool to find the API endpoint
post_url = 'https://fanyi.baidu.com/v2transapi'
headers = { 'User-Age...

2018-10-30 21:45:32 1182 1
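
With urllib, a request becomes a POST as soon as it carries a `data` body. The sketch below builds such a request for the endpoint in the entry above without sending it. The form field names (`query`, `from`, `to`), the User-Agent value, and the function `build_post_request` are assumptions; the real interface may require other fields.

```python
import urllib.parse
import urllib.request

post_url = 'https://fanyi.baidu.com/v2transapi'


def build_post_request(form_fields, user_agent='Mozilla/5.0'):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(form_fields).encode('utf-8')
    return urllib.request.Request(
        post_url,
        data=body,  # supplying data switches the method to POST
        headers={'User-Agent': user_agent},
    )


# Hypothetical form fields for illustration only
request = build_post_request({'query': 'spider', 'from': 'en', 'to': 'zh'})
print(request.get_method(), request.data)

# urllib.request.urlopen(request) would send it; the response body could then
# be decoded and parsed with json.loads
```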
