
Web Automation | Using pyppeteer, the Python Version of puppeteer

  • PY&Rust
  • 2023-05-30

I. Introduction


puppeteer: see "Web Automation Testing: Getting Started with puppeteer in Practice".


pyppeteer: an unofficial Python port of puppeteer. Supports Python 3.5, 3.6, and 3.7.


II. Environment Setup


1. Install Python 3


2. Install pyppeteer


python3 -m pip install pyppeteer


III. Example


import asyncio

from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    await page.screenshot({'path': 'baidu.png'})

    dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')
    print(dimensions)
    # >>> {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Run: python3 pypptr-demo.py


![image.png](https://upload-images.jianshu.io/upload_images/2054612-209615e999320d8d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


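The first run downloads Chromium. Once the script finishes, the page dimensions are printed to the console and the screenshot is saved in the project directory.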


IV. Differences Between puppeteer and pyppeteer


puppeteer and pyppeteer behave the same in most cases; the differences come from the language features of JavaScript and Python.


1. Options


JavaScript:


    const browser = await puppeteer.launch({headless: true})

Python:


    browser = await launch({'headless': True})

or

    browser = await launch(headless=True)

pyppeteer accepts options either as a dictionary or as keyword arguments.
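To show why the two calling styles end up equivalent, here is a minimal sketch. The helper `merge_options` is hypothetical and written only for this article; it is not part of pyppeteer and merely mimics the idea of folding a dictionary and keyword arguments into one options mapping.

```python
from typing import Any, Dict, Optional


def merge_options(options: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Dict[str, Any]:
    """Fold a dict of options and keyword arguments into a single dict.

    Hypothetical helper for illustration only; keyword arguments win
    when the same key appears in both places.
    """
    merged = dict(options or {})
    merged.update(kwargs)
    return merged


dict_style = merge_options({'headless': True})
keyword_style = merge_options(headless=True)
print(dict_style == keyword_style)  # True

# A non-empty string is truthy in Python, so the string 'False' is not the same as False.
print(bool('False'))  # True
```

Because option values are ordinary Python objects, passing real booleans (True/False) rather than strings is the safer choice.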


2. Element selector method names ($ -> querySelector)


In Python, $ cannot be used in a method name. pyppeteer therefore uses Page.querySelector()/Page.querySelectorAll()/Page.xpath() in place of Page.$()/Page.$$()/Page.$x(). It also provides shorthands for these methods: Page.J(), Page.JJ(), and Page.Jx().


puppeteer:


    await page.$('#kw')

pyppeteer:


    await page.querySelector('#kw')

or

    await page.J('#kw')
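When porting puppeteer snippets, the renaming can be kept in a small lookup table. The sketch below is only an illustration: `SELECTOR_METHODS` and `to_pyppeteer` are hypothetical names made up for this article, and the function handles only the simple `object.method(...)` shape. The pyppeteer method names in the table are the ones mentioned in this article plus querySelectorAllEval/JJeval.

```python
# puppeteer method name -> (pyppeteer method name, pyppeteer shorthand)
SELECTOR_METHODS = {
    '$': ('querySelector', 'J'),
    '$$': ('querySelectorAll', 'JJ'),
    '$x': ('xpath', 'Jx'),
    '$eval': ('querySelectorEval', 'Jeval'),
    '$$eval': ('querySelectorAllEval', 'JJeval'),
}


def to_pyppeteer(call: str) -> str:
    """Rewrite e.g. "page.$('#kw')" into "page.querySelector('#kw')".

    Hypothetical helper: it only understands "<object>.<method>(" calls.
    """
    head, sep, rest = call.partition('(')
    obj, dot, method = head.rpartition('.')
    if not sep or not dot or method not in SELECTOR_METHODS:
        return call  # not a selector call we know about; leave it unchanged
    return f"{obj}.{SELECTOR_METHODS[method][0]}({rest}"


print(to_pyppeteer("page.$('#kw')"))
print(to_pyppeteer("page.$$('a')"))
```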

V. Troubleshooting


1. Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:8....


Fix:


pip3 install --upgrade certifi

open "/Applications/Python 3.6/Install Certificates.command"
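To see which CA certificate locations the current Python interpreter is configured with, a quick check using only the standard library can help. This is a diagnostic sketch, not part of the fix itself; `describe_ca_paths` is a hypothetical helper name, and whether a missing bundle is the actual cause of the error on a given machine is an assumption to verify.

```python
import os
import ssl


def describe_ca_paths() -> dict:
    """Report where this Python's OpenSSL build looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        'cafile': paths.cafile,                  # resolved CA bundle file, or None
        'capath': paths.capath,                  # resolved CA directory, or None
        'openssl_cafile': paths.openssl_cafile,  # compiled-in default bundle path
        # True only if the compiled-in default bundle is actually present on disk
        'cafile_exists': bool(paths.openssl_cafile) and os.path.exists(paths.openssl_cafile),
    }


for key, value in describe_ca_paths().items():
    print(f'{key}: {value}')
```

If `cafile_exists` is False, certificate verification may have no bundle to work with, which is consistent with the error above.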


Example code:


import asyncio

from pyppeteer import launch


async def main():
    # Launch in headed (non-headless) mode
    browser = await launch(headless=False, dumpio=True, autoClose=False,
                           args=['--no-sandbox', '--window-size=1920,1080', '--disable-infobars'])
    page = await browser.newPage()           # open a new tab
    await page.setViewport({'width': 1920, 'height': 1080})      # match the viewport to the window size
    await page.goto('https://www.baidu.com/?tn=99669880_hao_pg') # visit the home page

    # evaluate() runs JavaScript in the page; it is useful when reverse-engineering
    # JS that has to run inside a real browser environment.
    # This script overrides navigator.webdriver to make automation harder for the site to detect.
    await page.evaluate('''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')
    # await page.screenshot({'path': './1.jpg'})   # save a screenshot to this path

    page_text = await page.content()   # get the page source
    print(page_text)
    await asyncio.sleep(1)   # non-blocking pause (time.sleep would block the event loop)

asyncio.get_event_loop().run_until_complete(main())  # run the coroutine
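In the example above, the navigator.webdriver override runs only after the page has loaded, so scripts on the page may already have read the original value. A possible variant, sketched here under that assumption, registers the script with page.evaluateOnNewDocument() before navigating. The function name `open_with_override` and the constant `HIDE_WEBDRIVER_JS` are made up for this sketch, and whether the override defeats any particular site's detection is not guaranteed.

```python
import asyncio

# Equivalent override to the one in the example above, kept as a plain string.
HIDE_WEBDRIVER_JS = '''() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
}'''


async def open_with_override(url: str) -> str:
    """Open url with the override registered before navigation; return the HTML."""
    from pyppeteer import launch  # imported lazily so this file loads without pyppeteer

    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        # Register the script first, then navigate.
        await page.evaluateOnNewDocument(HIDE_WEBDRIVER_JS)
        await page.goto(url)
        return await page.content()
    finally:
        await browser.close()  # always release the browser process


# Usage (needs pyppeteer and a Chromium download):
# html = asyncio.get_event_loop().run_until_complete(open_with_override('https://www.baidu.com'))
```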




import asyncio

from pyppeteer import launch


async def main():
    # Setting headless to False launches the browser in headed mode.
    # pyppeteer accepts both dict and keyword options; puppeteer takes only an options object.

    # Specify the browser executable path
    # exepath = r'C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\575458\chrome-win32/chrome.exe'
    # browser = await launch({'executablePath': exepath, 'headless': False, 'slowMo': 30})

    browser = await launch(
        # headless=False,
        {'headless': False}
    )

    page = await browser.newPage()

    # Set the viewport size
    await page.setViewport(viewport={'width': 1280, 'height': 800})

    # Enable or disable JavaScript; with enabled=False the page gets no JS-rendered content
    await page.setJavaScriptEnabled(enabled=True)
    # Navigation timeout: 1000 milliseconds
    res = await page.goto('https://www.toutiao.com/', options={'timeout': 1000})
    resp_headers = res.headers  # response headers
    resp_status = res.status  # response status code

    # Wait
    await asyncio.sleep(2)
    # Alternative: poll for an element in a while loop until it appears
    while not await page.querySelector('.t'):
        await asyncio.sleep(0.1)  # brief pause between polls
    # Scroll to the bottom of the page
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')

    await asyncio.sleep(2)
    # Take a screenshot and save it
    await page.screenshot({'path': 'toutiao.png'})

    # Print the page cookies
    print(await page.cookies())

    # Print the page text
    # Get the full HTML content
    print(await page.content())

    # Run a JavaScript function in the page
    dimensions = await page.evaluate(pageFunction='''() => {
            return {
                width: document.documentElement.clientWidth,  // page width
                height: document.documentElement.clientHeight,  // page height
                deviceScaleFactor: window.devicePixelRatio,  // device pixel ratio, e.g. 1.0000000149011612
            }
        }''', force_expr=False)  # force_expr=False: the string is treated as a function
    print(dimensions)

    # Get only the text; with force_expr=True the string is evaluated as an expression
    content = await page.evaluate(pageFunction='document.body.textContent', force_expr=True)
    print(content)

    # Print the current page title
    print(await page.title())

    # Scrape the news items; XPath expressions also work.
    # pyppeteer offers three ways to locate elements:
    #   Page.querySelector()     # CSS selector, first match
    #   Page.querySelectorAll()  # CSS selector, all matches
    #   Page.xpath()             # XPath expression
    # Shorthands: Page.J(), Page.JJ(), and Page.Jx()
    element = await page.querySelector(".feed-infinite-wrapper > ul>li")  # returns only the first match
    print(element)
    # Get the element's full text content by running JS
    content = await page.evaluate('(element) => element.textContent', element)
    print(content)

    # elements = await page.xpath('//div[@class="title-box"]/a')
    elements = await page.querySelectorAll(".title-box a")
    for item in elements:
        print(await item.getProperty('textContent'))
        # <pyppeteer.execution_context.JSHandle object at 0x000002220E7FE518>

        # Get the text
        title_str = await (await item.getProperty('textContent')).jsonValue()

        # Get the link
        title_link = await (await item.getProperty('href')).jsonValue()
        print(title_str)
        print(title_link)

    # Close the browser
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
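The while loop in the example above polls for an element with no upper bound, so it never ends if the element never appears. pyppeteer's page.waitForSelector() (used in the login example in this article) is the built-in way to wait with a timeout. For conditions that are not a simple selector, a bounded polling helper can be sketched as follows. `wait_until` is a hypothetical helper written for this article, not a pyppeteer API, and the demo uses a stand-in for page.querySelector so that it runs without a browser. It uses asyncio.run(), which requires Python 3.7 or later.

```python
import asyncio
import time


async def wait_until(check, timeout: float = 5.0, interval: float = 0.1):
    """Await check() repeatedly until it returns a truthy value or timeout passes.

    check is any zero-argument callable returning an awaitable, for example
    lambda: page.querySelector('.t'). Hypothetical helper, not a pyppeteer API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = await check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(f'condition not met within {timeout} s')
        await asyncio.sleep(interval)  # yield to the event loop between polls


async def demo():
    attempts = 0

    async def fake_query():
        # Stand-in for page.querySelector: "finds" the element on the third poll.
        nonlocal attempts
        attempts += 1
        return 'element' if attempts >= 3 else None

    found = await wait_until(fake_query, timeout=2.0, interval=0.01)
    return found, attempts


print(asyncio.run(demo()))  # ('element', 3)
```

With a real page, the call would look like `await wait_until(lambda: page.querySelector('.t'), timeout=10)`.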





Login: detecting the slider CAPTCHA and getting cookies


import asyncio

from pyppeteer import launch


async def main():
    browser = await launch({'headless': False, 'args': ['--disable-infobars', '--window-size=1920,1080']})
    page = await browser.newPage()
    await page.setViewport({'width': 1920, 'height': 1080})
    await page.goto('https://login.taobao.com/member/login.jhtml')
    await page.evaluate('''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')
    await page.waitForSelector('#J_QRCodeLogin > div.login-links > a.forget-pwd.J_Quick2Static', {'timeout': 3000})
    await page.click('#J_QRCodeLogin > div.login-links > a.forget-pwd.J_Quick2Static')
    await page.type('#TPL_username_1', '')  # account name
    await page.type('#TPL_password_1', '')  # password
    await asyncio.sleep(5)
    # Check whether the slider is present (note: in many attempts the slider never appeared)
    slider = await page.Jeval('#nocaptcha', 'node => node.style')
    if slider:
        print('Slider CAPTCHA detected')
    await page.click('#J_SubmitStatic')
    await asyncio.sleep(5)
    cookie = await page.cookies()
    print(cookie)
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
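page.cookies() returns a list of dictionaries, each with at least a 'name' and a 'value' key. To reuse the session in another HTTP client, the list usually needs to be flattened. The sketch below shows one way to do that; the helper names `cookies_to_dict` and `cookies_to_header` are hypothetical, and the sample cookies are made-up data, not real output from the login example.

```python
from typing import Dict, Iterable, Mapping


def cookies_to_dict(cookies: Iterable[Mapping]) -> Dict[str, str]:
    """Map cookie names to values, dropping domain/path/expiry details."""
    return {c['name']: c['value'] for c in cookies}


def cookies_to_header(cookies: Iterable[Mapping]) -> str:
    """Build the value of an HTTP Cookie request header."""
    return '; '.join(f'{name}={value}' for name, value in cookies_to_dict(cookies).items())


# Made-up sample in the same shape as the list returned by page.cookies()
sample = [
    {'name': 'sid', 'value': 'abc123', 'domain': '.example.com', 'path': '/'},
    {'name': 'lang', 'value': 'zh-CN', 'domain': '.example.com', 'path': '/'},
]
print(cookies_to_header(sample))  # sid=abc123; lang=zh-CN
```

Flattening to name/value pairs discards the domain and path, so this is only suitable when all cookies belong to the site being requested.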




Example: open web page and take a screenshot.


import asyncio

from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Example: evaluate script on the page.


import asyncio

from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})

    dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')

    print(dimensions)
    # >>> {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Pyppeteer has almost the same API as puppeteer. More APIs are listed in the documentation.


Puppeteer's documentation and troubleshooting guide are also useful for pyppeteer users.


Differences between puppeteer and pyppeteer

Pyppeteer aims to be as similar to puppeteer as possible, but some differences between Python and JavaScript make that difficult.


The differences between puppeteer and pyppeteer are listed below.


Keyword arguments for options

Puppeteer uses an object (a dictionary in Python) to pass options to functions and methods. Pyppeteer accepts both a dictionary and keyword arguments for options.


Dictionary style option (similar to puppeteer):


browser = await launch({'headless': True})

Keyword argument style option (more pythonic, isn't it?):


browser = await launch(headless=True)

Element selector method name ($ -> querySelector)

In Python, $ cannot be used in a method name. So pyppeteer uses Page.querySelector()/Page.querySelectorAll()/Page.xpath() instead of Page.$()/Page.$$()/Page.$x(). Pyppeteer also has shorthands for these methods: Page.J(), Page.JJ(), and Page.Jx().


Arguments of Page.evaluate() and Page.querySelectorEval()

Puppeteer's version of evaluate() takes either a raw JavaScript function or a string containing a JavaScript expression, but pyppeteer takes only a string of JavaScript. That string can be a function or an expression. Pyppeteer tries to detect automatically whether the string is a function or an expression, but the detection sometimes fails. If an expression string is treated as a function and an error is raised, add the force_expr=True option, which forces pyppeteer to treat the string as an expression.


Example to get page content:


content = await page.evaluate('document.body.textContent', force_expr=True)

Example to get element's inner text:


element = await page.querySelector('h1')

title = await page.evaluate('(element) => element.textContent', element) 
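To make the failure mode concrete, here is a simplified illustration of how a string-based heuristic can misclassify an expression. The function `looks_like_function` is written for this article and is only an assumption about the general approach; pyppeteer's actual detection logic may differ.

```python
def looks_like_function(js: str) -> bool:
    """Guess whether a JavaScript string is a function rather than an expression.

    Simplified illustration only; pyppeteer's real detection may differ.
    """
    js = js.strip()
    if js.startswith('function') or js.startswith('async '):
        return True
    return '=>' in js  # any arrow counts, even one nested inside an expression


samples = [
    '() => document.title',        # a function
    'document.body.textContent',   # an expression
    '[1, 2, 3].map(x => x * 2)',   # an expression that contains an arrow function
]
for js in samples:
    print(f'{js!r}: {looks_like_function(js)}')
```

Under this heuristic the third sample is an expression but is classified as a function, which is the kind of case where passing force_expr=True is the appropriate fix.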

