Notes on the WebScraper Utility Class: A Flexible, Easy-to-Use Scraping Framework

Preface

Install the required package and driver manager:

```shell
pip install selenium webdriver-manager
```

1. Class: WebScraper

This utility class wraps the core functionality of browser control, page interaction, and data extraction, aiming to provide a flexible and easy-to-use scraping framework.

2. Initialization

__init__(browser_type="chrome", headless=True, user_agent=None, proxy=None, timeout=30, debug=False)

Purpose: initialize a scraper instance and configure the browser and dev tools.

Parameters:
- browser_type: browser type; one of "chrome", "firefox", "edge"
- headless: whether to run the browser in headless mode
- user_agent: custom User-Agent string
- proxy: proxy configuration, e.g. {"http": "http://proxy.example.com:8080", "https": "http://proxy.example.com:8080"}
- timeout: operation timeout in seconds
- debug: whether to enable debug mode

3. Browser Control Methods

open_url(url)
- Purpose: open the given URL
- Parameters: url — target URL
- Returns: whether the page finished loading

close()
- Purpose: close the browser instance
- Parameters: none

refresh()
- Purpose: refresh the current page
- Parameters: none

go_back()
- Purpose: navigate back to the previous page
- Parameters: none

4. Element Location and Interaction Methods

find_element(selector, by="css", timeout=None)
- Purpose: find a single element
- Parameters:
  - selector: selector string
  - by: selector type; one of "css", "xpath", "id", "class", "name", "link_text", "partial_link_text", "tag_name"
  - timeout: seconds to wait for the element to appear
- Returns: the element object, or None

find_elements(selector, by="css", timeout=None)
- Purpose: find multiple elements
- Parameters: same as find_element
- Returns: the list of elements found

click(element=None, selector=None, by="css", timeout=None)
- Purpose: click an element
- Parameters:
  - element: element object (takes precedence)
  - selector: selector string, used when element is None
  - by: selector type
  - timeout: seconds to wait for the element
- Returns: whether the operation succeeded

type_text(text, element=None, selector=None, by="css", timeout=None, clear_first=True)
- Purpose: type text into an input field
- Parameters:
  - text: the text to type
  - element: element object (takes precedence)
  - selector: selector string, used when element is None
  - by: selector type
  - timeout: seconds to wait for the element
  - clear_first: whether to clear the field first
- Returns: whether the operation succeeded

5. Scrolling Methods

scroll(direction="down", amount=None, element=None, smooth=True, duration=0.5)
- Purpose: scroll the page or an element
- Parameters:
  - direction: scroll direction; one of "up", "down", "left", "right"
  - amount: scroll distance in pixels; defaults to 50% of the page height/width
  - element: element to scroll; defaults to the whole page
  - smooth: whether to scroll smoothly
  - duration: scroll duration in seconds
- Returns: whether the operation succeeded

scroll_to_element(element=None, selector=None, by="css", timeout=None, align="center")
- Purpose: scroll to a given element
- Parameters:
  - element: element object (takes precedence)
  - selector: selector string, used when element is None
  - by: selector type
  - timeout: seconds to wait for the element
  - align: element alignment; one of "top", "center", "bottom"
- Returns: whether the operation succeeded

scroll_to_bottom(element=None, steps=10, delay=0.5)
- Purpose: scroll to the bottom of the page or an element
- Parameters:
  - element: element to scroll; defaults to the whole page
  - steps: number of scroll steps
  - delay: delay between steps in seconds
- Returns: whether the operation succeeded

6. Pagination Methods

next_page(selector=None, method="click", url_template=None, page_param="page", next_page_func=None)
- Purpose: go to the next page
- Parameters:
  - selector: selector for the next-page button, used when method is "click"
  - method: pagination method; one of "click", "url", "function"
  - url_template: URL template, used when method is "url"
  - page_param: name of the page-number parameter, used when method is "url"
  - next_page_func: custom pagination function, used when method is "function"
- Returns: whether pagination succeeded

has_next_page(selector=None, check_func=None)
- Purpose: check whether a next page exists
- Parameters:
  - selector: selector for the next-page button
  - check_func: custom check function
- Returns: a boolean indicating whether a next page exists

set_page(page_num, url_template=None, page_param="page")
- Purpose: jump to a specific page number
- Parameters:
  - page_num: target page number
  - url_template: URL template
  - page_param: name of the page-number parameter
- Returns: whether the operation succeeded

7. Data Extraction Methods

get_text(element=None, selector=None, by="css", timeout=None)
- Purpose: get an element's text content
- Parameters: same as find_element
- Returns: the text content, or None

get_attribute(attribute, element=None, selector=None, by="css", timeout=None)
- Purpose: get the value of an element attribute
- Parameters: attribute — attribute name; other parameters same as find_element
- Returns: the attribute value, or None

extract_data(template)
- Purpose: extract page data according to a template
- Parameters: template — a dict mapping field names to selectors or extraction functions
- Returns: the extracted data

8. DevTools Methods

start_capturing_network()
- Purpose: start capturing network requests
- Parameters: none

stop_capturing_network()
- Purpose: stop capturing network requests
- Parameters: none

get_captured_requests(filter_type=None, url_pattern=None)
- Purpose: get captured network requests
- Parameters:
  - filter_type: filter by request type; e.g. "xhr", "fetch", "script", "image", "stylesheet"
  - url_pattern: filter by URL pattern; supports regular expressions
- Returns: the list of matching requests

add_request_interceptor(pattern, handler_func)
- Purpose: add a request interceptor
- Parameters:
  - pattern: URL match pattern
  - handler_func: handler function; receives the request object and may modify the request or return a custom response
- Returns: the interceptor ID
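The "url" pagination method described in section 6 boils down to substituting the target page number into a placeholder in the URL template. That substitution can be sketched and verified without a browser; the function name and template URL below are illustrative, not part of the class API:

```python
def build_page_url(url_template, page_num, page_param="page"):
    """Substitute the page number into a URL template.

    Mirrors the "url" pagination method: the template contains a
    placeholder like {page}, which is replaced by the target page number.
    """
    return url_template.replace(f"{{{page_param}}}", str(page_num))

# Example: stepping through pages of a hypothetical listing URL
template = "https://example.com/list?page={page}"
urls = [build_page_url(template, n) for n in range(1, 4)]
```

This is also why next_page(method="url") only needs to increment an internal page counter before rebuilding the URL.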
9. Auxiliary Methods

wait_for_element(selector, by="css", timeout=None, condition="visible")
- Purpose: wait for an element to meet a specific condition
- Parameters:
  - selector: selector string
  - by: selector type
  - timeout: timeout in seconds
  - condition: wait condition; one of "visible", "present", "clickable", "invisible", "not_present"
- Returns: the element object, or None

execute_script(script, *args)
- Purpose: execute JavaScript code
- Parameters:
  - script: JavaScript code
  - *args: arguments passed to the JavaScript
- Returns: the result of the JavaScript execution

set_delay(min_delay, max_delay=None)
- Purpose: set a random delay between operations
- Parameters:
  - min_delay: minimum delay in seconds
  - max_delay: maximum delay in seconds; if None, the delay is fixed at min_delay
- Returns: none

take_screenshot(path=None)
- Purpose: take a screenshot of the current page
- Parameters: path — save path; if None, the image data is returned
- Returns: the binary image data if path is None, otherwise the save result

10. Implementation

```python
import time
import random
import json
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, WebDriverException
from selenium.webdriver.common.action_chains import ActionChains

# Optional: for easier driver management
try:
    from webdriver_manager.chrome import ChromeDriverManager
    from webdriver_manager.firefox import GeckoDriverManager
    from webdriver_manager.microsoft import EdgeChromiumDriverManager
    WEBDRIVER_MANAGER_AVAILABLE = True
except ImportError:
    WEBDRIVER_MANAGER_AVAILABLE = False
    print("Consider installing webdriver-manager for easier driver setup: pip install webdriver-manager")


class WebScraper:
    _BY_MAP = {
        "css": By.CSS_SELECTOR,
        "xpath": By.XPATH,
        "id": By.ID,
        "class": By.CLASS_NAME,  # Note: find-by-class-name only works for a single class
        "name": By.NAME,
        "link_text": By.LINK_TEXT,
        "partial_link_text": By.PARTIAL_LINK_TEXT,
        "tag_name": By.TAG_NAME,
    }

    def __init__(self, browser_type="chrome", headless=True, user_agent=None,
                 proxy=None, timeout=30, debug=False):
        self.browser_type = browser_type.lower()
        self.headless = headless
        self.user_agent = user_agent
        self.proxy = proxy
        self.timeout = timeout
        self.debug = debug
        self.driver = None
        self.current_page_num = 1  # For URL-based pagination
        self._min_delay = 0.5
        self._max_delay = 1.5
        self._network_requests_raw = []  # To store JS-collected network entries
        self._setup_driver()

    def _print_debug(self, message):
        if self.debug:
            print(f"[DEBUG] {message}")

    def _setup_driver(self):
        self._print_debug(f"Setting up {self.browser_type} browser...")
        options = None
        service = None
        if self.browser_type == "chrome":
            options = webdriver.ChromeOptions()
            if self.user_agent:
                options.add_argument(f"user-agent={self.user_agent}")
            if self.headless:
                options.add_argument("--headless")
                options.add_argument("--window-size=1920,1080")  # Often needed for headless
            if self.proxy:
                if "http" in self.proxy:  # Basic proxy; for authenticated proxies use selenium-wire
                    options.add_argument(f"--proxy-server={self.proxy['http']}")
                elif "https" in self.proxy:  # Selenium typically uses one proxy for all traffic
                    options.add_argument(f"--proxy-server={self.proxy['https']}")
            options.add_argument("--disable-gpu")
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.chrome.service.Service(ChromeDriverManager().install())
                    self.driver = webdriver.Chrome(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Chrome failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Chrome(options=options)  # Fallback to PATH
            else:
                self.driver = webdriver.Chrome(options=options)
        elif self.browser_type == "firefox":
            options = webdriver.FirefoxOptions()
            if self.user_agent:
                options.set_preference("general.useragent.override", self.user_agent)
            if self.headless:
                options.add_argument("--headless")
            if self.proxy:
                # Firefox proxy setup is more involved, via preferences
                if "http" in self.proxy:
                    host, port = self.proxy["http"].replace("http://", "").split(":")
                    options.set_preference("network.proxy.type", 1)
                    options.set_preference("network.proxy.http", host)
                    options.set_preference("network.proxy.http_port", int(port))
                if "https" in self.proxy:  # Assuming the same proxy for https
                    host, port = self.proxy["https"].replace("https://", "").split(":")
                    options.set_preference("network.proxy.ssl", host)
                    options.set_preference("network.proxy.ssl_port", int(port))
                # options.set_preference("network.proxy.share_proxy_settings", True)  # if one proxy for all
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.firefox.service.Service(GeckoDriverManager().install())
                    self.driver = webdriver.Firefox(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Firefox failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Firefox(options=options)
            else:
                self.driver = webdriver.Firefox(options=options)
        elif self.browser_type == "edge":
            options = webdriver.EdgeOptions()
            if self.user_agent:
                options.add_argument(f"user-agent={self.user_agent}")
            if self.headless:
                options.add_argument("--headless")  # Edge uses the Chromium engine
                options.add_argument("--window-size=1920,1080")
            if self.proxy and "http" in self.proxy:  # Basic proxy
                options.add_argument(f"--proxy-server={self.proxy['http']}")
            options.add_argument("--disable-gpu")
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.edge.service.Service(EdgeChromiumDriverManager().install())
                    self.driver = webdriver.Edge(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Edge failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Edge(options=options)
            else:
                self.driver = webdriver.Edge(options=options)
        else:
            raise ValueError(f"Unsupported browser: {self.browser_type}")
        self.driver.implicitly_wait(self.timeout / 2)  # Implicit wait for elements
        self.driver.set_page_load_timeout(self.timeout)
        self._print_debug(f"{self.browser_type} browser setup complete.")

    def _get_selenium_by(self, by_string):
        by_string = by_string.lower()
        if by_string not in self._BY_MAP:
            raise ValueError(f"Invalid selector type: {by_string}. Supported: {list(self._BY_MAP.keys())}")
        return self._BY_MAP[by_string]

    def _perform_delay(self):
        time.sleep(random.uniform(self._min_delay, self._max_delay))

    # --- Browser Control ---
    def open_url(self, url):
        self._print_debug(f"Opening URL: {url}")
        try:
            self.driver.get(url)
            self._perform_delay()
            # A simple check; for a true "loaded" status you may need to wait for a specific element
            return self.driver.execute_script("return document.readyState") == "complete"
        except WebDriverException as e:
            self._print_debug(f"Error opening URL {url}: {e}")
            return False

    def close(self):
        if self.driver:
            self._print_debug("Closing browser.")
            self.driver.quit()
            self.driver = None

    def refresh(self):
        self._print_debug("Refreshing page.")
        self.driver.refresh()
        self._perform_delay()

    def go_back(self):
        self._print_debug("Going back to previous page.")
        self.driver.back()
        self._perform_delay()

    # --- Element Location & Interaction ---
    def find_element(self, selector, by="css", timeout=None):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Finding element by {by}: {selector} with timeout {wait_timeout}s")
        try:
            wait = WebDriverWait(self.driver, wait_timeout)
            return wait.until(EC.presence_of_element_located((self._get_selenium_by(by), selector)))
        except TimeoutException:
            self._print_debug(f"Element not found by {by}: {selector} within {wait_timeout}s.")
            return None
        except Exception as e:
            self._print_debug(f"Error finding element by {by}: {selector}: {e}")
            return None

    def find_elements(self, selector, by="css", timeout=None):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Finding elements by {by}: {selector} with timeout {wait_timeout}s")
        try:
            # Wait for at least one element to be present to ensure page readiness
            WebDriverWait(self.driver, wait_timeout).until(
                EC.presence_of_all_elements_located((self._get_selenium_by(by), selector)))
            # Then find all elements without further explicit wait beyond the implicit one
            return self.driver.find_elements(self._get_selenium_by(by), selector)
        except TimeoutException:
            self._print_debug(f"No elements found by {by}: {selector} within {wait_timeout}s.")
            return []
        except Exception as e:
            self._print_debug(f"Error finding elements by {by}: {selector}: {e}")
            return []

    def click(self, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.wait_for_element(selector, by, timeout, condition="clickable")
        if element:
            try:
                self._print_debug(f"Clicking element: {element.tag_name} (selector: {selector})")
                # Fall back to a JavaScript click if the standard click is intercepted
                try:
                    element.click()
                except WebDriverException:  # e.g. ElementClickInterceptedException
                    self._print_debug("Standard click failed, trying JavaScript click.")
                    self.driver.execute_script("arguments[0].click();", element)
                self._perform_delay()
                return True
            except Exception as e:
                self._print_debug(f"Error clicking element: {e}")
                return False
        self._print_debug("Element not provided or not found for click.")
        return False

    def type_text(self, text, element=None, selector=None, by="css", timeout=None, clear_first=True):
        if not element and selector:
            element = self.wait_for_element(selector, by, timeout, condition="visible")
        if element:
            try:
                self._print_debug(f"Typing text '{text}' into element: {element.tag_name} (selector: {selector})")
                if clear_first:
                    element.clear()
                element.send_keys(text)
                self._perform_delay()
                return True
            except Exception as e:
                self._print_debug(f"Error typing text: {e}")
                return False
        self._print_debug("Element not provided or not found for typing.")
        return False

    # --- Scrolling Methods ---
    def scroll(self, direction="down", amount=None, element=None, smooth=True, duration=0.5):
        self._print_debug(f"Scrolling {direction}...")
        target = "window"
        if element:
            target = "arguments[0]"  # The element will be passed as arguments[0]
        behavior = "smooth" if smooth else "auto"
        # Note: elements expose clientHeight/clientWidth, while window uses innerHeight/innerWidth
        if direction == "down":
            scroll_val = amount if amount is not None else (f"{target}.clientHeight / 2" if element else "window.innerHeight / 2")
            script = f"{target}.scrollBy({{ top: {scroll_val}, left: 0, behavior: '{behavior}' }});"
        elif direction == "up":
            scroll_val = amount if amount is not None else (f"{target}.clientHeight / 2" if element else "window.innerHeight / 2")
            script = f"{target}.scrollBy({{ top: -{scroll_val}, left: 0, behavior: '{behavior}' }});"
        elif direction == "left":
            scroll_val = amount if amount is not None else (f"{target}.clientWidth / 2" if element else "window.innerWidth / 2")
            script = f"{target}.scrollBy({{ top: 0, left: -{scroll_val}, behavior: '{behavior}' }});"
        elif direction == "right":
            scroll_val = amount if amount is not None else (f"{target}.clientWidth / 2" if element else "window.innerWidth / 2")
            script = f"{target}.scrollBy({{ top: 0, left: {scroll_val}, behavior: '{behavior}' }});"
        else:
            self._print_debug(f"Invalid scroll direction: {direction}")
            return False
        try:
            if element:
                self.driver.execute_script(script, element)
            else:
                self.driver.execute_script(script)
            time.sleep(duration)  # Allow time for a smooth scroll to complete
            return True
        except Exception as e:
            self._print_debug(f"Error during scroll: {e}")
            return False

    def scroll_to_element(self, element=None, selector=None, by="css", timeout=None, align="center"):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            self._print_debug(f"Scrolling to element (selector: {selector}) with align: {align}")
            try:
                # 'block' can be 'start', 'center', 'end', or 'nearest'; 'inline' is the
                # horizontal counterpart. For simplicity, map align to the block options.
                align_js = "{ behavior: 'smooth', block: 'center', inline: 'nearest' }"
                if align == "top":
                    align_js = "{ behavior: 'smooth', block: 'start', inline: 'nearest' }"
                elif align == "bottom":
                    align_js = "{ behavior: 'smooth', block: 'end', inline: 'nearest' }"
                self.driver.execute_script(f"arguments[0].scrollIntoView({align_js});", element)
                self._perform_delay()  # Give it a moment to scroll
                return True
            except Exception as e:
                self._print_debug(f"Error scrolling to element: {e}")
                return False
        self._print_debug("Element not provided or not found for scroll_to_element.")
        return False

    def scroll_to_bottom(self, element=None, steps=10, delay=0.5):
        self._print_debug("Scrolling to bottom...")
        target = "document.body"
        target_el_for_js = None
        if element:
            target = "arguments[0]"
            target_el_for_js = element
        try:
            last_height_script = f"return {target}.scrollHeight"
            scroll_script = f"{target}.scrollTop = {target}.scrollHeight;"
            for _ in range(steps):
                if target_el_for_js:
                    last_height = self.driver.execute_script(last_height_script, target_el_for_js)
                    self.driver.execute_script(scroll_script, target_el_for_js)
                else:
                    last_height = self.driver.execute_script(last_height_script)
                    self.driver.execute_script(scroll_script)
                time.sleep(delay)
                if target_el_for_js:
                    new_height = self.driver.execute_script(last_height_script, target_el_for_js)
                else:
                    new_height = self.driver.execute_script(last_height_script)
                if new_height == last_height:  # Reached bottom or no more content loaded
                    break
            self._print_debug("Scrolled to bottom (or no more content loaded).")
            return True
        except Exception as e:
            self._print_debug(f"Error scrolling to bottom: {e}")
            return False

    # --- Pagination Methods ---
    def next_page(self, selector=None, method="click", url_template=None, page_param="page", next_page_func=None):
        self._print_debug(f"Attempting to go to next page using method: {method}")
        if method == "click":
            if not selector:
                self._print_debug("Selector for next page button is required for 'click' method.")
                return False
            next_button = self.wait_for_element(selector, condition="clickable")
            if next_button:
                return self.click(element=next_button)
            self._print_debug("Next page button not found or not clickable.")
            return False
        elif method == "url":
            if not url_template:
                self._print_debug("URL template is required for 'url' method.")
                return False
            self.current_page_num += 1
            next_url = url_template.replace(f"{{{page_param}}}", str(self.current_page_num))
            return self.open_url(next_url)
        elif method == "function":
            if not callable(next_page_func):
                self._print_debug("A callable function is required for 'function' method.")
                return False
            try:
                return next_page_func(self)  # Pass the scraper instance to the custom function
            except Exception as e:
                self._print_debug(f"Custom next_page_func failed: {e}")
                return False
        else:
            self._print_debug(f"Invalid pagination method: {method}")
            return False

    def has_next_page(self, selector=None, check_func=None):
        self._print_debug("Checking for next page...")
        if callable(check_func):
            try:
                return check_func(self)
            except Exception as e:
                self._print_debug(f"Custom check_func for has_next_page failed: {e}")
                return False
        elif selector:
            # Check whether the element is present and, often, whether it's disabled
            element = self.find_element(selector)
            if element:
                is_disabled = element.get_attribute("disabled")
                class_attr = element.get_attribute("class")
                # Common patterns for disabled buttons
                if is_disabled or (class_attr and ("disabled" in class_attr or "inactive" in class_attr)):
                    self._print_debug("Next page element found but appears disabled.")
                    return False
                return True
            return False
        self._print_debug("No selector or check_func provided for has_next_page.")
        return False  # Default to no next page if there's insufficient info

    def set_page(self, page_num, url_template=None, page_param="page"):
        if not url_template:
            self._print_debug("URL template is required for set_page.")
            return False
        self._print_debug(f"Setting page to: {page_num}")
        self.current_page_num = page_num
        target_url = url_template.replace(f"{{{page_param}}}", str(page_num))
        return self.open_url(target_url)

    # --- Data Extraction Methods ---
    def get_text(self, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            try:
                text = element.text
                self._print_debug(f"Extracted text: {text[:50]}... from element (selector: {selector})")
                return text
            except Exception as e:
                self._print_debug(f"Error getting text: {e}")
                return None
        self._print_debug("Element not provided or not found for get_text.")
        return None

    def get_attribute(self, attribute, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            try:
                value = element.get_attribute(attribute)
                self._print_debug(f"Extracted attribute '{attribute}': {value} from element (selector: {selector})")
                return value
            except Exception as e:
                self._print_debug(f"Error getting attribute '{attribute}': {e}")
                return None
        self._print_debug("Element not provided or not found for get_attribute.")
        return None

    def extract_data(self, template):
        """Extract data based on a template.

        Template format: {field_name: css_selector or (css_selector, attribute_name) or callable}.
        If the rule is callable, it receives the scraper instance. To extract multiple
        items (e.g., a list), point the selector at the parent of those items and let a
        callable handle finding and processing the sub-elements. For simplicity, template
        values here are selectors for single items, or callables for custom logic.
        """
        self._print_debug(f"Extracting data with template: {template}")
        extracted_data = {}
        for field_name, rule in template.items():
            value = None
            try:
                if isinstance(rule, str):  # Simple CSS selector for text
                    value = self.get_text(selector=rule)
                elif isinstance(rule, tuple) and len(rule) == 2:  # (selector, attribute)
                    value = self.get_attribute(selector=rule[0], attribute=rule[1])
                elif callable(rule):  # Custom extraction function
                    value = rule(self)  # Pass the scraper instance
                else:
                    self._print_debug(f"Invalid rule for field '{field_name}': {rule}")
                extracted_data[field_name] = value
            except Exception as e:
                self._print_debug(f"Error extracting field '{field_name}' with rule {rule}: {e}")
                extracted_data[field_name] = None
        return extracted_data

    # --- DevTools Methods (limited by standard Selenium) ---
    def start_capturing_network(self):
        """Clear previously captured network requests (from JS).

        Actual continuous network capture requires selenium-wire or the browser's
        DevTools Protocol.
        """
        self._print_debug("Starting network capture (clearing previous JS logs).")
        self._network_requests_raw = []
        # Note: this doesn't actively start a capture process in the browser's network
        # panel; it just prepares our internal list for new entries gathered by
        # get_captured_requests.

    def stop_capturing_network(self):
        """Conceptually stop capture. With the JS method, new calls to
        get_captured_requests will include data up to this point, but nothing
        explicitly stops in the browser."""
        self._print_debug("Stopping network capture (conceptual for JS method).")
        # No direct action for JS-based capture; it's always available.

    def get_captured_requests(self, filter_type=None, url_pattern=None):
        """Get network requests using the JavaScript Performance API. This is a snapshot.

        filter_type: e.g. 'script', 'img', 'css', 'xmlhttprequest', 'fetch'
        url_pattern: regex string to filter URLs
        """
        self._print_debug("Getting captured network requests via JavaScript Performance API.")
        try:
            # Get all resource timing entries
            current_entries = self.driver.execute_script(
                "return window.performance.getEntriesByType('resource');")
            if isinstance(current_entries, list):
                self._network_requests_raw.extend(current_entries)  # Append new ones
            # Deduplicate based on name (URL) and startTime to keep the list manageable
            seen = set()
            deduplicated_requests = []
            for entry in sorted(self._network_requests_raw, key=lambda x: x.get("startTime", 0)):
                identifier = (entry.get("name"), entry.get("startTime"))
                if identifier not in seen:
                    deduplicated_requests.append(entry)
                    seen.add(identifier)
            self._network_requests_raw = deduplicated_requests
            filtered_requests = []
            for req in self._network_requests_raw:
                # req is a dict like:
                # {'name': url, 'entryType': 'resource', 'startTime': 123.45, 'duration': 67.89,
                #  'initiatorType': 'script'/'img'/'css'/'link'/'xmlhttprequest', ...}
                if filter_type:
                    # initiatorType is more reliable for filtering than entryType (always 'resource')
                    initiator = req.get("initiatorType", "").lower()
                    if filter_type.lower() == "xhr":  # Common alias
                        if initiator != "xmlhttprequest":
                            continue
                    elif filter_type.lower() not in initiator:
                        continue
                if url_pattern:
                    if not re.search(url_pattern, req.get("name", "")):
                        continue
                filtered_requests.append(req)
            self._print_debug(f"Found {len(filtered_requests)} filtered network requests.")
            return filtered_requests
        except WebDriverException as e:
            self._print_debug(f"Error getting network requests via JS: {e}")
            return []

    def add_request_interceptor(self, pattern, handler_func):
        """NOTE: true request interception is NOT reliably possible with standard
        Selenium. It requires tools like selenium-wire or direct DevTools Protocol
        interaction, which are more complex to set up and manage. This method is a
        placeholder to acknowledge the design spec."""
        self._print_debug("WARNING: add_request_interceptor is not implemented with standard "
                          "Selenium. Consider using selenium-wire for this functionality.")
        # Return a dummy ID so callers can run without error
        return f"dummy_interceptor_id_{pattern}"

    # --- Auxiliary Methods ---
    def wait_for_element(self, selector, by="css", timeout=None, condition="visible"):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Waiting for element by {by}: {selector} to be {condition} (timeout: {wait_timeout}s)")
        try:
            wait = WebDriverWait(self.driver, wait_timeout)
            sel_by = self._get_selenium_by(by)
            if condition == "visible":
                element = wait.until(EC.visibility_of_element_located((sel_by, selector)))
            elif condition == "present":
                element = wait.until(EC.presence_of_element_located((sel_by, selector)))
            elif condition == "clickable":
                element = wait.until(EC.element_to_be_clickable((sel_by, selector)))
            elif condition == "invisible":
                # Tricky case: we want the element if it exists and is invisible, or None
                # if it is visible/not found. Wait for presence, then check visibility.
                present_element = wait.until(EC.presence_of_element_located((sel_by, selector)))
                if not present_element.is_displayed():
                    element = present_element
                else:  # Element is present AND visible, so condition 'invisible' is false
                    raise TimeoutException(f"Element {selector} was visible, not invisible.")
            elif condition == "not_present":
                # invisibility_of_element_located waits for staleness or non-presence;
                # it doesn't return the element, so signal success with True.
                if wait.until(EC.invisibility_of_element_located((sel_by, selector))):
                    self._print_debug(f"Element by {by}: {selector} confirmed not present or invisible.")
                    return True  # Success for this condition, though no element is returned
                else:  # Should not happen if invisibility_of_element_located works as expected
                    return None
            else:
                raise ValueError(f"Unsupported condition: {condition}")
            return element
        except TimeoutException:
            self._print_debug(f"Element by {by}: {selector} did not meet condition '{condition}' within {wait_timeout}s.")
            return None
        except Exception as e:
            self._print_debug(f"Error waiting for element {selector} condition {condition}: {e}")
            return None

    def execute_script(self, script, *args):
        self._print_debug(f"Executing script: {script[:100]}...")
        try:
            return self.driver.execute_script(script, *args)
        except WebDriverException as e:
            self._print_debug(f"Error executing script: {e}")
            return None

    def set_delay(self, min_delay, max_delay=None):
        self._print_debug(f"Setting delay: min={min_delay}, max={max_delay}")
        self._min_delay = min_delay
        self._max_delay = max_delay if max_delay is not None else min_delay

    def take_screenshot(self, path=None):
        self._print_debug(f"Taking screenshot. Path: {path if path else 'return as PNG data'}")
        try:
            if path:
                return self.driver.save_screenshot(path)  # Returns True on success
            return self.driver.get_screenshot_as_png()  # Returns binary data
        except WebDriverException as e:
            self._print_debug(f"Error taking screenshot: {e}")
            return None if path is None else False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
```

11. Example Application

```python
from WebScraper import WebScraper
import time
import json


def main():
    # Initialize a WebScraper instance (non-headless, so we can watch it work)
    scraper = WebScraper(
        browser_type="chrome",
        headless=False,
        timeout=15,
        debug=True,
    )
    try:
        # 1. Open the Baidu home page
        baidu_url = "https://www.baidu.com"
        print("Opening the Baidu home page...")
        if not scraper.open_url(baidu_url):
            print("Failed to open the Baidu home page")
            return

        # 2. Enter a search keyword and run the search
        search_keyword = "人工智能发展趋势"  # Change to any search keyword
        print(f"Searching for: {search_keyword}")

        # Locate the search box and type the keyword
        search_input = scraper.find_element(selector="#kw", by="css")
        if not search_input:
            print("Search box element not found")
            return
        scraper.type_text(text=search_keyword, element=search_input)

        # Click the search button
        if not scraper.click(selector="#su", by="css"):
            print("Failed to click the search button")
            return

        # Wait for the results to load
        time.sleep(2)
        print("Search results loading...")

        # 3. Scroll to the bottom of the page
        print("Scrolling to the bottom of the page...")
        if scraper.scroll_to_bottom(steps=10, delay=0.1):
            print("Reached the bottom of the page")
        else:
            print("Failed to scroll to the bottom of the page")

        # 4. Extract the result titles on the current page
        print("Extracting results from the current page...")
        result_titles = scraper.find_elements(selector="h3.t a", by="css")
        if result_titles:
            print(f"Found {len(result_titles)} result titles:")
            for i, title in enumerate(result_titles, 1):
                print(f"{i}. {title.text}")
        else:
            print("No result titles found")

        # 5. Page through more results (demo: pages 2-3)
        for page in range(2, 4):
            print(f"\nGoing to page {page}...")
            # Method 1: use next_page to click the "next page" button
            next_button_selector = ".n"  # CSS selector of Baidu's next-page button
            if scraper.next_page(selector=next_button_selector, method="click"):
                print(f"On page {page}, waiting for it to load...")
                time.sleep(2)
                # Scroll to the bottom of the new page
                scraper.scroll_to_bottom(steps=10, delay=0.3)
                time.sleep(1)
                # Extract the new page's results
                result_titles = scraper.find_elements(selector="h3.t a", by="css")
                if result_titles:
                    print(f"Page {page}: found {len(result_titles)} result titles:")
                    for i, title in enumerate(result_titles, 1):
                        print(f"{i}. {title.text}")
                else:
                    print(f"Page {page}: no result titles found")
            else:
                print(f"Failed to reach page {page}; probably the last page")
                break

        # 6. Extract structured data with extract_data
        print("\nExtracting structured data with extract_data:")
        data_template = {
            "search_keyword": lambda s: search_keyword,
            "current_time": lambda s: time.strftime("%Y-%m-%d %H:%M:%S"),
            "page_title": "title",
            # The original used an empty-selector trick for the URL; reading it
            # from the driver directly is more reliable
            "current_url": lambda s: s.driver.current_url,
            "result_count": lambda s: len(s.find_elements("h3.t a")),
        }
        extracted_data = scraper.extract_data(data_template)
        for key, value in extracted_data.items():
            print(f"{key}: {value}")

        # 7. Save the data to a JSON file
        with open("baidu_search_results.json", "w", encoding="utf-8") as f:
            json.dump(extracted_data, f, ensure_ascii=False, indent=2)
        print("\nData saved to baidu_search_results.json")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # 8. Close the browser
        scraper.close()
        print("Browser closed")


if __name__ == "__main__":
    main()
```
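The filtering and de-duplication that get_captured_requests applies to window.performance resource entries can be isolated and exercised without a browser. The sketch below reimplements that logic over plain dicts whose shape mimics the Performance API's resource entries; the function name and sample data are illustrative only:

```python
import re

def filter_requests(entries, filter_type=None, url_pattern=None):
    """Deduplicate and filter performance entries the way
    get_captured_requests does: drop duplicate (name, startTime) pairs,
    then filter on initiatorType (with "xhr" as an alias for
    "xmlhttprequest") and an optional URL regex."""
    seen = set()
    deduped = []
    for entry in sorted(entries, key=lambda x: x.get("startTime", 0)):
        ident = (entry.get("name"), entry.get("startTime"))
        if ident not in seen:
            deduped.append(entry)
            seen.add(ident)
    results = []
    for req in deduped:
        if filter_type:
            initiator = req.get("initiatorType", "").lower()
            if filter_type.lower() == "xhr":
                if initiator != "xmlhttprequest":
                    continue
            elif filter_type.lower() not in initiator:
                continue
        if url_pattern and not re.search(url_pattern, req.get("name", "")):
            continue
        results.append(req)
    return results
```

Keeping this logic pure makes it easy to unit-test the filtering rules separately from the Selenium-dependent snapshotting.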