Scrape Google Search Results Titles And Urls Using Python

October 31, 2022

I'm working on a project using Python (3.7) in which I need to scrape the titles and URLs of the first few Google results. I tried using BeautifulSoup, but it doesn't work.

Solution 1:

Try the Selenium browser-automation library. It lets you scrape data from pages that are rendered dynamically (via JavaScript or AJAX requests).

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

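# Selenium 3 style call. Newer Selenium 4 releases expect
# webdriver.Chrome(service=Service('/usr/bin/chromedriver')) instead,
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome('/usr/bin/chromedriver')
google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
driver.get(google_url)
time.sleep(3)  # crude fixed wait for the page to finish rendering

soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()  # close the browser once the HTML has been captured
result_div = soup.find_all('div', attrs={'class': 'g'})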


links = []
titles = []
descriptions = []
for r in result_div:
    # r.find() returns a Tag or None, so a missing element does not raise by itself
    try:
        link = r.find('a', href=True)
        title = r.find('h3')

        if isinstance(title,Tag):
            title = title.get_text()

        description = r.find('span', attrs={'class': 'st'})

        if isinstance(description, Tag):
            description = description.get_text()

        # Note: find() never returns '', so this check always passes and a
        # missing title or description is appended as None (see the output below)
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # If no link was found, link['href'] raises; print the error and skip this result
    except Exception as e:
        print(e)
        continue

print(titles)
print(links)
print(descriptions)

Output:

['Welcome to Python.org', 'Download Python | Python.org', 'Python Tutorial - W3Schools', 'Introduction to Python - W3Schools', 'Python Programming Language - GeeksforGeeks', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python Tutorial - Tutorialspoint', 'Python Download and Installation Instructions', 'Python vs C++ - Find Out The 9 Important Differences - eduCBA', None, 'Description']
['https://www.python.org/', 'https://www.python.org/downloads/', 'https://www.w3schools.com/python/', 'https://www.w3schools.com/python/python_intro.asp', 'https://www.geeksforgeeks.org/python-programming-language/', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://www.tutorialspoint.com/python/', 'https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html', 'https://www.educba.com/python-vs-c-plus-plus/', '/search?num=5&q=Python&stick=H4sIAAAAAAAAAONgFuLQz9U3MK0yjFeCs7SEs5Ot9JPzc3Pz86yKM1NSyxMri1cxsqVZOQZ4Fi9iZQuoLMnIzwMAlVPV1j0AAAA&sa=X&ved=2ahUKEwigvcqKx8XiAhUOSX0KHdtmBgoQzTooADAQegQIChAC', 'mailto:?body=Python%20https%3A%2F%2Fwww.google.com%2Fsearch%3Fkgmid%3D%2Fm%2F05z1_%26hl%3Den-IN%26kgs%3De1764a9f31831e11%26q%3DPython%26shndl%3D0%26source%3Dsh%2Fx%2Fkp%26entrypoint%3Dsh%2Fx%2Fkp']
['The official home of the Python Programming Language.', 'Looking for Python 2.7? See below for specific releases. Contribute to the PSF by Purchasing a PyCharm License. All proceeds benefit the PSF. Donate Now\xa0...', 'Python can be used on a server to create web applications. ... Our "Show Python" tool makes it easy to learn Python, it shows both the code and the result.', 'What is Python? Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for: web development\xa0...', 'Python is a widely used general-purpose, high level programming language. It was initially designed by Guido van Rossum in 1991 and developed by Python\xa0...', None, None, None, None, None, None, None]
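The output above also contains entries that are not ordinary results: a relative '/search?...' link, a 'mailto:' link, and a None title. Below is a minimal sketch of one way to filter the three parallel lists using only the standard library. The helper name keep_organic and the shortened sample values are assumptions for illustration and are not part of the original answer.

```python
from urllib.parse import urlparse


def keep_organic(titles, links, descriptions):
    """Keep entries that have a title and an absolute http(s) link.

    Hypothetical helper: it filters three parallel lists like the ones
    built by the loop above and returns a list of dictionaries.
    """
    kept = []
    for title, link, description in zip(titles, links, descriptions):
        parsed = urlparse(link or "")
        if title and parsed.scheme in ("http", "https") and parsed.netloc:
            kept.append({"title": title, "link": link, "description": description})
    return kept


# Shortened sample values in the shape of the lists printed above
sample_titles = ["Welcome to Python.org", None, "Description"]
sample_links = ["https://www.python.org/", "/search?num=5&q=Python", "mailto:?body=Python"]
sample_descriptions = ["The official home of the Python Programming Language.", None, None]

organic = keep_organic(sample_titles, sample_links, sample_descriptions)
print(organic)
```

A description of None is kept on purpose here, since several genuine results in the output above have no description.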

Here '/usr/bin/chromedriver' is the path to the Selenium web driver (chromedriver) executable.

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

Solution 2:

There's absolutely no need for Selenium: the elements are already there in the HTML, and the page is not rendered by JavaScript the way YouTube or Google Maps are.

Try to use .select()/.select_one(), because they are usually faster, prettier, and more flexible than .find()/.findAll(). See the CSS selectors reference.

Also, you are not actually raising any exception; the code simply continues executing.
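To make that distinction concrete, here is a minimal sketch that uses plain dictionaries instead of parsed HTML. A helper really does raise when a field is missing, and the loop then decides to skip that result. The names MissingFieldError and require_fields, and the second sample result, are hypothetical and not part of BeautifulSoup or the original code.

```python
class MissingFieldError(ValueError):
    """Raised when a scraped result lacks a required field (hypothetical name)."""


def require_fields(result, fields=("title", "link")):
    """Return the result unchanged, or raise if a required field is empty or missing."""
    for field in fields:
        if not result.get(field):
            raise MissingFieldError(f"missing field: {field}")
    return result


raw_results = [
    {"title": "Ice cream - Wikipedia", "link": "https://en.wikipedia.org/wiki/Ice_cream"},
    {"title": None, "link": "https://example.com/"},  # made-up incomplete result
]

complete, skipped = [], []
for raw in raw_results:
    try:
        complete.append(require_fields(raw))
    except MissingFieldError as error:
        # The helper raised; the caller handles it by recording and skipping
        skipped.append(str(error))

print(complete)
print(skipped)
```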

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# pass the query via params so that requests URL-encodes the space
html = requests.get('https://www.google.com/search', params={'q': 'ice cream'}, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

# collect data
data = []

# these class names reflect Google's markup when this answer was written and may change
for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  snippet = result.select_one('#rso .lyLwlc').text

  # append the parsed result to the list
  data.append({
      'title': title,
      'link': link,
      'snippet': snippet,
  })

print(json.dumps(data, indent=2, ensure_ascii=False))

--------
'''
[
  {
    "title": "Ice cream - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Ice_cream",
    "snippet": "Ice cream is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, ..."
  }
...
]
'''
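Both solutions above write the search URL by hand, and the second one puts a raw space in the query. A minimal sketch of building the URL with the standard library so that the query is properly encoded; build_search_url is a hypothetical helper name, and 'q' and 'num' are simply the parameters already used in the code above.

```python
from urllib.parse import urlencode


def build_search_url(query, num=None):
    """Build a Google search URL with an encoded query string (hypothetical helper)."""
    params = {"q": query}
    if num is not None:
        params["num"] = num
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("ice cream", num=5)
print(url)  # https://www.google.com/search?q=ice+cream&num=5
```

With requests, the same encoding happens automatically when the query is passed through the params argument of requests.get().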

Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to iterate over JSON data already parsed from Google (or the other search engines SerpApi supports), rather than building everything from scratch and maintaining the parser over time.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

----
'''
Title: Ice cream - Wikipedia
Summary: Ice cream is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, ...
Link: https://en.wikipedia.org/wiki/Ice_cream
...
'''
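If a result happens to lack one of these keys, indexing with result['snippet'] raises a KeyError. A minimal sketch of a more tolerant loop follows, assuming the response is an ordinary dictionary. It runs on a hand-written sample shaped like the output above rather than on a real API response, and format_results is a hypothetical helper name.

```python
def format_results(results):
    """Format organic results, tolerating missing keys (hypothetical helper)."""
    lines = []
    for result in results.get("organic_results", []):
        title = result.get("title", "(no title)")
        snippet = result.get("snippet", "(no snippet)")
        link = result.get("link", "(no link)")
        lines.append(f"Title: {title}\nSummary: {snippet}\nLink: {link}\n")
    return lines


# Hand-written sample shaped like the output above, not a real API response
sample = {
    "organic_results": [
        {"title": "Ice cream - Wikipedia", "link": "https://en.wikipedia.org/wiki/Ice_cream"},
    ]
}

formatted = format_results(sample)
for line in formatted:
    print(line)
```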

Disclaimer: I work for SerpApi.
