Beautifulsoup Html Table Parsing--only Able To Get The Last Row?
I have a simple HTML table to parse but somehow Beautifulsoup is only able to get me results from the last row. I'm wondering if anyone would take a look at that and see what's wro
Solution 1:
If you want all the th tags in a single list, you need to extend the list. Your loop keeps reassigning cells = row.find_all('th'),
so when you print cells outside the loop you only see what it was last assigned to, i.e. the th tags from the last tr:
cells = []
for row in rows:
    cells.extend(row.find_all('th'))
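The difference between reassigning and extending can be seen in a small self-contained sketch. The HTML here is a made-up, shortened stand-in for the table in the question (the original markup is not shown), and last_only is a hypothetical name used only for the comparison:

```python
from bs4 import BeautifulSoup

# Assumed sample markup, not the original page.
html = """
<table class="participants-table">
  <tr><th>Name</th><th>Type</th></tr>
  <tr><th class="name">Grontmij</th><td>Company</td></tr>
  <tr><th class="name">Groupe Bial</th><td>Company</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Reassigning: every iteration overwrites the previous result,
# so only the th tags of the final row survive the loop.
for row in rows:
    last_only = row.find_all("th")

# Extending: the th tags of every row are accumulated in one list.
cells = []
for row in rows:
    cells.extend(row.find_all("th"))

print([c.get_text() for c in last_only])
print([c.get_text() for c in cells])
```

The first print shows only the last row's th, while the second shows the th tags from all three rows.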
Also, since there is only one table, you can just use find:
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="participants-table")
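As a rough illustration of why find is enough here: find returns the first matching tag (or None when nothing matches), while find_all always returns a list-like result. The one-row table and the class name no-such-class below are assumptions made up for the sketch:

```python
from bs4 import BeautifulSoup

# Assumed minimal markup, only for demonstrating the two calls.
html = '<table class="participants-table"><tr><th>Name</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")       # a single Tag
tables = soup.find_all("table", class_="participants-table")  # a list of Tags
missing = soup.find("table", class_="no-such-class")          # None, no match

print(table.name)
print(len(tables))
print(missing)

# Guard against None before chaining further calls on the result.
if table is not None:
    print(len(table.find_all("tr")))
```

Because find can return None, it is worth checking the result before calling find_all on it when the page layout is not guaranteed.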
If you want to skip the thead row, you can use a CSS selector:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# "thead ~ tr" matches only tr elements that follow thead as siblings
rows = soup.select("table.participants-table thead ~ tr")
cells = [tr.th for tr in rows]
print(cells)
cells will give you:
[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
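If you want the names and links rather than the tags themselves, something along these lines should work. The HTML is an assumed reconstruction based on the output shown in this answer, not the original page, and pairs is just an illustrative name:

```python
from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the sample output in this answer.
html = """
<table class="participants-table">
  <thead>
    <tr><th>Name</th><th>Type</th></tr>
  </thead>
  <tr>
    <th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
    <td>Company</td>
  </tr>
  <tr>
    <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
    <td>Company</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Rows that come after thead as siblings, so the header row is skipped.
rows = soup.select("table.participants-table thead ~ tr")

# One (name, url) tuple per row, taken from the link inside the th.
pairs = [(tr.th.get_text(strip=True), tr.th.a["href"]) for tr in rows]
print(pairs)
```

Note that this relies on the data rows being siblings of thead. If the page wraps them in a tbody, the selector would need to be adjusted, for example to "table.participants-table tbody tr".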
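To write the whole table to CSV:
import csv
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")
# newline="" (Python 3) stops the csv module from writing blank lines on Windows
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # header row: the th texts plus an extra URL column
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        # take only the th/td cells so the nested a tag is not counted twice
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])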
which for your sample will give you:
Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
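To check the CSV output without touching the disk, you could write to an in-memory buffer instead of a file. This is only a sketch: the HTML is an assumed reconstruction of the two sample rows, and it selects just the th and td cells so the link nested inside the th is not counted as an extra column:

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the sample output in this answer.
html = """
<table class="participants-table">
  <thead>
    <tr><th>Name</th><th>Type</th><th>Sector</th><th>Country</th><th>Joined On</th></tr>
  </thead>
  <tr>
    <th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
    <td>Company</td><td>General Industrials</td><td>Netherlands</td><td>2000-09-20</td>
  </tr>
  <tr>
    <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
    <td>Company</td><td>Pharmaceuticals &amp; Biotechnology</td><td>Portugal</td><td>2004-02-19</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

buf = io.StringIO()  # stands in for the open file
wr = csv.writer(buf)

# Header row from the first tr, plus the extra URL column.
wr.writerow([th.get_text(strip=True) for th in rows[0].find_all("th")] + ["URL"])

# Data rows: the th/td cell texts followed by the link target.
for row in rows[1:]:
    texts = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    wr.writerow(texts + [row.th.a["href"]])

lines = buf.getvalue().splitlines()
for line in lines:
    print(line)
```

Swapping buf for a file opened with open("data.csv", "w", newline="") gives the on-disk version.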