
Beautifulsoup Html Table Parsing--only Able To Get The Last Row?

I have a simple HTML table to parse but somehow Beautifulsoup is only able to get me results from the last row. I'm wondering if anyone would take a look at that and see what's wro

Solution 1:

You need to use extend if you want all the th tags in a single list. As written, every iteration reassigns cells = row.find_all('th'), so when you print cells outside the loop you only see the last value it was assigned, i.e. the th tags from the last tr:

cells = []
for row in rows:
    cells.extend(row.find_all('th'))
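
To make the difference concrete, here is a minimal sketch that runs both loops side by side. The html string is a hypothetical sample modeled on the rows quoted in this answer, not the asker's actual page, and last_only is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the rows shown in this answer.
html = """
<table class="participants-table">
  <thead>
    <tr><th>Name</th><th>Type</th></tr>
  </thead>
  <tr><th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th><td>Company</td></tr>
  <tr><th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th><td>Company</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="participants-table").find_all("tr")

# Reassigning: each iteration throws away the previous result.
for row in rows:
    last_only = row.find_all("th")

# Extending: each iteration adds to the same list.
cells = []
for row in rows:
    cells.extend(row.find_all("th"))

print(len(last_only))  # th tags of the final row only
print(len(cells))      # th tags collected from every row
```

After the first loop, last_only holds only what the final tr contained, which is the symptom described in the question.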

Also, since there is only one table, you can just use find:

soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")
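
As a rough illustration of how find differs from find_all, the sketch below uses a tiny hypothetical table. The class name no-such-class is made up purely to show the no-match case.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, just enough to exercise both calls.
html = '<table class="participants-table"><tr><th>Name</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")        # first match, a single Tag
tables = soup.find_all("table", class_="participants-table")   # every match, list-like
missing = soup.find("table", class_="no-such-class")           # None when nothing matches

print(type(table).__name__, len(tables), missing)
```

Because find returns None when nothing matches, it is worth checking the result before calling methods on it.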

If you want to skip the thead row, you can use a CSS selector:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.participants-table thead ~ tr")

cells = [tr.th for tr in rows]
print(cells)

cells will give you:

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
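
A self-contained sketch of the sibling selector follows, assuming a hypothetical table in which the data rows sit directly after the thead, as the selector above implies. The variable names is only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the data rows are siblings of the thead.
html = """
<table class="participants-table">
  <thead>
    <tr><th>Name</th><th>Type</th></tr>
  </thead>
  <tr><th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th><td>Company</td></tr>
  <tr><th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th><td>Company</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "thead ~ tr" matches every tr that is a later sibling of the thead,
# so the header row nested inside the thead is skipped.
rows = soup.select("table.participants-table thead ~ tr")
names = [tr.th.get_text() for tr in rows]
print(names)
```

This relies on the parser leaving the tr elements as direct children of the table. A parser that wraps them in a tbody, as html5lib does, would mean they are no longer siblings of the thead and the selector would match nothing.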

To write the whole table to csv:

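import csv

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.participants-table tr")

# newline="" stops the csv module from writing blank lines between rows on Windows
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # Header row: the column names from the first tr, plus an extra URL column
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])

    # Data rows: direct children only (th/td), so the nested a tag's text is not repeated
    for row in rows[1:]:
        wr.writerow([tag.text for tag in row.find_all(recursive=False)] + [row.th.a["href"]])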

which for your sample will give you:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
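
If you want to check the rows before writing a file, one option is to send them to an in-memory buffer first. The sketch below does that with a hypothetical one-row sample reconstructed from the output above. The buffer and lines names are illustrative, and the cells are selected by tag name rather than with a bare find_all().

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup reconstructed from the output shown above.
html = """
<table class="participants-table">
  <thead>
    <tr><th>Name</th><th>Type</th><th>Sector</th><th>Country</th><th>Joined On</th></tr>
  </thead>
  <tr>
    <th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
    <td>Company</td><td>General Industrials</td><td>Netherlands</td><td>2000-09-20</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

buffer = io.StringIO()  # in-memory stand-in for data.csv
wr = csv.writer(buffer)
wr.writerow([th.get_text(strip=True) for th in rows[0].find_all("th")] + ["URL"])
for row in rows[1:]:
    # th/td only, so the text of the nested a tag is not written twice
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    wr.writerow(cells + [row.th.a["href"]])

lines = buffer.getvalue().splitlines()
print(lines)
```

Once the lines look right, swapping the buffer for open("data.csv", "w", newline="") writes the same rows to disk.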
