Extracting Title From HTML Not Working
I'm performing some text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading as html then later converti
Solution 1:
You can use other BS4 methods, like this one:
title_data = soup.find('title').get_text()
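Note that soup.find('title') returns None when the file has no title element, in which case the call above raises AttributeError. A minimal sketch of a guarded version follows, assuming the built-in html.parser backend; extract_title is a hypothetical helper name, not part of BS4.

```python
from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the stripped <title> text, or None if there is no <title>.

    extract_title is a hypothetical helper name, not part of BS4.
    """
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.find("title")  # a Tag, or None when nothing matches
    if tag is None:
        return None
    return tag.get_text(strip=True)


# Made-up sample markup for illustration
sample = "<html><head><title> Jane Eyre </title></head><body></body></html>"
print(extract_title(sample))
print(extract_title("<html><body><p>no title here</p></body></html>"))
```

Returning None instead of raising lets a batch job over many novels log the files without a title and carry on.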
Solution 2:
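Try this one. The path syntax is ElementTree-style, so it applies to a tree parsed with lxml or xml.etree (called tree here), not to a BeautifulSoup object, where find(".//title") returns None:
title_data = tree.find(".//title").text
or
title_data = tree.findtext('.//title')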
Solution 3:
Try using html.parser instead of lxml, e.g.:
from bs4 import BeautifulSoup
# Open the html file and parse it with the built-in parser
with open("filepath/Jane_Eyre.htm") as html:
    soup = BeautifulSoup(html, 'html.parser')
# Read the text of the <title> tag
title_data = soup.title.string
Your html tag has a namespace, so if you parse it with lxml you need to respect the namespaces.
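To illustrate what respecting the namespace means: in an XHTML file the elements live in the http://www.w3.org/1999/xhtml namespace, so an XML-style lookup for a bare title finds nothing. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (lxml.etree accepts the same namespaces argument). The sample document and the prefix x are made up for illustration, and this only works when the file is well-formed XML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up, well-formed XHTML sample; a real Gutenberg file will differ.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Jane Eyre</title></head>"
    "<body><p>text</p></body>"
    "</html>"
)

root = ET.fromstring(sample)

# A bare tag name does not match, because the element's full name is
# "{http://www.w3.org/1999/xhtml}title".
bare = root.findtext(".//title")

# Mapping a prefix (here "x", an arbitrary choice) to the namespace works.
title_data = root.findtext(".//x:title", namespaces={"x": XHTML_NS})

print(bare)
print(title_data)
```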
Solution 4:
Why not simply use lxml?
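from lxml import html
# source_string holds the html document as a string
page = html.fromstring(source_string)
# "//title" searches the whole tree; "/title" only matches a root element named title
title = page.xpath("//title/text()")[0]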
Solution 5:
The following approach extracts the book titles from the html of a Gutenberg listing page (here, a subject page).
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.gutenberg.org/ebooks/subject/99'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required = soup.find_all("span", {"class": "title"})
>>> x1 = []
>>> for i in required:
...     x1.append(i.get_text())
...
>>> for i in x1:
...     print(i)
...
Sort Alphabetically
Sort by Release Date
Great Expectations
Jane Eyre: An Autobiography
Les Misérables
Oliver Twist
Anne of Green Gables
David Copperfield
The Secret Garden
Anne of the Island
Anne of Avonlea
A Little Princess
Kim
Anne's House of Dreams
Heidi
The Mysteries of Udolpho
Of Human Bondage
The Secret Garden
Daddy-Long-Legs
Les misérables Tome I: Fantine (French)
Jane Eyre
Rose in Bloom
Further Chronicles of Avonlea
The Children of the New Forest
Oliver Twist; or, The Parish Boy's Progress. Illustrated
The Personal History of David Copperfield
Heidi
>>>
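The first two entries in that output, Sort Alphabetically and Sort by Release Date, look like navigation links that share the title class rather than book titles, and some titles appear twice. A minimal sketch of cleaning the list follows, assuming those "Sort ..." entries are the only non-book items and that exact duplicates should be dropped; clean_titles is a hypothetical helper and the short x1 list is a made-up subset of the output above.

```python
def clean_titles(titles):
    """Drop "Sort ..." navigation entries and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        # Assumption: anything starting with "Sort " is a navigation link.
        if not title or title.startswith("Sort "):
            continue
        if title in seen:
            continue
        seen.add(title)
        cleaned.append(title)
    return cleaned


x1 = ["Sort Alphabetically", "Sort by Release Date",
      "Great Expectations", "Heidi", "The Secret Garden", "Heidi"]
print(clean_titles(x1))
```

Repeated titles on the listing page may be different editions of the same work, so skip the duplicate check if each edition matters for the analysis.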