
Resolving HTML Entities with NSXMLParser on iPhone

I think I read every single web page relating to this problem but I still cannot find a solution to it, so here I am. I have an HTML web page which is not under my control and I ne

Solution 1:

After exploring several alternatives, it appears that NSXMLParser does not support entities other than the five predefined XML entities: &lt;, &gt;, &apos;, &quot; and &amp;.

The code below fails with an NSXMLParserUndeclaredEntityError.

// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys:
                     [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
                     [NSString stringWithFormat:@"%C", 0x00E0], @"agrave", 
                     ...
                     ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}

Declaring the entities by prepending ENTITY declarations to the HTML document parses without error; however, the expanded entities are not passed back to parser:foundCharacters:, so the è and à characters are dropped.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
]>

In another experiment, I created a completely valid XML document with an internal DTD:

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
    <!ELEMENT author (#PCDATA)>
    <!ENTITY js "Jo Smith">
]>
<author>&lt;&js;&gt;</author>

I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser receives the entity data. However, parser:foundCharacters: is only called for the predefined entities.

2010-03-20 12:53:59.871 xmlParsing[1012:207] ParserDidStartDocument
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model:
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value:Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type:(null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before:
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After:<
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before:<
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After:<
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before:<
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After:<
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before:<
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After:<>
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content:<>
2010-03-20 12:53:59.880 xmlParsing[1012:207] ParserDidEndDocument

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser is missing functionality here. What should happen is that NSXMLParser or its delegate stores the entity definitions and provides them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.

This has been fun.

Solution 2:

A possibly less hacky solution is to replace the DTD with a local, modified copy in which every external entity declaration is replaced with a local one.

This is how I do it:

First, find and replace the document DTD declaration with a local file. For example, replace this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>

with this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://localhost/Users/siuying/Library/Application%20Support/iPhone%20Simulator/6.1/Applications/17065C0F-6754-4AD0-A1EA-9373F6476F8F/App.app/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>


Download the DTD from the W3C URL and add it to your app bundle. You can find the path of the file with the following code:

NSBundle* bundle = [NSBundle bundleForClass:[self class]];
NSString* path = [[bundle URLForResource:@"xhtml1-transitional" withExtension:@"dtd"] absoluteString];

Open the DTD file and find any external entity reference, such as:

<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;

Replace it with the content of the entity file (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent in the above case).

After replacing all external references, NSXMLParser should handle the entities properly without needing to download every remote DTD and external entity file each time it parses an XML file.
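The first step above (pointing the DOCTYPE at a local file) could be sketched roughly as follows. This is only an illustration in plain C, which compiles as-is inside an Objective-C source file. The function name point_doctype_at_local_dtd is hypothetical, and the sketch assumes a PUBLIC doctype with double-quoted identifiers and no internal subset. The local_url argument would be the file URL obtained from the bundle lookup shown earlier.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of any Apple or libxml2 API).
   Returns a malloc'ed copy of `doc` in which the system identifier of the
   first <!DOCTYPE ... PUBLIC "public-id" "system-id"> declaration is
   replaced by `local_url`. Returns NULL if no such declaration is found
   or if allocation fails. The caller frees the result.
   Assumptions: double-quoted identifiers, no internal subset ([...]). */
static char *point_doctype_at_local_dtd(const char *doc, const char *local_url)
{
    const char *decl = strstr(doc, "<!DOCTYPE");
    if (!decl) return NULL;

    const char *end = strchr(decl, '>');       /* end of the declaration */
    const char *pub = strstr(decl, "PUBLIC");
    if (!end || !pub || pub > end) return NULL;

    const char *q1 = strchr(pub, '"');                  /* opens public id  */
    const char *q2 = q1 ? strchr(q1 + 1, '"') : NULL;   /* closes public id */
    const char *q3 = q2 ? strchr(q2 + 1, '"') : NULL;   /* opens system id  */
    const char *q4 = q3 ? strchr(q3 + 1, '"') : NULL;   /* closes system id */
    if (!q4 || q4 > end) return NULL;

    size_t head = (size_t)(q3 + 1 - doc);  /* everything up to the opening quote */
    size_t tail = strlen(q4);              /* closing quote and the rest */
    size_t url_len = strlen(local_url);

    char *out = malloc(head + url_len + tail + 1);
    if (!out) return NULL;

    memcpy(out, doc, head);
    memcpy(out + head, local_url, url_len);
    memcpy(out + head + url_len, q4, tail + 1);  /* also copies the NUL */
    return out;
}
```

The returned buffer can then be wrapped in an NSData object and handed to the parser in place of the original document.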

Solution 3:

You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.
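As a rough illustration of such a pre-pass, the sketch below rewrites named entities as decimal character references (for example &egrave; becomes &#232;), which an XML parser resolves without needing a DTD. It is plain C, which compiles as-is inside an Objective-C source file. The function name replace_named_entities and the three-entry table are assumptions made for the example; a real table would cover the entities listed in xhtml-lat1.ent.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, deliberately partial table: entity name -> code point. */
typedef struct { const char *name; unsigned code; } EntityEntry;

static const EntityEntry kEntities[] = {
    { "agrave", 0x00E0 },
    { "egrave", 0x00E8 },
    { "nbsp",   0x00A0 },
};
static const size_t kEntityCount = sizeof kEntities / sizeof kEntities[0];

/* Returns a malloc'ed copy of `in` with known named entities rewritten as
   decimal character references. Names not in the table (including the
   predefined XML entities such as &amp;) are copied unchanged.
   Returns NULL on allocation failure. The caller frees the result. */
static char *replace_named_entities(const char *in)
{
    size_t len = strlen(in), amps = 0;
    for (const char *p = in; *p; ++p)
        if (*p == '&') ++amps;

    /* A reference such as "&#1114111;" is at most 10 bytes, so adding
       10 bytes per ampersand is a safe upper bound. */
    char *out = malloc(len + amps * 10 + 1);
    if (!out) return NULL;

    char *w = out;
    const char *p = in;
    while (*p) {
        const EntityEntry *hit = NULL;
        size_t name_len = 0;

        if (*p == '&') {
            const char *semi = strchr(p + 1, ';');
            if (semi) {
                name_len = (size_t)(semi - (p + 1));
                for (size_t i = 0; i < kEntityCount; ++i) {
                    if (strlen(kEntities[i].name) == name_len &&
                        strncmp(kEntities[i].name, p + 1, name_len) == 0) {
                        hit = &kEntities[i];
                        break;
                    }
                }
            }
        }

        if (hit) {
            w += sprintf(w, "&#%u;", hit->code);
            p += name_len + 2;          /* skip '&', the name and ';' */
        } else {
            *w++ = *p++;                /* ordinary byte: copy as-is */
        }
    }
    *w = '\0';
    return out;
}
```

Using character references rather than raw bytes keeps the pre-pass independent of the document's encoding. As noted in Solution 1, this kind of whole-buffer replacement is most reasonable for small documents.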

Solution 4:

I think you're going to run into another problem with this example, as it isn't valid XML, which is what NSXMLParser expects.

The exact problem in the above is that the META, LI, HTML and BODY tags aren't closed, so the parser looks all the way through the rest of the document for their closing tags.

The only way around this that I know of, if you cannot change the HTML, is to mirror it with the closing tags inserted.

Solution 5:

I would try using a different parser, such as libxml2. In theory, I think it should be able to handle poor HTML.
