|
New TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
New TDIHtmlParser.DataAsStrTrim8 convenience method.
Change case of HTML tag constants to lower case. This achieves slightly better results for HTML compression.
Bring DIHtmlParser_BookmarkParser demo up to date with latest Mozilla and Chrome bookmark files.
Improved documentation layout.
TDIHtmlParser: When parsing JavaScript, a forward slash "/" inside a regular expression character class was not recognized as such and could lead to an infinite loop.
TDIHtmlCharSetPlugin: Correct decoding function for "GBK" encoding which did not read the 1 to 127 character range.
Work around an unexpected Delphi 2009 automatic numeric AnsiChar Unicode conversion in DIUtils which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value immediately followed by a '{' character (HTML 4.01, Section B.7.1).
Multiple fixes for filtering, most notably for TDITagFilters.SetStart.
Better HTML title parsing, following how Firefox does it.
TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
Modify DIHtmlParser_C6.bpk so that it compiles and installs again with C++ Builder 6.
CharSetConverter demo: Add BOM detection.
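BOM detection of the kind mentioned for the CharSetConverter demo can be sketched as follows. This is an illustrative Python sketch, not the demo's Delphi code; the function name `detect_bom` and the returned encoding labels are assumptions made for this example.

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE because its BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically skip the reported number of bytes and decode the remainder with the reported encoding.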
ExtractText demo: Optional Unicode output controlled by compiler directive. Also add more tags to improve HTML → Text conversion.
WebDownload demo: Improve generation of document names if URI has a query part.
WriterPlugin demo: Support DIHtmlParser1.EnableHtmlTags.
Some new, simple console demos inspired by support questions.
Improve compatibility for parallel installation with other DI packages.
Some code cleanup.
Compatibility with DIConverters 1.11. If you are using DIHtmlParser with DIConverters and encounter incompatibility problems after upgrading to this new version, be sure to use the new version of DIConverters as well.
Add XP Themes to Demo projects.
Fixed a problem when parsing certain kinds of regular expression escapes in JavaScript.
Reduced memory requirements for quickly skipping over JavaScript.
Fixed filtering bugs in TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.
New TDIHtmlParser.EnableHtmlTags property which controls if HTML tags are properly recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
New TDIHtmlParser.TrimAttribValues property which controls whether whitespace is automatically trimmed when parsing the attribute values of tags.
Improved parsing of CustomTags and ASP.
Fixed an error which could prematurely stop TDIUnicodeReader when a pushed source was popped at the end of a nested document.
Added Delphi 3 compatibility to the utility units.
Resolved dependency issues when DIHtmlParser is used in parallel with other DI products.
Compatibility with other DI products.
Added the option to link DIHtmlParser against DIConverters, which enables DIHtmlParser to read and write 130+ character encodings.
Added native Pascal implementation for reading / decoding and writing / encoding the following character sets:
Mac Arabic, Mac Dingbats, Mac Central Europe, Mac Croatian, Mac Cyrillic, Mac Farsi, Mac Greek, Mac Hebrew, Mac Iceland, Mac Roman, Mac Romanian, Mac Thai, Mac Turkish
UCS-2 LE, UCS-2 BE
UCS-4 LE, UCS-4 BE
UTF-32 LE, UTF-32 BE
UTF-7 (Write_UTF_7 / Read_UTF_7)
UTF-7 Optional Direct Characters (Write_UTF_7_ODC / reads as Read_UTF_7)
JIS X0201, NextStep, TIS 620
Improved the parser's handling of malicious markup frequently used in Spam E-Mail: The parser now treats invalid tags (like '<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
Improved parsing of non-conformant XML Processing Instruction (XmlPI), marked as '<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification conformant parsing would then cause DIHtmlParser unintentionally to interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
Improved the recognition of HTML entities lacking a terminating semicolon character (like '&nbsp') in some cases.
Added mapping of some illegal but commonly used HTML numeric entities into their appropriate Unicode value.
Changed the TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll as well as to TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which did not yet return to the caller.
Introduced TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag. Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
Added a TDIHtmlWriterPlugin.PredefinedEntities option for specifying known predefined entities which will always be encoded by default when writing HTML text, regardless of other entity registrations.
Shortened procedure name of TDITag.ForceAttribValue to TDITag.ForceAttrib.
TDITag and descendent classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.
Reworked and simplified the entire filtering mechanism while improving power and flexibility: All HTML pieces can be hidden (fiHide), shown only to the parser or the plugin (fiShowLocal), or shown with full visibility (fiShow). In addition, it is possible to show an individual HTML piece depending on whether the parser and / or any plugin show it (fiShowShown) or hide it (fiShowHidden). This feature is used, for example, by the new TDIHtmlCasePlugin which, to minimize work, only changes the case for those tags which are in fact requested by the application. Also, filtering of individual tags is handled through a separate TDITagFilters object, a single instance of which can be assigned to multiple parsers or plugins to easily synchronize tag filtering. Both the parser and the plugins now share the same filtering syntax, which unfortunately breaks backward compatibility with earlier versions of DIHtmlParser.
Removed the JSTE HTML piece in favour of new Custom Tags. They take the form <#TagName AttribName=AttribValue>, where the '#' character is the Custom Tag start character and can be chosen freely. The corresponding PieceType is ptCustomTag. Custom Tags can be similar to Delphi's TPageProducer HTML-transparent tags, but must not be nested with other HTML pieces. Instead, DIHtmlParser's Custom Tags can be customized by choosing another TDIHtmlParser.CustomTagStartChar. Also, they support start tags as well as End Tags <#/TagName> and Empty Element Tags <#TagName AttribName=AttribValue />.
For improved HTML conformance, parsing of Non-HTML pieces is disabled by default but can easily be enabled at design-time and run-time. Non-HTML pieces are ASP (TDIHtmlParser.EnableASP), Custom Tags (TDIHtmlParser.EnableCustomTags), PHP (TDIHtmlParser.EnablePHP), and SSI (TDIHtmlParser.EnableSSI).
Introduced a new HTML piece CDATA Section, marked as '<![CDATA[' (Char* - (Char* ']]>' Char*)) ']]>'. The corresponding PieceType is ptCDataSection. CDATA Sections are rarely used in plain HTML but form part of XML and SGML. They must not be mistaken for character data, which is all the text that is not markup.
Tightened parsing of DTD pieces to '<!DOCTYPE' (Char - '-')* '>'. This basically means that '<!' is no longer recognized as the beginning of a DTD piece, but '<!DOCTYPE' must be fully present to qualify for a Document Type Definition. This extension follows the introduction of the CDATA section piece (see previous paragraph).
Introduced a new HTML piece XML Processing Instruction (XmlPI), marked as '<?XML Char* ?>'. The corresponding PieceType is ptXmlPI. As part of XML, XML Processing Instructions are also part of XHTML. Up to now DIHtmlParser did not properly distinguish between XML and HTML Processing Instructions ('<? Char* >'), which could lead to a missing '?' character on rare occasions.
SSI (Server Side Includes) is now fully parsed into the TDITag object. This simplifies SSI queries for tags and their attributes and values. Accordingly changed the TDIHtmlWriterPlugin, which no longer takes a simple WideString as an argument for writing SSI but an instance of TDITag instead.
Improved recognition of malformed HTML Tags and Custom Tags as text: Formulas like x<y*3 and similar patterns are parsed as text and not as a <y> tag, for example.
Added a boolean parameter to the TDIHtmlParserPlugin.Handle…(var Show: Boolean) procedures which allows the plugin to determine whether an HTML piece is shown by its parser or not. Plugins may therefore act as "filtering" plugins if required.
New TDIHtmlCasePlugin adjusts the case of tag names and attribute names to upper or lower case.
For the TDIHtmlLinksPlugin, changed the TDIHtmlLinksPluginEvent to pass the Link parameter as a variable parameter. This makes it possible to change the corresponding attribute value of the underlying TDIHtmlParser.HtmlTag directly during the TDIHtmlLinksPlugin.OnLink event. Doing so allows HTML document links to be changed very efficiently.
Extended the TDIHtmlLinksPlugin to recognize the <OBJECT> tag and report the value of its "data" attribute which may contain URLs to the object's data, like images for clickable image maps.
DIHtmlParser takes into account the syntax for scripting macros when writing HTML tag attribute values. This basically means that it does not escape a '&' character occurring in an attribute value immediately followed by a '{' character (see Section B.7.1 of the HTML 4.01 Recommendation).
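The script macro rule can be illustrated with a small Python sketch. It is not DIHtmlParser's Delphi implementation; the function name `escape_attr_ampersands` is hypothetical, and the sketch assumes that every other '&' is written as '&amp;'.

```python
def escape_attr_ampersands(value: str) -> str:
    """Escape '&' as '&amp;' unless it is immediately followed by '{'."""
    out = []
    for i, ch in enumerate(value):
        # value[i + 1 : i + 2] is '' at the end of the string, so a trailing
        # '&' is escaped like any other.
        if ch == "&" and value[i + 1 : i + 2] != "{":
            out.append("&amp;")
        else:
            out.append(ch)
    return "".join(out)
```

With this rule, a value such as `a&b` becomes `a&amp;b`, while a script macro such as `&{randomColor};` is written unchanged.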
Extended DIUri.ResolveRelativeUriW to remove '.' and '..' segments from absolute-path references just as from relative-path references. This is not part of the specifications, but Internet Explorer does it as well. We do it mostly because it is useful when resolving abnormal absolute-path URIs for comparison with already established URIs.
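Removing '.' and '..' segments from an absolute path can be sketched in Python as shown here. This illustrates the general technique only, not the code in DIUri; the function name `remove_dot_segments` is an assumption, and the sketch expects a path that starts with '/'.

```python
def remove_dot_segments(path: str) -> str:
    """Drop '.' segments and resolve '..' segments in an absolute path."""
    segments = path.split("/")
    out = []
    for index, seg in enumerate(segments):
        is_last = index == len(segments) - 1
        if seg in (".", ".."):
            # out[0] is the empty string before the leading '/', which
            # must never be popped: '..' cannot climb above the root.
            if seg == ".." and len(out) > 1:
                out.pop()
            if is_last:
                out.append("")  # keep the trailing slash
            continue
        out.append(seg)
    return "/".join(out)
```

For example, `/a/./b/../c` reduces to `/a/c`, and surplus '..' segments at the root are simply discarded.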
Shortened the names of some TDIUnicodeReader and TDIUnicodeWriter methods.
Shortened and renamed some more properties, methods and procedures for better consistency.
The next single character immediately following a character set meta tag (for example <META http-equiv="content-type" content="charset=GB18030">) was not read according to the new character set specified by a TDIHtmlCharSetPlugin. Adjusted the internal tag parsing routine in a way that the next character is not read before the plugins have been called and DIHtmlParser actually starts parsing the next HTML piece.
Many internal optimizations.
Added new demo projects and updated existing demo projects to use the new features.
Added TDIHtmlParser.NormalizeWhiteSpace property to control whether all text White Space and Control characters outside <PRE> … </PRE> elements are condensed into a single White Space character or preserved just like preformatted text. Following the HTML specification, the default setting is to normalize. Text within <PRE> … </PRE> elements is not affected by the NormalizeWhiteSpace setting: It is never normalized, and all White Space and Control characters are preserved.
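The effect of white space normalization can be sketched in a few lines of Python. This is an illustration, not the parser's code: the function name and the `in_pre` flag are hypothetical, and treating code points 0 to 32 as the White Space and Control characters is an assumption of this sketch.

```python
import re

# Assumption: code points U+0000..U+0020 count as White Space / Control.
_WS_RUN = re.compile(r"[\x00-\x20]+")


def normalize_white_space(text: str, in_pre: bool = False) -> str:
    """Condense each run of white space into one space, except inside PRE."""
    if in_pre:
        return text  # preformatted text is always preserved as-is
    return _WS_RUN.sub(" ", text)
```

Text such as `"a \n\t b"` condenses to `"a b"`, while the same string passed with `in_pre=True` is returned unchanged.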
Added BaseUri property to the TDIHtmlLinksPlugin which monitors <BASE href=…> tags and keeps the base URI of a document up to date.
Added unit DIUri with classes and functions to analyze Uniform Resource Identifiers and to resolve relative URIs.
Added unit DIHtmlColors with functions to convert HTML color strings to RGB color values.
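Converting an HTML color string to RGB values might look like the following Python sketch. It does not reproduce the DIHtmlColors functions; the name `html_color_to_rgb` is hypothetical, and only the `#RRGGBB` form plus a handful of HTML 4 color names are covered.

```python
import string

# A small subset of the sixteen HTML 4 color names, for illustration only.
_NAMED = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "lime": (0, 255, 0),
    "blue": (0, 0, 255),
}


def html_color_to_rgb(color: str):
    """Return an (r, g, b) tuple, or None if the string is not understood."""
    name = color.strip().lower()
    if name in _NAMED:
        return _NAMED[name]
    if len(name) == 7 and name[0] == "#" and all(
        ch in string.hexdigits for ch in name[1:]
    ):
        value = int(name[1:], 16)
        return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF
    return None
```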
Augmented the Help file and added a new Frequently Asked Questions section.
Renamed the TDIUnicodeWriter.FlushBuffer method to TDIUnicodeWriter.FlushDataBuffer.
Internal optimizations.
Added auto-generated *.hpp and *.obj files for using DIHtmlParser with C++ Builder. This is for testing purposes only; DIHtmlParser does not officially support C++ Builder.
Corrected the output of HTML tag attribute values for both TDIHtmlTag and TDIHtmlWriterPlugin:
The "<" sign is no longer encoded to "&lt;" to generate shorter output.
The ">" sign causes attribute values to be quoted in order to prevent premature ending of tag parsing in major HTML browsers.
Empty Element Tags will have an additional space character inserted before the tag end marker "/>" if the very last attribute value is not quoted.
Added read support for the Chinese national character set standard GB18030-2000.
Extended the TDIHtmlLinksPlugin to recognize the <SCRIPT> tag and report the value of its "src" attribute which usually contains a URL to a script file read by the HTML browser.
Extended the TDIHtmlWriterPlugin not to write any line breaks around tags if within a <PRE> element. Tracking preformatted text is based upon the TDIHtmlTag.TagID property of the written tags. The <PRE> tag must therefore be registered for this feature to work (RegisterTag(TAG_PRE, TAG_PRE_ID);).
Completely rewrote the TDIHtmlTag to HTML code conversion for both TDIHtmlTag and TDIHtmlWriterPlugin. Single and double quotes within attribute values (" and ', as used in scripting events) are now properly encoded using entities and / or quoted when necessary, using the shortest possible output method.
Made TDIHtmlTag.Code a property for both reading and writing. Writing expects a valid HTML tag within the string and will update its properties with the first tag encountered. It will clear itself if no tag is found. Changed some other function names to TDIHtmlTag.GetCode, TDIHtmlTag.GetStartCode, TDIHtmlTag.GetEmptyElementCode and TDIHtmlTag.GetEndCode.
Reworked the tag filtering system, which did not operate consistently.
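One way to pick the shortest quoting for an attribute value is sketched below in Python. It is an assumption about how such a rule could work, not the library's actual algorithm; the function name `quote_attr_value` and the exact set of characters that force quoting are hypothetical.

```python
def quote_attr_value(value: str) -> str:
    """Write an attribute value using the shortest safe form."""
    # Assumed rule: empty values and values containing white space, quotes,
    # '>', '=' or a backtick cannot be written without quotes.
    needs_quotes = value == "" or any(ch in " \t\r\n\"'>=`" for ch in value)
    if not needs_quotes:
        return value
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        # Single quotes avoid encoding the embedded double quotes.
        return "'" + value + "'"
    # Both quote kinds occur: encode the double quotes as entities.
    return '"' + value.replace('"', "&quot;") + '"'
```

A scripting event value such as `alert("hi")` is written in single quotes, so no entity encoding is needed.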
Worked around a compiler and linker bug in D4, D5, D6 and D7 where Delphi terminates compilation with an error if any code with WideChar set constants is compiled with TD32 Debug Info enabled.
Memory allocated if characters were read ahead by calling TDIUnicodeReader.PeekChars was not freed. Leak fixed.
Added new Demo project which illustrates how to modify HTML documents.
Code optimizations and bug fixes.
Solved a problem where the HTML tag marker '<' as the final character of an HTML document would lead the parser into an endless loop.
A single '<' character as the first character of a document and not followed by a valid HTML tag character was not reported at all. It is now correctly reported as text.
Tightened the recognition of well-formed and the detection of malformed numerical HTML entities. The parser exits entity parsing as soon as a numerical character reference would overflow the range of a WideChar (e.g. &#65536; or &#x10000;) and could therefore not be properly converted. It also checks that all character entities end with the ';' character in order to be valid. Without this check, entities could be converted improperly, especially when the '&' sign is not escaped in URI attribute values.
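The overflow and semicolon checks for numerical character references can be sketched in Python as follows. This is an illustration of the rule described, not the parser's code; the function name `decode_numeric_entity` and its return convention are assumptions.

```python
def decode_numeric_entity(text: str, start: int = 0):
    """Return (character, end index) for '&#...;' at start, else None."""
    if not text.startswith("&#", start):
        return None
    pos = start + 2
    base, digits = 10, "0123456789"
    if text[pos : pos + 1] in ("x", "X"):
        base, digits = 16, "0123456789abcdefABCDEF"
        pos += 1
    first_digit = pos
    value = 0
    while pos < len(text) and text[pos] in digits:
        value = value * base + int(text[pos], base)
        if value > 0xFFFF:
            return None  # would overflow a 16-bit WideChar: stop at once
        pos += 1
    # At least one digit and a terminating ';' are required.
    if pos == first_digit or text[pos : pos + 1] != ";":
        return None
    return chr(value), pos + 1
```

A caller that receives `None` would treat the '&' as plain text and continue parsing after it.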
Relaxed the parsing of Processing Instructions to recognize the syntax of both the HTML (<?instruction >) and the XML (<?instruction ?>) specifications. In any case, the TDIHtmlWriterPlugin writes the HTML version.
Generally improved the handling of malformed markup to be reported as text.
Added the properties TDIUnicodeReader.CharPos and TDIHtmlParser.StartPos. They return the absolute character position of the current character / HTML piece within the document and can be useful for direct manipulation of the source document.
Added a QuoteTagValues parameter to TDIHtmlTag.Code and TDIHtmlTag.GetStartCode which controls the insertion of quotation marks around the values of HTML tag attributes.
The HtmlExtractTextA function did not copy the right text to the result. TDIHtmlWriterPlugin.WriteComment missed writing the exclamation mark in '<!--'. Both fixed.
Renamed some properties for clarity and to avoid possible naming conflicts with descendent classes.
Internal code optimizations and cleanup.
Included the missing example *.htm files for the TablePlugin demo project.
Added full Unicode / WideString support throughout the entire library. Updated all classes and routines to directly handle and use Unicode / WideStrings. All HTML data is now reported using WideStrings. The change to Unicode required an overall new structure of the entire library, so DIHtmlParser 2.00 is not fully backward compatible with Version 1.xx. The changes mostly apply to string types (AnsiString vs. WideString) and some function names.
Added native implementation of more than 40 character sets for both reading and writing. These include Utf-8, Utf-16-LE, Utf-16-BE, and most ISO, Windows and DOS character sets. The new TDIHtmlCharSetPlugin can automatically watch out for character set changes and update an associated HTML parser accordingly. All demo projects are updated accordingly.
Added entity conversion for the values of an HTML tag attribute. Removed the option to switch off entity conversion. If you really do not want to convert any entities at all, simply do not register any.
Improved the proper handling of malformed HTML tags and entities. Standalone "<" characters are recognized as text instead of the beginning of an HTML tag. Character entities will only be converted if they include a final ";" character. Otherwise they are parsed as simple text.
The SCRIPT and STYLE tags report as individual tags, even when reporting for scripts or styles is turned off. Changed script and style reporting to report the contents of the script / style only instead of linking it with the tag.
Added a HTML piece for the contents of the TITLE tag (ptTitle) as it is parsed differently from normal text.
Introduced the new DIUtils unit due to the many new WideString functions and procedures. It also contains most of the utility functions previously scattered across different units. Therefore, some units are gone (e.g. DIStrings.pas) and many names have changed to allow for more coherent naming conventions.
Extended the TDIHtmlLinksPlugin to recognize the "BGSOUND" tag and report its "src" link.
Added packages for easy installation, updated the demo projects to reflect the Unicode features, and created a project group for all demo projects.
Tag parsing improvements: Whitespace after a HTML tag and its attributes but in front of the terminating '>' character is now being ignored.
Script parsing improvements: Script bodies are now parsed for literal strings delimited by single and double quotation marks (' and ") as well as comments, both single line (//) and multiple line (/* and */). This prevents an end of script marker (</SCRIPT>) inside a literal string or comment from wrongly ending a script. Since there seems to be no crystal-clear syntax for HTML scripting, we have made an effort to simulate common HTML browser script parsing. If in doubt, please re-run your applications on sample HTML data to see the effects.
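The idea of skipping string literals and comments while searching for the end of a script can be sketched in Python. This is a simplified illustration, not DIHtmlParser's parsing routine: the function name `find_script_end` is hypothetical, and the sketch assumes that string literals end at the line break at the latest and ignores regular expression literals.

```python
def find_script_end(source: str, start: int = 0) -> int:
    """Return the index of the closing </script> marker, or -1 if absent."""
    pos, end = start, len(source)
    while pos < end:
        ch = source[pos]
        if ch in "'\"":
            # Skip a string literal, honouring backslash escapes.
            pos += 1
            while pos < end and source[pos] != ch and source[pos] != "\n":
                pos += 2 if source[pos] == "\\" else 1
            pos += 1  # step over the closing quote (or the line break)
        elif source.startswith("//", pos):
            newline = source.find("\n", pos)
            pos = end if newline < 0 else newline + 1
        elif source.startswith("/*", pos):
            close = source.find("*/", pos + 2)
            pos = end if close < 0 else close + 2
        elif source[pos : pos + 9].lower() == "</script>":
            return pos
        else:
            pos += 1
    return -1
```

Markers that appear inside a string or comment are passed over, so only a marker in plain script code ends the script.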
* Added new demo project for the TDIHtmlTablesPlugin.
* Compiled with the updated DIContainers 1.10.
=====DIHtmlParser 1.01 – 17. January 2002=====
* Fixed a minor range-checking issue.
* Added new demo project for the TDIHtmlWriterPlugin.
* Compiled with the updated DIContainers 1.01.
=====DIHtmlParser 1.00 – 11. December 2001=====
* Compiled with the updated DIContainers 1.00. Many improvements and changes there, from which DIHtmlParser benefits. However, they affect mostly the DIHtmlParser internals.
* Miscellaneous enhancements and bug fixes.
products/htmlparser/history.txt · Last modified: 2010/04/25 12:49 (external edit)
|