--- lib/HTML/Encoding.pm.orig 2011-03-27 14:53:03.000000000 -0700
+++ lib/HTML/Encoding.pm 2011-03-27 15:28:44.000000000 -0700
@@ -523,20 +523,20 @@
=head1 WARNING
-The interface and implementation are guranteed to change before this
+The interface and implementation are guaranteed to change before this
module reaches version 1.00! Please send feedback to the author of
this module.
=head1 DESCRIPTION
HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
-documents...
+documents.
=head1 DEFAULT ENCODINGS
-Most routines need to know some suspected character encodings which
+Most routines need to know some suspected character encodings; these
can be provided through the C option. This option always
-defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference
+defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference,
which means the following encodings are considered by default:
* ISO-8859-1
@@ -546,7 +546,7 @@
* UTF-32BE
* UTF-8
-If you change the values or pass custom values to the routines note
+If you change the values or pass custom values to the routines, note
that L must support them in order for this module to work
correctly.
@@ -554,7 +554,7 @@
C, C, and
C return in list context the encoding
-source and the encoding name, possible encoding sources are
+source and the encoding name. Possible encoding sources are:
* protocol (Content-Type: text/html;charset=encoding)
* bom (leading U+FEFF)
@@ -565,21 +565,21 @@
=head1 ROUTINES
-Routines exported by this module at user option. By default, nothing
-is exported.
+Routines may be exported by this module at the user's option. By
+default, nothing is exported.
=over 2
=item encoding_from_content_type($content_type)
Takes a byte string and uses L to extract the
-charset parameter from the C header value and returns
+charset parameter from the C header value. Returns
its value or C (or an empty list in list context) if there
is no such value. Only the first component will be examined
(HTTP/1.1 only allows for one component), any backslash escapes in
strings will be unescaped, all leading and trailing quote marks
and white-space characters will be removed, all white-space will be
-collapsed to a single space, empty charset values will be ignored
+collapsed to a single space, empty charset values will be ignored,
and no case folding is performed.
Examples:
@@ -596,28 +596,28 @@
| "text/html;charset=\" UTF-8 \"" | 'UTF-8' |
+-----------------------------------------+-----------+
-If you pass a string with the UTF-8 flag turned on the string will
+If you pass a string with the UTF-8 flag turned on, the string will
be converted to bytes before it is passed to L.
-The return value will thus never have the UTF-8 flag turned on (this
-might change in future versions).
+The return value will thus never have the UTF-8 flag turned on. (This
+might change in future versions.)
=item encoding_from_byte_order_mark($octets [, %options])
-Takes a sequence of octets and attempts to read a byte order mark
-at the beginning of the octet sequence. It will go through the list
-of $options{encodings} or the list of default encodings if no
-encodings are specified and match the beginning of the string against
-any byte order mark octet sequence found.
-
-The result can be ambiguous, for example qq(\xFF\xFE\x00\x00) could
-be both, a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
-U+0000 character. It is also possible that C<$octets> starts with
+Takes a sequence of octets and attempts to read a byte order mark at the
+beginning of the octet sequence. It will go through the list of
+$options{encodings} (or the list of default encodings if no encodings
+are specified) and match the beginning of the string against any byte
+order mark octet sequence found.
+
+The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
+be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
+U+0000 character. It is also possible for C<$octets> to start with
something that looks like a byte order mark but actually is not.
-encoding_from_byte_order_mark sorts the list of possible encodings
-by the length of their BOM octet sequence and returns in scalar
-context only the encoding with the longest match, and all encodings
-ordered by length of their BOM octet sequence in list context.
+encoding_from_byte_order_mark sorts the list of possible encodings by
+the length of their BOM octet sequence. In scalar context, it returns
+only the encoding with the longest match. In list context, it returns
+all encodings ordered by length of their BOM octet sequence.
Examples:
@@ -634,9 +634,9 @@
| "\x2B\x2F\x76\x38\x2D" | UTF-7 | qw(UTF-7) |
+-------------------------+------------+-----------------------+
-Note however that for UTF-7 it is in theory possible that the U+FEFF
-combines with other characters in which case such detection would fail,
-for example consider:
+Note, however, that for UTF-7 it is theoretically possible for U+FEFF to
+combine with other characters, in which case such detection would fail.
+For example, consider:
+--------------------------------------+-----------+-----------+
| Input | Encodings | Result |
@@ -649,15 +649,17 @@
relevant for most applications as there should never be need to use
UTF-7 in the encoding list for existing documents.
-If no BOM can be found it returns C in scalar context and an
-empty list in list context. This routine should not be used with
-strings with the UTF-8 flag turned on.
+If no BOM can be found, it returns C in scalar context or an
+empty list in list context.
+
+This routine should not be used with strings with the UTF-8 flag turned
+on.
=item encoding_from_xml_declaration($declaration)
Attempts to extract the value of the encoding pseudo-attribute in an XML
declaration or text declaration in the character string $declaration. If
-there does not appear to be such a value it returns nothing. This would
+there does not appear to be such a value, it returns nothing. This would
typically be used with the return values of xml_declaration_from_octets.
Normalizes whitespaces like encoding_from_content_type.
@@ -688,12 +690,14 @@
=item xml_declaration_from_octets($octets [, %options])
Attempts to find a ">" character in the byte string $octets using the
-encodings in $encodings and upon success attempts to find a preceding
-"<" character. Returns all the strings found this way in the order of
-number of successful matches in list context and the best match in
-scalar context. Should probably be combined with the only user of this
-routine, encoding_from_xml_declaration... You can modify the list of
-suspected encodings using $options{encodings};
+encodings in $encodings, and upon success attempts to find a preceding
+"<" character. In list context, returns all the strings found this way
+in the order of number of successful matches; or in scalar context,
+returns the best match. You can modify the list of suspected encodings
+using $options{encodings};
+
+(Should probably be combined with the only user of this routine,
+encoding_from_xml_declaration...)
=item encoding_from_first_chars($octets [, %options])
@@ -707,9 +711,9 @@
document is a HTML document) to get at least a base encoding which can be
used to decode enough of the document to find elements using
encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/.
-Returns nothing if unsuccessful. Returns the matching encodings in order
-of the number of octets matched in list context and the best match in
-scalar context.
+Returns nothing if unsuccessful. In list context, returns the matching
+encodings in order of the number of octets matched. In scalar context,
+returns the best match.
Examples:
@@ -742,14 +746,14 @@
are found, uses encoding_from_content_type to extract the charset
-parameter. It returns all such encodings it could find in document
-order in list context or the first encoding in scalar context (it
-will currently look for others regardless of calling context) or
-nothing if that fails for some reason.
+parameter. It returns (in list context) all such encodings it could find
+in document order, or (in scalar context) the first encoding, or nothing
+if that fails for some reason. (Currently it will look for any and all
+encodings even when called in scalar context.)
-Note that there are many edge cases where this does not yield in
+Note that there are many edge cases where this does not yield
"proper" results depending on the capabilities of the HTML::Parser
-version and the options you pass for it, for example,
+version and the options you pass for it. For example:
@@ -759,19 +763,19 @@
...
This would likely not detect the C value if HTML::Parser
-does not resolve the entity. This should however only be a concern
+does not resolve the entity. This should, however, only be a concern
for documents specifically crafted to break the encoding detection.
=item encoding_from_xml_document($octets, [, %options])
-Uses encoding_from_byte_order_mark to detect the encoding using a
-byte order mark in the byte string and returns the return value of
-that routine if it succeeds. Uses xml_declaration_from_octets and
-encoding_from_xml_declaration and returns the encoding for which
-the latter routine found most matches in scalar context, and all
-encodings ordered by number of occurences in list context. It
-does not return a value of neither byte order mark not inbound
-declarations declare a character encoding.
+Uses encoding_from_byte_order_mark to detect the encoding using a byte
+order mark in the byte string. Returns the return value of that routine
+if it succeeds. Uses xml_declaration_from_octets and
+encoding_from_xml_declaration, and (in scalar context) returns the
+encoding for which the latter routine found most matches, or (in list
+context) all encodings ordered by number of occurences. It does not
+return a value of neither byte order mark not inbound declarations
+declare a character encoding.
Examples:
@@ -787,12 +791,12 @@
+----------------------------+----------+-----------+----------+
Lacking a return value from this routine and higher-level protocol
-information (such as protocol encoding defaults) processors would
+information (such as protocol encoding defaults), processors would
be required to assume that the document is UTF-8 encoded.
-Note however that the return value depends on the set of suspected
+Note, however, that the return value depends on the set of suspected
encodings you pass to it. For example, by default, EBCDIC encodings
-would not be considered and thus for
+would not be considered, and thus for
@@ -803,7 +807,7 @@
Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of HTML documents. If $options{xhtml} is
-set to a false value uses encoding_from_byte_order_mark and
+set to a false value, uses encoding_from_byte_order_mark and
encoding_from_meta_element to determine the encoding. The xhtml
option is on by default. The $options{encodings} can be used to
modify the suspected encodings and $options{parser_options} can
@@ -811,13 +815,13 @@
encoding_from_meta_element (see the relevant documentation).
Returns nothing if no declaration could be found, the winning
-declaration in scalar context and a list of encoding source
-and encoding name in list context, see ENCODING SOURCES.
+declaration in scalar context, or a list of encoding source
+and encoding name in list context. See L"ENCODING SOURCES">.
...
Other problems arise from differences between HTML and XHTML syntax
-and encoding detection rules, for example, the input could be
+and encoding detection rules. For example, the input could be:
Content-Type: text/html
@@ -829,14 +833,14 @@
...
-This is a perfectly legal HTML 4.01 document and implementations
-might be expected to consider the document ISO-8859-2 encoded as
-XML rules for encoding detection do not apply to HTML documents.
-This module attempts to avoid making decisions which rules apply
-for a specific document and would thus by default return 'utf-8'
-for this input.
+This is a perfectly legal HTML 4.01 document and implementations might
+be expected to consider the document to have ISO-8859-2 encoding, as XML
+rules for encoding detection do not apply to HTML documents. This
+module attempts to avoid making decisions on which rules apply for a
+specific document, and would thus by default return 'utf-8' for this
+input.
-On the other hand, if the input omits the encoding declaration,
+On the other hand, if the input omits the encoding declaration, thus:
Content-Type: text/html
@@ -848,8 +852,10 @@
...
-It would return 'iso-8859-2'. Similar problems would arise from
-other differences between HTML and XHTML, for example consider
+it would return 'iso-8859-2'.
+
+Similar problems would arise from other differences between HTML and
+XHTML. For example, consider:
Content-Type: text/html
@@ -864,69 +870,70 @@
If this is processed using HTML rules, the first > will end the
processing instruction and the XHTML document type declaration
-would be the relevant declaration for the document, if it is
+would be the relevant declaration for the document. If it is
processed using XHTML rules, the ?> will end the processing
instruction and the HTML document type declaration would be the
relevant declaration.
-IOW, an application would need to assume a certain character
-encoding (family) to process enough of the document to determine
-whether it is XHTML or HTML and the result of this detection would
-depend on which processing rules are assumed in order to process it.
-It is thus in essence not possible to write a "perfect" detection
-algorithm, which is why this routine attempts to avoid making any
-decisions on this matter.
+In other words, an application would need to assume a certain character
+encoding (family) to process enough of the document to determine whether
+it is XHTML or HTML, and the result of this detection would depend on
+which processing rules are assumed in order to process it. It is thus
+in essence not possible to write a "perfect" detection algorithm, which
+is why this routine attempts to avoid making any decisions on this
+matter.
=item encoding_from_http_message($message [, %options])
-Determines the encoding of HTML / XML / XHTML documents enclosed
-in HTTP message. $message is an object compatible to L,
-e.g. a L object. %options is a hash with the following
-possible entries:
+Determines the encoding of HTML/XML/XHTML documents enclosed in an HTTP
+message. $message is an object compatible withL, e.g. a
+L object. %options is a hash with the following possible
+entries:
=over 2
=item encodings
-array references of suspected character encodings, defaults to
+Array references of suspected character encodings; defaults to
C<$HTML::Encoding::DEFAULT_ENCODINGS>.
=item is_html
Regular expression matched against the content_type of the message
-to determine whether to use HTML rules for the entity body, defaults
+to determine whether to use HTML rules for the entity body; defaults
to C.
=item is_xml
Regular expression matched against the content_type of the message
-to determine whether to use XML rules for the entity body, defaults
+to determine whether to use XML rules for the entity body; defaults
to C.
=item is_text_xml
Regular expression matched against the content_type of the message
-to determine whether to use text/html rules for the message, defaults
+to determine whether to use text/html rules for the message; defaults
to C. This will only be checked if is_xml
-matches aswell.
+matches as well.
=item html_default
-Default encoding for documents determined (by is_html) as HTML,
+Default encoding for documents determined (by is_html) as HTML;
defaults to C.
=item xml_default
-Default encoding for documents determined (by is_xml) as XML,
+Default encoding for documents determined (by is_xml) as XML;
defaults to C.
=item text_xml_default
-Default encoding for documents determined (by is_text_xml) as text/xml,
-defaults to C in which case the default is ignored. This should
-be set to C if desired as this module is by default
-inconsistent with RFC 3023 which requires that for text/xml documents
-without a charset parameter in the HTTP header C is assumed.
+Default encoding for documents determined (by is_text_xml) as text/xml;
+defaults to C, in which case the default is ignored. This should
+be set to C if desired, as this module is by default
+inconsistent with RFC 3023; that RFC requires that for text/xml
+documents without a charset parameter in the HTTP header, C is
+assumed.
This requirement is inconsistent with RFC 2616 (HTTP/1.1) which requires
to assume C, has been widely ignored and is thus disabled by
@@ -935,18 +942,18 @@
=item xhtml
Whether the routine should look for an encoding declaration in the
-XML declaration of the document (if any), defaults to C<1>.
+XML declaration of the document (if any); defaults to C<1>.
=item default
Whether the relevant default value should be returned when no other
-information can be determined, defaults to C<1>.
+information can be determined; defaults to C<1>.
=back
-This is furhter possibly inconsistent with XML MIME types that differ
-in other ways from application/xml, for example if the MIME Type does
-not allow for a charset parameter in which case applications might be
+This is possibly further inconsistent with XML MIME types that differ
+in other ways from application/xml (for example, if the MIME type does
+not allow for a charset parameter), in which case applications might be
expected to ignore the charset parameter if erroneously provided.
=back
@@ -954,17 +961,17 @@
=head1 EBCDIC SUPPORT
By default, this module does not support EBCDIC encodings. To enable
-support for EBCDIC encodings you can either change the
+support for EBCDIC encodings, you can either change the
$HTML::Encodings::DEFAULT_ENCODINGS array reference or pass the
-encodings to the routines you use using the encodings option, for
-example
+encodings to the routines you use using the encodings option; for
+example:
my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
my $enc = encoding_from_xml_document($doc, encodings => \@try);
Note that there are some subtle differences between various EBCDIC
-encodings, for example C is mapped to 0x5A in C and
-to 0x4F in C; these differences might affect processing in
+encodings. For example, C is mapped to 0x5A in C and
+to 0x4F in C. These differences might affect processing in
yet undetermined ways.
=head1 TODO
@@ -994,4 +1001,8 @@
Copyright (c) 2004-2008 Bjoern Hoehrmann .
This module is licensed under the same terms as Perl itself.
+ This document has been edited for grammar, spelling, and clarity by
+ Larry Gilbert for the MacPorts Project. (Some
+ especially opaque passages have been left alone.)
+
=cut