dotnet clr : Char.IsPunctuation vs. CRT is(w)punct


Jeff Pek (Autodesk)
11/6/2006 7:40:31 AM
Hi all -

A KB article indicates that Char.IsPunctuation is the .NET "equivalent" of the
CRT's isXpunct functions (e.g., iswpunct). However, I've found
significant differences in their behavior. As a test, I ran each function
over the first 1000 or so Unicode characters and got the results that
follow. The output lists the characters for which the two functions returned
different results and shows what the .NET method said. I'm sure there are
other differences later in the character set.

So far, I haven't seen any documentation regarding the specific differences.
I wonder if anything exists. Thanks for any pointers.

Regards,
Jeff

--------

IsPunctuation mismatch: ! (33). .NET says: True
IsPunctuation mismatch: " (34). .NET says: True
IsPunctuation mismatch: # (35). .NET says: True
IsPunctuation mismatch: $ (36). .NET says: False
IsPunctuation mismatch: % (37). .NET says: True
IsPunctuation mismatch: & (38). .NET says: True
IsPunctuation mismatch: ' (39). .NET says: True
IsPunctuation mismatch: ( (40). .NET says: True
IsPunctuation mismatch: ) (41). .NET says: True
IsPunctuation mismatch: * (42). .NET says: True
IsPunctuation mismatch: + (43). .NET says: False
IsPunctuation mismatch: , (44). .NET says: True
IsPunctuation mismatch: - (45). .NET says: True
IsPunctuation mismatch: . (46). .NET says: True
IsPunctuation mismatch: / (47). .NET says: True
IsPunctuation mismatch: : (58). .NET says: True
IsPunctuation mismatch: ; (59). .NET says: True
IsPunctuation mismatch: < (60). .NET says: False
IsPunctuation mismatch: = (61). .NET says: False
IsPunctuation mismatch: > (62). .NET says: False
IsPunctuation mismatch: ? (63). .NET says: True
IsPunctuation mismatch: @ (64). .NET says: True
IsPunctuation mismatch: [ (91). .NET says: True
IsPunctuation mismatch: \ (92). .NET says: True
IsPunctuation mismatch: ] (93). .NET says: True
IsPunctuation mismatch: ^ (94). .NET says: False
IsPunctuation mismatch: _ (95). .NET says: True
IsPunctuation mismatch: ` (96). .NET says: False
IsPunctuation mismatch: { (123). .NET says: True
IsPunctuation mismatch: | (124). .NET says: False
IsPunctuation mismatch: } (125). .NET says: True
IsPunctuation mismatch: ~ (126). .NET says: False
IsPunctuation mismatch: ¡ (161). .NET says: True
IsPunctuation mismatch: ¢ (162). .NET says: False
IsPunctuation mismatch: £ (163). .NET says: False
IsPunctuation mismatch: ¤ (164). .NET says: False
IsPunctuation mismatch: ¥ (165). .NET says: False
IsPunctuation mismatch: ¦ (166). .NET says: False
IsPunctuation mismatch: § (167). .NET says: False
IsPunctuation mismatch: ¨ (168). .NET says: False
IsPunctuation mismatch: © (169). .NET says: False
IsPunctuation mismatch: ª (170). .NET says: False
IsPunctuation mismatch: « (171). .NET says: True
IsPunctuation mismatch: ¬ (172). .NET says: False
IsPunctuation mismatch: [soft hyphen] (173). .NET says: True
IsPunctuation mismatch: ® (174). .NET says: False
IsPunctuation mismatch: ¯ (175). .NET says: False
IsPunctuation mismatch: ° (176). .NET says: False
IsPunctuation mismatch: ± (177). .NET says: False
IsPunctuation mismatch: ² (178). .NET says: False
IsPunctuation mismatch: ³ (179). .NET says: False
IsPunctuation mismatch: ´ (180). .NET says: False
IsPunctuation mismatch: µ (181). .NET says: False
IsPunctuation mismatch: ¶ (182). .NET says: False
IsPunctuation mismatch: · (183). .NET says: True
IsPunctuation mismatch: ¸ (184). .NET says: False
IsPunctuation mismatch: ¹ (185). .NET says: False
IsPunctuation mismatch: º (186). .NET says: False
IsPunctuation mismatch: » (187). .NET says: True
IsPunctuation mismatch: ¼ (188). .NET says: False
IsPunctuation mismatch: ½ (189). .NET says: False
IsPunctuation mismatch: ¾ (190). .NET says: False
IsPunctuation mismatch: ¿ (191). .NET says: True
IsPunctuation mismatch: × (215). .NET says: False
IsPunctuation mismatch: ÷ (247). .NET says: False
IsPunctuation mismatch: ; (894). .NET says: True
IsPunctuation mismatch: · (903). .NET says: True
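
The CRT half of a test like the one described above could look roughly like the sketch below. The helper name crt_punctuation is made up for illustration and is not from the original post. One detail worth checking in any such comparison: iswpunct returns an int that is only guaranteed to be nonzero on a match, not necessarily 1. If that raw value is compared directly against a bool from Char.IsPunctuation, characters on which both functions agree can be reported as mismatches, which may be why the list above also contains characters for which .NET says True.

```cpp
#include <cwctype>
#include <vector>

// Hypothetical helper (not from the original post): collect every code point
// in [first, last) that the CRT classifies as punctuation in the current locale.
std::vector<wchar_t> crt_punctuation(wchar_t first, wchar_t last) {
    std::vector<wchar_t> result;
    for (wchar_t c = first; c < last; ++c) {
        // iswpunct only promises a nonzero int on a match, so normalize it
        // to bool before comparing it with a bool from another API.
        bool is_punct = std::iswpunct(static_cast<std::wint_t>(c)) != 0;
        if (is_punct) {
            result.push_back(c);
        }
    }
    return result;
}
```

Calling crt_punctuation(0, 1000) covers roughly the range tested above. Results beyond ASCII depend on the CRT implementation and the active locale.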

RobinS
11/6/2006 9:28:38 AM
Well, technically, the ones that .NET is not marking
as punctuation are NOT punctuation. In what sentence
do you use > or = or < or << or $ as punctuation?

You might check out Char.IsWhiteSpace to take out
some of the weird control characters.

What exactly are you trying to accomplish?

Robin S.
-----------------------------------------

Jeff Pek (Autodesk)
11/6/2006 1:25:05 PM
I agree. The issue here is that there is some existing C++ code that I'm
trying to refactor and use within a C# library. I'd like to have equivalent
functionality; this is one important aspect of accomplishing that.

I could use a C++/CLI module to ensure equivalent behavior, but I'd like to
avoid that.
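
For the ASCII range at least, the nine characters that the list above reports as False ($ + < = > ^ ` | ~) are the ones Unicode classifies as symbols rather than punctuation, and .NET exposes that class through Char.IsSymbol. So one possible way to approximate iswpunct for ASCII input in C# is Char.IsPunctuation(c) || Char.IsSymbol(c). The C++ sketch below only illustrates that partition on the CRT side. The helper names are hypothetical, and nothing here is verified beyond ASCII.

```cpp
#include <cwctype>
#include <string>

// ASCII characters that the test output in this thread reports as
// "IsPunctuation == False" although the CRT treats them as punctuation.
// Unicode places them in symbol categories (currency, math, modifier).
const std::wstring kAsciiSymbols = L"$+<=>^`|~";

// Hypothetical helper: true for the nine ASCII symbol characters above.
bool is_ascii_symbol(wchar_t c) {
    return kAsciiSymbols.find(c) != std::wstring::npos;
}

// Hypothetical helper: ASCII characters that iswpunct accepts and that are
// not symbols, i.e. the ones the thread's output shows .NET reporting as True.
bool is_ascii_unicode_punct(wchar_t c) {
    return c >= 0 && c < 128 &&
           std::iswpunct(static_cast<std::wint_t>(c)) != 0 &&
           !is_ascii_symbol(c);
}
```

Together the two helpers cover exactly the ASCII characters for which iswpunct returns nonzero in the default "C" locale.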

Thanks for the response.

Jeff

RobinS
11/6/2006 7:40:02 PM
So are you just trying to clear all the junk out of a string,
or you want to know if there's junk in the string, or what?

If that's the case, you could write a function to do that, and
just call it. Would that work?

Robin S.


Ben Voigt
11/7/2006 3:05:11 PM

I think you could just P/Invoke iswpunct from MSVCR80.DLL, if there's a
function definition and not just a macro.

Chris Mullins
11/7/2006 5:23:34 PM
Once you hit Unicode land, I think determining punctuation is difficult.
There is a good answer though:
Stringprep - http://www.ietf.org/rfc/rfc3454.txt

Stringprep addresses case folding, whitespace, prohibited characters,
bidirectional validity, and normalization form.

An example profile is nameprep, which is how Internationalized Domain Names
work:
http://tools.ietf.org/html/rfc3491

Another example profile is "resourceprep" which is part of the XMPP
standard:
http://www.xmpp.org/internet-drafts/attic/draft-ietf-xmpp-resourceprep-03.html

For example, this profile prohibits all characters in:
Table C.1.2
Table C.2.1
Table C.2.2
Table C.3
Table C.4
Table C.5
Table C.6
Table C.7
Table C.8
Table C.9

It specifies Unicode normalization form KC, and that bidirectional checking
must be performed.
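
As a rough illustration of how a "prohibited characters" table is applied, here is a minimal C++ sketch. It includes only RFC 3454 Table C.2.1 (ASCII control characters); a real profile would also need the other tables listed above, plus mapping, normalization, and bidirectional checks. The type and function names are invented for the example.

```cpp
#include <cstdint>
#include <string>

// Inclusive range of Unicode code points.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// RFC 3454 Table C.2.1 (ASCII control characters) only. The remaining
// tables of a profile are deliberately omitted from this sketch.
const CodePointRange kProhibited[] = {
    {0x0000, 0x001F},
    {0x007F, 0x007F},
};

// Hypothetical helper: true if the code point falls in a prohibited range.
bool is_prohibited(std::uint32_t cp) {
    for (const CodePointRange& r : kProhibited) {
        if (cp >= r.first && cp <= r.last) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: true if any code point of the string is prohibited.
bool contains_prohibited(const std::u32string& s) {
    for (char32_t cp : s) {
        if (is_prohibited(static_cast<std::uint32_t>(cp))) {
            return true;
        }
    }
    return false;
}
```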

--
Chris Mullins

Chris Mullins
11/7/2006 5:33:31 PM

I should add that there is an open-source C# implementation of stringprep that
is part of libidn. This implementation is a bit memory hungry, and not exactly
tuned for optimal performance, but it works.


--
Chris Mullins, MCSD.NET, MCPD:Enterprise
http://www.coversant.net/blogs/cmullins


Jeff Pek (Autodesk)
11/9/2006 5:34:02 PM
Thanks, all. This is all good stuff. What I was trying to do was to mimic
the behavior of iswpunct (and therefore the existing code). P/Invoking
iswpunct seems reasonable, provided that I know that the DLL is going to be
there.

- jp

Ben Voigt
11/10/2006 10:41:32 AM

MSVCRT.DLL has been distributed with recent versions of Windows, and with service
packs for not-so-recent versions, and it exports all the character
classification functions.
