\input texinfo @c -*- mode: texinfo; coding: us-ascii; -*-
@c This file is part of GNU Libidn.
@c See below for copyright and license.
@setfilename libidn.info
@documentencoding UTF-8
@include version.texi
@settitle GNU Libidn @value{VERSION}
@finalout
@syncodeindex pg cp
@copying
This manual is last updated @value{UPDATED} for version
@value{VERSION} of GNU Libidn.
Copyright @copyright{} 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
2010, 2011 Simon Josefsson.
@quotation
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license is included in the section entitled ``GNU Free
Documentation License''.
@end quotation
@end copying
@dircategory Software libraries
@direntry
* libidn: (libidn). Internationalized string processing library.
@end direntry
@dircategory Localization
@direntry
* idn: (libidn)Invoking idn. Internationalized Domain Name (IDN) string conversion.
@end direntry
@dircategory Emacs
@direntry
* IDN Library: (libidn)Emacs API. Emacs API for IDN functions.
@end direntry
@titlepage
@title GNU Libidn
@subtitle Internationalized string processing for the GNU system
@subtitle for version @value{VERSION}, @value{UPDATED}
@author Simon Josefsson
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top
@top GNU Libidn
@insertcopying
@end ifnottex
@menu
* Introduction:: How to use this manual.
* Preparation:: What you should do before using the library.
* Utility Functions:: Unicode transformation utility functions.
* Stringprep Functions:: Stringprep functions.
* Punycode Functions:: Punycode functions.
* IDNA Functions:: IDNA functions.
* TLD Functions:: TLD functions.
* PR29 Functions:: Detect strings non-idempotent under NFKC.
* Examples:: Demonstrate how to use the library.
* Invoking idn:: Command line interface to the library.
* Emacs API:: Emacs Lisp API for Libidn.
* Java API:: Notes on the Java port of Libidn.
* C# API:: Notes on the C# port of Libidn.
* Acknowledgements:: Whom to blame.
* History:: Rough outline of development history.
Appendices
* PR29 discussion:: Implementation aspects of the PR29 flaw.
* On Label Separators:: Discussions of a flaw in the IDNA spec.
* Copying Information:: License text covering the Libidn library.
Indices
* Function and Variable Index::
* Concept Index::
@end menu
@node Introduction
@chapter Introduction
GNU Libidn is a fully documented implementation of the Stringprep,
Punycode and IDNA specifications. Libidn's purpose is to encode and
decode internationalized domain names. The native C, C# and Java
libraries are available under the GNU Lesser General Public License
version 2.1 or later (@pxref{GNU LGPL}).
The library contains a generic Stringprep implementation. Profiles
for Nameprep, iSCSI, SASL, XMPP and Kerberos V5 are included.
Punycode and ASCII Compatible Encoding (ACE) via IDNA are supported.
A mechanism to define Top-Level Domain (TLD) specific validation
tables, and to compare strings against those tables, is included.
Default tables for some TLDs are also included.
The Stringprep API consists of two main functions, one for converting
data from the system's native representation into UTF-8, and one
function to perform the Stringprep processing. Adding a new
Stringprep profile for your application within the API is
straightforward. The Punycode API consists of one encoding function
and one decoding function. The IDNA API consists of the ToASCII and
ToUnicode functions, as well as a high-level interface for converting
entire domain names to and from the ACE encoded form. The TLD API
consists of one set of functions to extract the TLD name from a domain
string, one set of functions to locate the proper TLD table to use
based on the TLD name, core functions to validate a string against a
TLD table, and some utility wrappers to perform all the steps in one
call.
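
As a rough sketch of the two Stringprep steps mentioned above, the
following helper first converts a string from the locale's encoding to
UTF-8 and then applies a profile to it. The helper name
@code{example_nameprep} is hypothetical and not part of Libidn, and
the choice of the Nameprep profile is only an assumption made for this
illustration. Linking against the library is assumed.

@verbatim
#include <stdlib.h>
#include <stringprep.h>

/* Hypothetical helper, not part of Libidn: convert a string in the
   locale's encoding to UTF-8 and apply the Nameprep profile to it.
   Returns 0 on success and 1 on failure.  On success, *out holds a
   newly allocated string that the caller must release.  */
int
example_nameprep (const char *locale_str, char **out)
{
  char *utf8;
  int rc;

  /* Step 1: native representation to UTF-8.  */
  utf8 = stringprep_locale_to_utf8 (locale_str);
  if (utf8 == NULL)
    return 1;

  /* Step 2: Stringprep processing with the chosen profile.  */
  rc = stringprep_profile (utf8, out, "Nameprep", 0);
  free (utf8);

  return rc == STRINGPREP_OK ? 0 : 1;
}
@end verbatim

The caller checks the return value before touching the output
variable, and releases the result when it is no longer needed.
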
The library is used by, e.g., GNU SASL and Shishi to process user
names and passwords. Libidn can be built into GNU Libc to enable a
new system-wide getaddrinfo flag for IDN processing.
Libidn is developed for the GNU/Linux system, but runs on over 20 Unix
platforms (including Solaris, IRIX, AIX, and Tru64) and Windows. The
library is written in C and (parts of) the API is also accessible from
C++, Emacs Lisp, Python and Java. Native Java and C# ports are
included.
Also included is a command line tool, several self tests, code
examples, and more, all licensed under the GNU General Public License
version 3.0 or later (@pxref{GNU GPL}).
@menu
* Getting Started::
* Features::
* Library Overview::
* Supported Platforms::
* Getting help::
* Commercial Support::
* Downloading and Installing::
* Bug Reports::
* Contributing::
@end menu
@node Getting Started
@section Getting Started
This manual documents the library programming interface. All
functions and data types provided by the library are explained.
Also included are examples, and documentation for the command line
tool @file{idn} that provides a quick interface to the library. The
Emacs Lisp bindings for the library are also discussed.
The reader is assumed to possess basic familiarity with
internationalization concepts and network programming in C or C++.
This manual can be used in several ways. If read from the beginning
to the end, it gives a good introduction to the library and how it
can be used in an application. Forward references are included where
necessary. Later on, the manual can be used as a reference manual to
get just the information needed about any particular interface of the
library. Experienced programmers might want to start looking at the
examples at the end of the manual (@pxref{Examples}), and then only
read up on those parts of the interface which are unclear.
@node Features
@section Features
This library might have a couple of advantages over other libraries
doing a similar job.
@table @asis
@item It's Free Software
Anybody can use, modify, and redistribute it under the terms of the
GNU Lesser General Public License version 2.1 or later (@pxref{GNU
LGPL}).
@item It's thread-safe
No global state is kept in the library. All functions are re-entrant.
@item It's portable
The code is intended to be written in pure ANSI C89. It has been
tested on many Unix-like operating systems, and on Windows.
@item It's modularized
The library is composed of several modules, and the only interaction
between modules is through each module's public API. If you only need
one piece of functionality, it is possible to take the files you need
and incorporate them into your own project.
@item It's not bloated
The design of the library is based on the smallest API necessary to
implement the basic functionality. It has been carefully extended
with a small number of high-level wrappers to make it comfortable to
use the library. However, it does not implement additional
functionality just for the sake of completeness.
@item It's documented
Sadly, not all software comes with documentation these days. This one
does.
@end table
@node Library Overview
@section Library Overview
The following illustration shows the components that make up Libidn,
and how your application relates to the library. In the illustration,
various components are shown as boxes. You see the generic StringPrep
component, the various StringPrep profiles including Nameprep, the
Punycode component, the IDNA component, and the TLD component. The
arrows indicate aggregation, e.g., IDNA uses Punycode and Nameprep,
and in turn Nameprep uses the generic StringPrep interface. The
interfaces to all components are available for applications; no
component within the library is hidden from the application.
@image{libidn-components}
@node Supported Platforms
@section Supported Platforms
Libidn has at some point in time been tested on the following
platforms. Build reports for each platform and Libidn version are
available at @url{http://autobuild.josefsson.org/libidn/}.
@enumerate
@item Debian GNU/Linux 3.0 (Woody)
@cindex Debian
GCC 2.95.4 and GNU Make. This is the main development platform.
@code{alphaev67-unknown-linux-gnu}, @code{alphaev6-unknown-linux-gnu},
@code{arm-unknown-linux-gnu}, @code{armv4l-unknown-linux-gnu},
@code{hppa-unknown-linux-gnu}, @code{hppa64-unknown-linux-gnu},
@code{i686-pc-linux-gnu}, @code{ia64-unknown-linux-gnu},
@code{m68k-unknown-linux-gnu}, @code{mips-unknown-linux-gnu},
@code{mipsel-unknown-linux-gnu}, @code{powerpc-unknown-linux-gnu},
@code{s390-ibm-linux-gnu}, @code{sparc-unknown-linux-gnu},
@code{sparc64-unknown-linux-gnu}.
@item Debian GNU/Linux 2.1
@cindex Debian
GCC 2.95.1 and GNU Make. @code{armv4l-unknown-linux-gnu}.
@item Tru64 UNIX
@cindex Tru64
Tru64 UNIX C compiler and Tru64 Make. @code{alphaev67-dec-osf5.1},
@code{alphaev68-dec-osf5.1}.
@item SuSE Linux 7.1
@cindex SuSE
GCC 2.96 and GNU Make. @code{alphaev6-unknown-linux-gnu},
@code{alphaev67-unknown-linux-gnu}.
@item SuSE Linux 7.2a
@cindex SuSE Linux
GCC 3.0 and GNU Make. @code{ia64-unknown-linux-gnu}.
@item SuSE Linux
@cindex SuSE Linux
GCC 3.2.2 and GNU Make. @code{x86_64-unknown-linux-gnu} (AMD64
Opteron ``Melody'').
@item SuSE Enterprise Server 9 on IBM OpenPower 720
@cindex SuSE Linux
@cindex OpenPower 720
GCC 3.3.3 and GNU Make. @code{powerpc64-unknown-linux-gnu}.
@item RedHat Linux 7.2
@cindex RedHat
GCC 2.96 and GNU Make. @code{alphaev6-unknown-linux-gnu},
@code{alphaev67-unknown-linux-gnu}, @code{ia64-unknown-linux-gnu}.
@item RedHat Linux 8.0
@cindex RedHat
GCC 3.2 and GNU Make. @code{i686-pc-linux-gnu}.
@item RedHat Advanced Server 2.1
@cindex RedHat Advanced Server
GCC 2.96 and GNU Make. @code{i686-pc-linux-gnu}.
@item Slackware Linux 8.0.01
@cindex Slackware
GCC 2.95.3 and GNU Make. @code{i686-pc-linux-gnu}.
@item Mandrake Linux 9.0
@cindex Mandrake
GCC 3.2 and GNU Make. @code{i686-pc-linux-gnu}.
@item IRIX 6.5
@cindex IRIX
MIPS C compiler, IRIX Make. @code{mips-sgi-irix6.5}.
@item AIX 4.3.2
@cindex AIX
IBM C for AIX compiler, AIX Make. @code{rs6000-ibm-aix4.3.2.0}.
@item Microsoft Windows 2000 (Cygwin)
@cindex Windows
GCC 3.2, GNU make. @code{i686-pc-cygwin}.
@item HP-UX 11
@cindex HP-UX
HP-UX C compiler and HP Make. @code{ia64-hp-hpux11.22},
@code{hppa2.0w-hp-hpux11.11}.
@item SUN Solaris 2.7
@cindex Solaris
GCC 3.0.4 and GNU Make. @code{sparc-sun-solaris2.7}.
@item SUN Solaris 2.8
@cindex Solaris
Sun WorkShop Compiler C 6.0 and SUN Make. @code{sparc-sun-solaris2.8}.
@item SUN Solaris 2.9
@cindex Solaris
Sun Forte Developer 7 C compiler and GNU
Make. @code{sparc-sun-solaris2.9}.
@item NetBSD 1.6
@cindex NetBSD
GCC 2.95.3 and GNU Make. @code{alpha-unknown-netbsd1.6},
@code{i386-unknown-netbsdelf1.6}.
@item OpenBSD 3.1 and 3.2
@cindex OpenBSD
GCC 2.95.3 and GNU Make. @code{alpha-unknown-openbsd3.1},
@code{i386-unknown-openbsd3.1}.
@item FreeBSD 4.7 and 4.8
@cindex FreeBSD
GCC 2.95.4 and GNU Make. @code{alpha-unknown-freebsd4.7},
@code{alpha-unknown-freebsd4.8}, @code{i386-unknown-freebsd4.7},
@code{i386-unknown-freebsd4.8}.
@item MacOS X 10.2 Server Edition
@cindex MacOS X
GCC 3.1 and GNU Make. @code{powerpc-apple-darwin6.5}.
@item MacOS X 10.4 ``Tiger'' with Xcode 2.0
@cindex MacOS X
GCC 4.0 and GNU Make. @code{powerpc-apple-darwin8.0}.
@item Cross compiled to uClinux/uClibc on Motorola Coldfire
@cindex Motorola Coldfire
@cindex uClinux
@cindex uClibc
GCC 3.4 and GNU Make. @code{m68k-uclinux-elf}.
@item Cross compiled to ARM using Glibc
@cindex ARM
GCC 2.95 and GNU Make. @code{arm-linux}.
@item Cross compiled to Mingw32
@cindex Windows
@cindex Microsoft
@cindex mingw32
GCC 3.4.4 and GNU Make. @code{i586-mingw32msvc}.
@item OS/2
@cindex OS/2
@cindex IBM
GCC.
@end enumerate
If you use Libidn on, or port Libidn to, a new platform, please report
it to the author.
@node Getting help
@section Getting help
A mailing list where users of Libidn may help each other exists, and
you can reach it by sending e-mail to @email{help-libidn@@gnu.org}.
Archives of the mailing list discussions, and an interface to manage
subscriptions, are available through the World Wide Web at
@url{http://lists.gnu.org/mailman/listinfo/help-libidn}.
@node Commercial Support
@section Commercial Support
Commercial support is available for users of GNU Libidn. The kind of
support that can be purchased may include:
@itemize
@item Implementing new features.
This could include country code specific profiling to support a
restricted subset of Unicode.
@item Porting Libidn to new platforms.
This could include porting Libidn to embedded platforms that may
need memory or size optimization.
@item Integrating IDN support in your existing project.
@item System design of components related to IDN.
@end itemize
If you are interested, please write to:
@verbatim
Simon Josefsson Datakonsult AB
Hagagatan 24
113 47 Stockholm
Sweden
E-mail: simon@josefsson.org
@end verbatim
If your company provides support related to GNU Libidn and would like
to be mentioned here, contact the author (@pxref{Bug Reports}).
@node Downloading and Installing
@section Downloading and Installing
@cindex Installation
@cindex Download
The package can be downloaded from several places, including:
@url{ftp://alpha.gnu.org/pub/gnu/libidn/}
The latest version is stored in a file, e.g.,
@samp{libidn-@value{VERSION}.tar.gz} where the @samp{@value{VERSION}}
value is the highest version number in the directory.
The package is then extracted, configured and built like many other
packages that use Autoconf. For detailed information on configuring
and building it, refer to the @file{INSTALL} file that is part of the
distribution archive.
Here is an example terminal session that downloads, configures, builds
and installs the package. You will need a few basic tools, such as
@samp{sh}, @samp{make} and @samp{cc}.
@example
$ wget -q ftp://alpha.gnu.org/pub/gnu/libidn/libidn-@value{VERSION}.tar.gz
$ tar xfz libidn-@value{VERSION}.tar.gz
$ cd libidn-@value{VERSION}/
$ ./configure
...
$ make
...
$ make install
...
@end example
After that Libidn should be properly installed and ready for use.
A few @code{configure} options may be relevant; they are summarized in
the following table.
@table @code
@item --enable-java
Build the Java port into a *.JAR file. @xref{Java API}, for more
information.
@item --disable-tld
Disable the TLD module. This would typically only be useful if you
are building on a memory-restricted platform. @xref{TLD Functions},
for more information.
@item --enable-csharp[=IMPL]
Build the @code{C#} port into a @code{*.DLL} file. @xref{C# API}, for
more information. Here, @code{IMPL} is @code{pnet} or @code{mono},
indicating whether the PNET @command{cscc} compiler or the Mono
@command{mcs} compiler should be used, respectively.
@item --disable-valgrind-tests
Disable running the self-checks under Valgrind
(@url{http://valgrind.org/}). Normally Valgrind does not cause
problems and can detect some severe memory errors. If you are getting
errors from Valgrind that are caused by the compiler or libc (possibly
as a result of special optimization flags), you may use this option to
disable the use of Valgrind.
@end table
For the complete list, refer to the output from @code{configure
--help}.
@menu
* Installing under Windows:: Windows specific build instructions.
@end menu
@node Installing under Windows
@subsection Installing under Windows
There are two ways to build Libidn on Windows: via MinGW or via Visual
Studio.
With MinGW, you can build a Libidn DLL and use it from other
applications. After installing MinGW (@url{http://mingw.org/}) follow
the generic installation instructions (@pxref{Downloading and
Installing}). The DLL is installed by default.
For information on how to use the DLL in other applications, see:
@url{http://www.mingw.org/mingwfaq.shtml#faq-msvcdll}.
You can build Libidn as a native Visual Studio C++ project. This
allows you to build the code for other platforms that VS supports,
such as Windows Mobile. You need Visual Studio 2005 or later.
First download and unpack the archive as described in the generic
installation instructions (@pxref{Downloading and Installing}). Don't
run @code{./configure}. Instead, start Visual Studio and open the
project file @file{win32/libidn.sln} inside the Libidn directory. You
should be able to build the project using Build Project.
Output libraries will be written into the @code{win32/lib} (or
@code{win32/lib/debug} for Debug versions) folder.
When working with Windows you may want to look into the special memory
handling functions that may be needed (@pxref{Memory handling under
Windows}).
@node Bug Reports
@section Bug Reports
@cindex Reporting Bugs
If you think you have found a bug in Libidn, please investigate it and
report it.
@itemize @bullet
@item Please make sure that the bug is really in Libidn, and
preferably also check that it hasn't already been fixed in the latest
version.
@item You have to send us a test case that makes it possible for us to
reproduce the bug.
@item You also have to explain what is wrong: whether you get a crash,
or whether the printed results are incorrect and, if so, in what way.
Make sure that the bug report includes all information you would need
to fix this kind of bug for someone else.
@end itemize
Please make an effort to produce a self-contained report, with
something definite that can be tested or debugged. Vague queries or
piecemeal messages are difficult to act on and don't help the
development effort.
If your bug report is good, we will do our best to help you to get a
corrected version of the software; if the bug report is poor, we won't
do anything about it (apart from asking you to send better bug
reports).
If you think something in this manual is unclear, or downright
incorrect, or if the language needs to be improved, please also send a
note.
Send your bug report to:
@center @samp{bug-libidn@@gnu.org}
@node Contributing
@section Contributing
@cindex Contributing
@cindex Hacking
If you want to submit a patch for inclusion, whether it fixes a typo
you discovered or adds support for a new feature, you should
submit it as a bug report (@pxref{Bug Reports}). There are some
things that you can do to increase the chances for it to be included
in the official package.
Unless your patch is very small (say, under 10 lines) we require that
you assign the copyright of your work to the Free Software Foundation.
This is to protect the freedom of the project. If you have not
already signed papers, we will send you the necessary information when
you submit your contribution.
For contributions that do not consist of actual programming code, the
only guidelines are common sense. Use it.
For code contributions, a number of style guides will help you:
@itemize @bullet
@item Coding Style.
Follow the GNU Coding Standards (@pxref{top, GNU Coding Standards,,
standards}).
If you normally code using another coding standard, there is no
problem, but you should use @samp{indent} to reformat the code
(@pxref{top, GNU Indent,, indent}) before submitting your work.
@item Use the unified diff format @samp{diff -u}.
@item Return errors.
Nothing should abort execution inside the library. Even memory
allocation failures, e.g., when @code{malloc} returns @code{NULL},
should be handled and reported through an error code.
@item Design with thread safety in mind.
Don't use global variables and the like.
@item Avoid using the C math library.
It causes problems for embedded implementations, and in most
situations it is very easy to avoid using it.
@item Document your functions.
Put a comment before each function header; properly formatted
comments are extracted into GTK-DOC web pages. Don't forget to
update the Texinfo manual as well.
@item Supply ChangeLog and NEWS entries, where appropriate.
@end itemize
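
The guidelines on returning errors, thread safety and function
documentation can be sketched together as follows. This is only an
illustration under stated assumptions: the function
@code{example_dup_label} and the @code{EXAMPLE_*} return codes are
hypothetical names invented for this sketch and are not part of
Libidn, and the comment layout follows the general GTK-DOC style
rather than any specific function in the library.

@verbatim
#include <stdlib.h>
#include <string.h>

/* Hypothetical return codes, invented for this sketch.  */
enum
{
  EXAMPLE_SUCCESS = 0,
  EXAMPLE_MALLOC_ERROR = 1
};

/**
 * example_dup_label:
 * @in: zero terminated input string.
 * @out: output variable that receives a newly allocated copy of @in.
 *
 * Copy @in into newly allocated memory.  All state is passed through
 * the parameters, so no global variables are needed.
 *
 * Return value: Returns %EXAMPLE_SUCCESS on success, or
 *   %EXAMPLE_MALLOC_ERROR if memory could not be allocated.
 **/
int
example_dup_label (const char *in, char **out)
{
  size_t len = strlen (in) + 1;
  char *copy = malloc (len);

  if (copy == NULL)
    return EXAMPLE_MALLOC_ERROR;  /* report the failure, never abort */

  memcpy (copy, in, len);
  *out = copy;
  return EXAMPLE_SUCCESS;
}
@end verbatim

A caller inspects the return value and decides how to react, instead
of relying on the library to terminate the program.
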
@c **********************************************************
@c ******************* Preparation ************************
@c **********************************************************
@node Preparation
@chapter Preparation
To use `Libidn', you have to perform some changes to your sources and
the build system. The necessary changes are small and explained in
the following sections. The end of this chapter describes how the
library is initialized, and how the requirements of the library are
verified.
A faster way to find out how to adapt your application for use with
`Libidn' may be to look at the examples at the end of this manual
(@pxref{Examples}).
@menu
* Header::
* Initialization::
* Version Check::
* Building the source::
* Autoconf tests::
* Memory handling under Windows::
@end menu
@node Header
@section Header
The library contains a few independent parts, and each part exports its
interfaces (data types and functions) in a header file. You must
include the appropriate header files in all programs using the
library, either directly or through some other header file, like this:
@example
#include <stringprep.h>
@end example
The header files and the functions they define are categorized as
follows:
@table @asis
@item stringprep.h
The low-level stringprep API entry point. For IDN applications, this
is usually invoked via IDNA. Some applications, specifically non-IDN
ones, may want to prepare strings directly though, and should include
this header file.
The name space of the stringprep part of Libidn is @code{stringprep*}
for function names, @code{Stringprep*} for data types and
@code{STRINGPREP_*} for other symbols. In addition,
@code{_stringprep*} is reserved for internal use and should never be
used by applications.
@item punycode.h
The entry point to the Punycode encoding and decoding functions.
Normally Punycode is used via the @file{idna.h} interface, but some
applications may want to perform raw Punycode operations.
The name space of the punycode part of Libidn is @code{punycode_*} for
function names, @code{Punycode*} for data types and @code{PUNYCODE_*}
for other symbols. In addition, @code{_punycode*} is reserved for
internal use and should never be used by applications.
@item idna.h
The entry point to the IDNA functions. This is the normal entry point
for applications that need IDN functionality.
The name space of the IDNA part of Libidn is @code{idna_*} for
function names, @code{Idna*} for data types and @code{IDNA_*} for
other symbols. In addition, @code{_idna*} is reserved for internal
use and should never be used by applications.
@item tld.h
The entry point to the TLD functions. Normal applications are not
expected to need this functionality, but it is present for
applications that are used by TLDs to validate customer input.
The name space of the TLD part of Libidn is @code{tld_*} for function
names, @code{Tld_*} for data types and @code{TLD_*} for other symbols.
In addition, @code{_tld*} is reserved for internal use and should
never be used by applications.
@item pr29.h
The entry point to the PR29 functions. These functions are used to
detect ``problem sequences'' (@pxref{PR29 Functions}), mostly for use
in security critical applications.
The name space of the PR29 part of Libidn is @code{pr29_*} for
function names, @code{Pr29_*} for data types and @code{PR29_*} for
other symbols. In addition, @code{_pr29*} is reserved for internal
use and should never be used by applications.
@item idn-free.h
The entry point to the Windows memory de-allocation function
(@pxref{Memory handling under Windows}). It contains only one
function @code{idn_free}.
@end table
All header files define and use the symbol @code{IDNAPI} to decorate
the API functions.
@node Initialization
@section Initialization
Libidn is stateless and does not need any initialization.
@node Version Check
@section Version Check
It is often desirable to check that the version of `Libidn' used is
indeed one which fits all requirements.  Even with binary
compatibility, new features may have been introduced, and a problem
with the dynamic linker may cause an old version to be used instead.
So you may want to check that the version is okay right after program
startup.
@include texi/stringprep_check_version.texi
The normal way to use the function is to put something similar to the
following first in your @code{main}:
@example
if (!stringprep_check_version (STRINGPREP_VERSION))
@{
printf ("stringprep_check_version() failed:\n"
"Header file incompatible with shared library.\n");
exit(EXIT_FAILURE);
@}
@end example
@node Building the source
@section Building the source
@cindex Compiling your application
If you want to compile a source file including e.g. the `idna.h' header
file, you must make sure that the compiler can find it in the
directory hierarchy. This is accomplished by adding the path to the
directory in which the header file is located to the compiler's include
file search path (via the @option{-I} option).
However, the path to the include file is determined at the time the
source is configured. To solve this problem, `Libidn' uses the
external package @command{pkg-config} that knows the path to the
include file and other configuration options. The options that need
to be added to the compiler invocation at compile time are output by
the @option{--cflags} option to @command{pkg-config libidn}. The
following example shows how it can be used at the command line:
@example
gcc -c foo.c `pkg-config libidn --cflags`
@end example
Adding the output of @samp{pkg-config libidn --cflags} to the
compiler's command line will ensure that the compiler can find e.g. the
idna.h header file.
A similar problem occurs when linking the program with the library.
Again, the compiler has to find the library files. For this to work,
the path to the library files has to be added to the library search
path (via the @option{-L} option). For this, the option
@option{--libs} to @command{pkg-config libidn} can be used. For
convenience, this option also outputs all other options that are
required to link the program with the `libidn' library. The example
shows how to link @file{foo.o} with the `libidn' library to a program
@command{foo}.
@example
gcc -o foo foo.o `pkg-config libidn --libs`
@end example
Of course you can also combine both examples into a single command by
specifying both options to @command{pkg-config}:
@example
gcc -o foo foo.c `pkg-config libidn --cflags --libs`
@end example
@node Autoconf tests
@section Autoconf tests
@cindex Autoconf tests
@cindex Configure tests
If your project uses Autoconf (@pxref{top, GNU Autoconf,, autoconf})
to check for installed libraries, you might find the following snippet
illustrative.  It adds a new @file{configure} parameter
@code{--with-libidn}, checks for @file{idna.h} and @samp{-lidn}
(possibly below the directory specified as the optional argument to
@code{--with-libidn}), and defines the CPP symbol @code{LIBIDN} if the
library is found. The default behaviour is to search for the library
and enable the functionality (that is, define the symbol) when the
library is found, but if you wish to make the default behaviour of
your package be that Libidn is not used (even if it is installed on
the system), change @samp{libidn=yes} to @samp{libidn=no} on the third
line.
@example
AC_ARG_WITH(libidn, AC_HELP_STRING([--with-libidn=[DIR]],
[Support IDN (needs GNU Libidn)]),
libidn=$withval, libidn=yes)
if test "$libidn" != "no"; then
if test "$libidn" != "yes"; then
LDFLAGS="$@{LDFLAGS@} -L$libidn/lib"
CPPFLAGS="$@{CPPFLAGS@} -I$libidn/include"
fi
AC_CHECK_HEADER(idna.h,
AC_CHECK_LIB(idn, stringprep_check_version,
[libidn=yes LIBS="$@{LIBS@} -lidn"], libidn=no),
libidn=no)
fi
if test "$libidn" != "no" ; then
AC_DEFINE(LIBIDN, 1, [Define to 1 if you want IDN support.])
else
AC_MSG_WARN([Libidn not found])
fi
AC_MSG_CHECKING([if Libidn should be used])
AC_MSG_RESULT($libidn)
@end example
If you require that your users have installed @code{pkg-config} (which
I cannot recommend generally), the above can be done more easily as
follows.
@example
AC_ARG_WITH(libidn, AC_HELP_STRING([--with-libidn=[DIR]],
[Support IDN (needs GNU Libidn)]),
libidn=$withval, libidn=yes)
if test "$libidn" != "no" ; then
PKG_CHECK_MODULES(LIBIDN, libidn >= 0.0.0, [libidn=yes], [libidn=no])
if test "$libidn" != "yes" ; then
libidn=no
AC_MSG_WARN([Libidn not found])
else
libidn=yes
AC_DEFINE(LIBIDN, 1, [Define to 1 if you want Libidn.])
fi
fi
AC_MSG_CHECKING([if Libidn should be used])
AC_MSG_RESULT($libidn)
@end example
@node Memory handling under Windows
@section Memory handling under Windows
@cindex free
@cindex Memory handling
@cindex de-allocation
@cindex heap memory
Several functions in the library allocate memory.  The memory is
expected to be de-allocated using the @code{free} function. Under
Windows, it is sometimes necessary to de-allocate memory in the same
module that allocated a memory region. The reason is that different
modules use separate heap memory regions. To solve this problem we
provide a function to de-allocate memory inside the library.
Note that we do not generally recommend using this interface unless
you care about Windows portability.
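The pattern can be sketched as follows.  This is a minimal
illustration under stated assumptions, not Libidn code:
@code{demo_lib_dup} and @code{demo_lib_free} are hypothetical names
that stand in for a library function that allocates memory and for the
de-allocation function exported by that same library.

@verbatim
```c
#include <stdlib.h>
#include <string.h>

/* Library side (hypothetical names, not part of Libidn). */

/* Return a copy of S allocated on the library's heap, or NULL. */
char *
demo_lib_dup (const char *s)
{
  size_t n = strlen (s) + 1;
  char *copy = malloc (n);

  if (copy != NULL)
    memcpy (copy, s, n);
  return copy;
}

/* Release PTR on the same heap that demo_lib_dup allocated from.
   Because this function lives in the library module, the call to
   free is made by the module that called malloc. */
void
demo_lib_free (void *ptr)
{
  free (ptr);
}
```
@end verbatim

An application that follows this pattern passes every pointer it
received from the library back to the library's own de-allocation
function, rather than calling @code{free} itself.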
@section Header file @code{idn-free.h}
To use the function explained in this chapter, you need to include the
file @file{idn-free.h} using:
@example
#include <idn-free.h>
@end example
@section Memory de-allocation function
@include texi/idn_free.texi
@c **********************************************************
@c ******************** Utility Functions ******************
@c **********************************************************
@node Utility Functions
@chapter Utility Functions
@cindex Utility Functions
The rest of this library makes extensive use of Unicode characters.
In order to interface this library with the outside world, your
application may need to make various Unicode transformations.
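As an illustration of the kind of transformation involved, the
following sketch encodes a single Unicode code point as UTF-8.  It is
self-contained and does not call Libidn; the name
@code{demo_ucs4_to_utf8} is an assumption made for this example only.

@verbatim
```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point C as UTF-8 into OUT, which must have room for
   4 bytes.  Returns the number of bytes written, or 0 if C is a
   surrogate or lies above U+10FFFF.  Hypothetical helper, not a
   Libidn function. */
size_t
demo_ucs4_to_utf8 (uint32_t c, unsigned char out[4])
{
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                   /* surrogates are not characters */
  if (c < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;                     /* outside the Unicode range */
}
```
@end verbatim

For example, U+00E4 is encoded as the two bytes 0xC3 0xA4, and U+20AC
as the three bytes 0xE2 0x82 0xAC.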
@section Header file @code{stringprep.h}
To use the functions explained in this chapter, you need to include
the file @file{stringprep.h} using:
@example
#include <stringprep.h>
@end example
@section Unicode Encoding Transformation
@include texi/stringprep_unichar_to_utf8.texi
@include texi/stringprep_utf8_to_unichar.texi
@include texi/stringprep_ucs4_to_utf8.texi
@include texi/stringprep_utf8_to_ucs4.texi
@section Unicode Normalization
@include texi/stringprep_ucs4_nfkc_normalize.texi
@include texi/stringprep_utf8_nfkc_normalize.texi
@section Character Set Conversion
@include texi/stringprep_locale_charset.texi
@include texi/stringprep_convert.texi
@include texi/stringprep_locale_to_utf8.texi
@include texi/stringprep_utf8_to_locale.texi
@c **********************************************************
@c ****************** Stringprep Functions *****************
@c **********************************************************
@node Stringprep Functions
@chapter Stringprep Functions
@cindex Stringprep Functions
Stringprep describes a framework for preparing Unicode text strings in
order to increase the likelihood that string input and string
comparison work in ways that make sense for typical users throughout
the world. The stringprep protocol is useful for protocol identifier
values, company and personal names, internationalized domain names,
and other text strings.
@section Header file @code{stringprep.h}
To use the functions explained in this chapter, you need to include
the file @file{stringprep.h} using:
@example
#include <stringprep.h>
@end example
@section Defining A Stringprep Profile
Further types and structures are defined for applications that want to
specify their own stringprep profile. As these are fairly obscure,
and by necessity tied to the implementation, we do not document them
here. Look into the @file{stringprep.h} header file, and the
@file{profiles.c} source code for the details.
@section Control Flags
@deftypevr {Stringprep flags} {Stringprep_profile_flags} {STRINGPREP_NO_NFKC}
Disable the NFKC normalization, as well as selecting the non-NFKC case
folding tables. Usually the profile specifies BIDI and NFKC settings,
and applications should not override it unless in special situations.
@end deftypevr
@deftypevr {Stringprep flags} {Stringprep_profile_flags} {STRINGPREP_NO_BIDI}
Disable the BIDI step. Usually the profile specifies BIDI and NFKC
settings, and applications should not override it unless in special
situations.
@end deftypevr
@deftypevr {Stringprep flags} {Stringprep_profile_flags} {STRINGPREP_NO_UNASSIGNED}
Make the library return with an error if string contains unassigned
characters according to profile.
@end deftypevr
@section Core Functions
@include texi/stringprep_4i.texi
@include texi/stringprep_4zi.texi
@include texi/stringprep.texi
@include texi/stringprep_profile.texi
@section Error Handling
@include texi/stringprep_strerror.texi
@section Stringprep Profile Macros
@deftypefun {int} stringprep_nameprep_no_unassigned (char * @var{in}, int @var{maxlen})
@var{in}: input/output array with string to prepare.
@var{maxlen}: maximum length of input/output array.
Prepare the input UTF-8 string according to the nameprep profile. The
AllowUnassigned flag is false; use @code{stringprep_nameprep} for
true AllowUnassigned. Returns 0 iff successful, or an error code.
@end deftypefun
@deftypefun {int} stringprep_iscsi (char * @var{in}, int @var{maxlen})
@var{in}: input/output array with string to prepare.
@var{maxlen}: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft iSCSI stringprep
profile. Returns 0 iff successful, or an error code.
@end deftypefun
@deftypefun {int} stringprep_plain (char * @var{in}, int @var{maxlen})
@var{in}: input/output array with string to prepare.
@var{maxlen}: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft SASL ANONYMOUS
profile. Returns 0 iff successful, or an error code.
@end deftypefun
@deftypefun {int} stringprep_xmpp_nodeprep (char * @var{in}, int @var{maxlen})
@var{in}: input/output array with string to prepare.
@var{maxlen}: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft XMPP node
identifier profile. Returns 0 iff successful, or an error code.
@end deftypefun
@deftypefun {int} stringprep_xmpp_resourceprep (char * @var{in}, int @var{maxlen})
@var{in}: input/output array with string to prepare.
@var{maxlen}: maximum length of input/output array.
Prepare the input UTF-8 string according to the draft XMPP resource
identifier profile. Returns 0 iff successful, or an error code.
@end deftypefun
@c **********************************************************
@c ******************* Punycode Functions ******************
@c **********************************************************
@node Punycode Functions
@chapter Punycode Functions
@cindex Punycode Functions
Punycode is a simple and efficient transfer encoding syntax designed
for use with Internationalized Domain Names in Applications. It
uniquely and reversibly transforms a Unicode string into an ASCII
string. ASCII characters in the Unicode string are represented
literally, and non-ASCII characters are represented by ASCII
characters that are allowed in host name labels (letters, digits, and
hyphens). A general algorithm called Bootstring allows a string of
basic code points to uniquely represent any string of code points
drawn from a larger set. Punycode is an instance of Bootstring that
uses particular parameter values, appropriate for IDNA.
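To make the description above concrete, here is a compact sketch of
the encoding direction, written from the Bootstring procedure in
RFC 3492.  It is not the Libidn implementation: the @code{demo_} names
are assumptions made for this example, case flags are not handled, and
the overflow checks that the RFC requires are left out for brevity.

@verbatim
```c
#include <stddef.h>
#include <stdint.h>

/* Bootstring parameters for Punycode (RFC 3492, section 5). */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* Append C to OUT, always leaving room for the terminating zero. */
#define DEMO_PUT(c) \
  do { if (o + 2 > outsize) return -1; out[o++] = (c); } while (0)

/* Map a digit value 0..35 to 'a'..'z', '0'..'9'. */
static char
demo_digit (uint32_t d)
{
  return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Bias adaptation function (RFC 3492, section 6.1). */
static uint32_t
demo_adapt (uint32_t delta, uint32_t numpoints, int first)
{
  uint32_t k = 0;

  delta = first ? delta / DAMP : delta / 2;
  delta += delta / numpoints;
  while (delta > ((BASE - TMIN) * TMAX) / 2)
    {
      delta /= BASE - TMIN;
      k += BASE;
    }
  return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Encode the INLEN code points in IN as a zero-terminated string in
   OUT.  Returns 0 on success, -1 if OUT (OUTSIZE bytes) is too small.
   Sketch only: no overflow detection. */
int
demo_punycode_encode (const uint32_t *in, size_t inlen,
                      char *out, size_t outsize)
{
  uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  size_t h, b, j, o = 0;

  if (outsize == 0)
    return -1;
  /* Basic (ASCII) code points are copied literally. */
  for (j = 0; j < inlen; j++)
    if (in[j] < 0x80)
      DEMO_PUT ((char) in[j]);
  h = b = o;
  if (b > 0)
    DEMO_PUT ('-');             /* delimiter after the basic part */
  while (h < inlen)
    {
      /* Find the smallest code point >= n that is still unhandled. */
      uint32_t m = UINT32_MAX;
      for (j = 0; j < inlen; j++)
        if (in[j] >= n && in[j] < m)
          m = in[j];
      delta += (m - n) * (uint32_t) (h + 1);
      n = m;
      for (j = 0; j < inlen; j++)
        {
          if (in[j] < n)
            delta++;
          if (in[j] == n)
            {
              /* Emit delta as a generalized variable-length integer. */
              uint32_t q = delta, k;
              for (k = BASE;; k += BASE)
                {
                  uint32_t t = k <= bias ? TMIN
                    : k >= bias + TMAX ? TMAX : k - bias;
                  if (q < t)
                    break;
                  DEMO_PUT (demo_digit (t + (q - t) % (BASE - t)));
                  q = (q - t) / (BASE - t);
                }
              DEMO_PUT (demo_digit (q));
              bias = demo_adapt (delta, (uint32_t) (h + 1), h == b);
              delta = 0;
              h++;
            }
        }
      delta++;
      n++;
    }
  out[o] = '\0';
  return 0;
}
```
@end verbatim

Given the ten code points of the Swedish word used in the
@command{idn} examples later in this manual (U+0072 U+00E4 U+006B
U+0073 U+006D U+00F6 U+0072 U+0067 U+00E5 U+0073), the sketch produces
@samp{rksmrgs-5wao1o}, which is the part that follows the @code{xn--}
prefix in those examples.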
@section Header file @code{punycode.h}
To use the functions explained in this chapter, you need to include
the file @file{punycode.h} using:
@example
#include <punycode.h>
@end example
@section Unicode Code Point Data Type
The punycode functions use a special type to denote Unicode code
points. It is guaranteed to always be a 32 bit unsigned integer.
@deftypevr {Punycode Unicode code point} uint32_t punycode_uint
An unsigned integer that holds Unicode code points.
@end deftypevr
@section Core Functions
Note that the current implementation will fail if the
@code{input_length} exceeds 4294967295 (the maximum value of
@code{punycode_uint}).  This restriction may be removed in the future.
Meanwhile, applications are encouraged not to depend on this
restriction, and to use @code{sizeof} to initialize
@code{input_length} and @code{output_length}.
The functions provided are the following two entry points:
@include texi/punycode_encode.texi
@include texi/punycode_decode.texi
@section Error Handling
@include texi/punycode_strerror.texi
@c **********************************************************
@c ********************* IDNA Functions *********************
@c **********************************************************
@node IDNA Functions
@chapter IDNA Functions
@cindex IDNA Functions
Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. The IDNA document defines
internationalized domain names (IDNs) and a mechanism called IDNA for
handling them in a standard fashion. IDNs use characters drawn from a
large repertoire (Unicode), but IDNA allows the non-ASCII characters
to be represented using only the ASCII characters already allowed in
so-called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
@section Header file @code{idna.h}
To use the functions explained in this chapter, you need to include
the file @file{idna.h} using:
@example
#include <idna.h>
@end example
@section Control Flags
The IDNA @code{flags} parameter can take on the following values, or a
bit-wise inclusive or of any subset of the parameters:
@deftypevr {Return code} {Idna_flags} IDNA_ALLOW_UNASSIGNED
Allow unassigned Unicode code points.
@end deftypevr
@deftypevr {Return code} {Idna_flags} IDNA_USE_STD3_ASCII_RULES
Check output to make sure it is a STD3 conforming host name.
@end deftypevr
@section Prefix String
@deftypevr {Macro} {#define} IDNA_ACE_PREFIX
String with the official IDNA prefix, @code{xn--}.
@end deftypevr
@section Core Functions
The idea behind the IDNA function names is as follows: the
@code{idna_to_ascii_4i} and @code{idna_to_unicode_44i} functions are
the core IDNA primitives.  The @code{4} indicates that the function
takes UCS-4 strings (i.e., Unicode code points encoded in a 32-bit
unsigned integer type) of the specified length.  The @code{i} indicates
that the data is written ``inline'' into the buffer. This means the
caller is responsible for allocating (and de-allocating) the string,
and providing the library with the allocated length of the string.
The output length is written in the output length variable. The
remaining functions all contain the @code{z} indicator, which means
the strings are zero terminated. All output strings are allocated by
the library, and must be de-allocated by the caller. The @code{4}
indicator again means that the string is UCS-4, the @code{8} means the
strings are UTF-8 and the @code{l} indicator means the strings are
encoded in the encoding used by the current locale.
The functions provided are the following entry points:
@include texi/idna_to_ascii_4i.texi
@include texi/idna_to_unicode_44i.texi
@section Simplified ToASCII Interface
@include texi/idna_to_ascii_4z.texi
@include texi/idna_to_ascii_8z.texi
@include texi/idna_to_ascii_lz.texi
@section Simplified ToUnicode Interface
@include texi/idna_to_unicode_4z4z.texi
@include texi/idna_to_unicode_8z4z.texi
@include texi/idna_to_unicode_8z8z.texi
@include texi/idna_to_unicode_8zlz.texi
@include texi/idna_to_unicode_lzlz.texi
@section Error Handling
@include texi/idna_strerror.texi
@c **********************************************************
@c ********************** TLD Functions *********************
@c **********************************************************
@node TLD Functions
@chapter TLD Functions
@cindex TLD Functions
Organizations that manage some Top Level Domains (TLDs) have published
tables with characters they accept within the domain.  The reason may
be to reduce the complexity that comes from using the full Unicode
range, and to protect themselves from future (backwards incompatible)
changes in the IDN or Unicode specifications.  Libidn implements an
infrastructure for defining and checking strings against such tables.
Libidn also ships some tables from TLDs that have given us permission
to use them.  Because these tables are even less
static than Unicode or StringPrep tables, it is likely that they will
be updated from time to time (even in backwards incompatible ways).
The Libidn interface provides a ``version'' field for each TLD table,
which can be compared for equality to guarantee the same operation
over time.
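Checking a string against such a table amounts to testing every code
point for membership in a set of permitted ranges.  The following
sketch shows one way to do that with a sorted array of inclusive
ranges and a binary search.  The type and function names, and the
sample table, are made up for this illustration; they are not the
Libidn data types, and the table does not reflect any real registry.

@verbatim
```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of permitted code points (hypothetical type). */
struct demo_range
{
  uint32_t start;               /* first code point in the range */
  uint32_t end;                 /* last one; equals start if single */
};

/* Made-up sample table, sorted by start.  Not a real TLD table. */
static const struct demo_range demo_table[] = {
  { 0x002D, 0x002D },           /* hyphen */
  { 0x0030, 0x0039 },           /* digits */
  { 0x0061, 0x007A },           /* a-z */
  { 0x00E5, 0x00E6 },
  { 0x00F8, 0x00F8 }
};
#define DEMO_TABLE_LEN (sizeof demo_table / sizeof demo_table[0])

/* Return 1 if CP lies in one of the N sorted ranges, otherwise 0. */
static int
demo_range_contains (const struct demo_range *valid, size_t n,
                     uint32_t cp)
{
  size_t lo = 0, hi = n;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (cp < valid[mid].start)
        hi = mid;
      else if (cp > valid[mid].end)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

/* Check the INLEN code points in IN against the table.  Returns 0 if
   all are permitted.  Otherwise returns -1 and stores the index of
   the first offending code point in *ERRPOS. */
int
demo_table_check (const uint32_t *in, size_t inlen,
                  const struct demo_range *valid, size_t n,
                  size_t *errpos)
{
  size_t i;

  for (i = 0; i < inlen; i++)
    if (!demo_range_contains (valid, n, in[i]))
      {
        *errpos = i;
        return -1;
      }
  return 0;
}
```
@end verbatim

With the sample table, a string containing U+00E5 and U+00E6 passes,
while a string containing U+00E4 is rejected and the position of that
character is reported.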
From a design point of view, you can regard the TLD tables for IDN as
the ``localization'' step that comes after the ``internationalization''
step provided by the IETF standards.
The TLD functionality relies on up-to-date tables.  The latest version
of Libidn aims to provide these, but tables with unclear copying
conditions, or generally experimental tables, are not included. Some
such tables can be found at @url{https://github.com/gnuthor/tldchk}.
@section Header file @code{tld.h}
To use the functions explained in this chapter, you need to include
the file @file{tld.h} using:
@example
#include <tld.h>
@end example
@c @section Data Types
@c
@c @deftp {Data type} {Tld_table_element} @var{start} @var{end}
@c @example
@c /* Interval of valid code points in the TLD. */
@c struct Tld_table_element
@c @{
@c uint32_t start; /* Start of range. */
@c uint32_t end; /* End of range, end == start if single. */
@c @};
@c typedef struct Tld_table_element Tld_table_element;
@c @end example
@c This @code{struct} contain the @var{start} and @var{end} positions
@c (inclusive) of a range. If the range is a single (i.e., starts and
@c ends in the same character), then set @var{end} to the same as
@c @var{start}. This structure is normally used as an array.
@c @end deftp
@c
@c @deftp {Data type} {Tld_table} @var{name} @var{version} @var{nvalid} @var{valid}
@c @example
@c /* List valid code points in a TLD. */
@c struct Tld_table
@c @{
@c char *name; /* TLD name, e.g., "no". */
@c char *version; /* Version string from TLD file. */
@c size_t nvalid; /* Number of entries in data. */
@c Tld_table_element *valid[]; /* Sorted array of valid code points. */
@c @};
@c typedef struct Tld_table Tld_table;
@c @end example
@c In this @code{struct}, the @var{name} field is a string (@samp{char*})
@c indicating the TLD name (e.g., ``no''). The @var{version} field is a
@c string (@samp{char*}) containing a free form humanly readable string
@c that can be used for equality comparison to compare different versions
@c of the table. The @var{nvalid} field indicate how many entries there
@c are in @var{valid}, which brings us finally to @var{valid} that
@c contain the actual code points that are valid for this TLD (see
@c @code{Tld_table_element} above).
@c @end deftp
@section Core Functions
@include texi/tld_check_4t.texi
@include texi/tld_check_4tz.texi
@section Utility Functions
@include texi/tld_get_4.texi
@include texi/tld_get_4z.texi
@include texi/tld_get_z.texi
@include texi/tld_get_table.texi
@include texi/tld_default_table.texi
@section High-Level Wrapper Functions
@include texi/tld_check_4.texi
@include texi/tld_check_4z.texi
@include texi/tld_check_8z.texi
@include texi/tld_check_lz.texi
@section Error Handling
@include texi/tld_strerror.texi
@c **********************************************************
@c ********************** PR29 Functions ********************
@c **********************************************************
@node PR29 Functions
@chapter PR29 Functions
@cindex PR29 Functions
A deficiency in the specification of Unicode Normalization Forms has
been found. The consequence is that some strings can be normalized
into different strings by different implementations. In other words,
two different implementations may return different output for the same
input (because the interpretation of the specification is
ambiguous).  Further, an implementation invoked again on one of the
output strings may return a different string (because one of the
interpretations of the ambiguous specification makes normalization
non-idempotent).  Fortunately, only a select few character sequences
exhibit this problem, and none of them are expected to occur in
natural languages (due to different linguistic uses of the involved
characters).
A full discussion of the problem may be found at:
@url{http://www.unicode.org/review/pr-29.html}
The PR29 functions below allow you to detect the problem sequences.  So
when would you want to use these functions? For most applications,
such as those using Nameprep for IDN, this is likely only to be an
interoperability problem. Thus, you may not want to care about it, as
the character sequences will rarely occur naturally. However, if you
are using a profile, such as SASLPrep, to process authentication
tokens, authorization tokens, or passwords, there is a real danger
that attackers may try to use the peculiarities in these strings to
attack parts of your system. As only a small number of strings, and
no naturally occurring strings, exhibit this problem, the conservative
approach of rejecting the strings is recommended. If this approach is
not used, you should instead verify that all parts of your system
that process the tokens and passwords use an NFKC implementation that
produces the same output for the same input.
Technically inclined readers may be interested in knowing more about
the implementation aspects of the PR29 flaw. @xref{PR29 discussion}.
@section Header file @code{pr29.h}
To use the functions explained in this chapter, you need to include
the file @file{pr29.h} using:
@example
#include <pr29.h>
@end example
@section Core Functions
@include texi/pr29_4.texi
@section Utility Functions
@include texi/pr29_4z.texi
@include texi/pr29_8z.texi
@section Error Handling
@include texi/pr29_strerror.texi
@c **********************************************************
@c *********************** Examples ***********************
@c **********************************************************
@node Examples
@chapter Examples
@cindex Examples
This chapter contains example code which illustrates how `Libidn' can
be used when writing your own application.
@menu
* Example 1:: Example using stringprep.
* Example 2:: Example using punycode.
* Example 3:: Example using IDNA ToASCII.
* Example 4:: Example using IDNA ToUnicode.
* Example 5:: Example using TLD checking.
@end menu
@node Example 1
@section Example 1
This example demonstrates how the stringprep functions are used.
@verbatiminclude example.c
@node Example 2
@section Example 2
This example demonstrates how the punycode functions are used.
@verbatiminclude example2.c
@node Example 3
@section Example 3
This example demonstrates how the library is used to convert
internationalized domain names into ASCII compatible names.
@verbatiminclude example3.c
@node Example 4
@section Example 4
This example demonstrates how the library is used to convert ASCII
compatible names to internationalized domain names.
@verbatiminclude example4.c
@node Example 5
@section Example 5
This example demonstrates how the library is used to check a string
for invalid characters within a specific TLD.
@verbatiminclude example5.c
@c **********************************************************
@c ********************* Invoking idn *********************
@c **********************************************************
@node Invoking idn
@chapter Invoking idn
@pindex idn
@cindex invoking @command{idn}
@cindex command line
@section Name
GNU Libidn (idn) -- Internationalized Domain Names command line tool
@section Description
@code{idn} allows internationalized string preparation
(@samp{stringprep}), encoding and decoding of punycode data, and IDNA
ToASCII/ToUnicode operations to be performed on the command line.
If strings are specified on the command line, they are used as input
and the computed output is printed to standard output @code{stdout}.
If no strings are specified on the command line, the program reads
data, line by line, from the standard input @code{stdin}, and prints
the computed output to standard output.  What processing is performed
(e.g., ToASCII, or Punycode encode) is indicated by options.  If any
errors are encountered, the execution of the application is aborted.
All strings are expected to be encoded in the preferred charset used
by your locale. Use @code{--debug} to find out what this charset is.
You can override the charset used by setting environment variable
@code{CHARSET}.
To process a string that starts with @code{-}, for example
@code{-foo}, use @code{--} to signal the end of parameters, as in
@code{idn --quiet -a -- -foo}.
@section Options
@code{idn} recognizes these options:
@verbatim
-h, --help Print help and exit
-V, --version Print version and exit
-s, --stringprep Prepare string according to nameprep profile
-d, --punycode-decode Decode Punycode
-e, --punycode-encode Encode Punycode
-a, --idna-to-ascii Convert to ACE according to IDNA (default mode)
-u, --idna-to-unicode Convert from ACE according to IDNA
--allow-unassigned Toggle IDNA AllowUnassigned flag (default off)
--usestd3asciirules Toggle IDNA UseSTD3ASCIIRules flag (default off)
--no-tld Don't check string for TLD specific rules
Only for --idna-to-ascii and --idna-to-unicode
-n, --nfkc Normalize string according to Unicode v3.2 NFKC
-p, --profile=STRING Use specified stringprep profile instead
Valid stringprep profiles: `Nameprep',
`iSCSI', `Nodeprep', `Resourceprep',
`trace', `SASLprep'
--debug Print debugging information
--quiet Silent operation
@end verbatim
@section Environment Variables
The @var{CHARSET} environment variable can be used to override which
character set is used for decoding incoming data (i.e., on the
command line or on the standard input stream), and for encoding data
written to the standard output.  If your system is set up correctly,
however, the application will guess which character set is used
automatically.  Example usage:
Example usage:
@example
$ CHARSET=ISO-8859-1 idn --punycode-encode
...
@end example
@section Examples
Standard usage, reading input from standard input:
@example
jas@@latte:~$ idn
libidn 0.3.5
Copyright 2002, 2003 Simon Josefsson.
GNU Libidn comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Libidn under the terms of
the GNU Lesser General Public License. For more information
about these matters, see the file named COPYING.LIB.
Type each input string on a line by itself, terminated by a newline character.
r@"aksm@"org@aa{}s.se
xn--rksmrgs-5wao1o.se
jas@@latte:~$
@end example
Reading input from command line, and disabling copyright and license
information:
@example
jas@@latte:~$ idn --quiet r@"aksm@"org@aa{}s.se bl@aa{}b@ae{}rgr@o{}d.no
xn--rksmrgs-5wao1o.se
xn--blbrgrd-fxak7p.no
jas@@latte:~$
@end example
Accessing a specific StringPrep profile directly:
@example
jas@@latte:~$ idn --quiet --profile=SASLprep --stringprep te@ss{}t@ordf{}
te@ss{}ta
jas@@latte:~$
@end example
@section Troubleshooting
Getting character data encoded right, and making sure Libidn uses the
same encoding, can be difficult.  The reason for this is that most
systems encode character data in more than one character encoding,
e.g., using @code{UTF-8} together with @code{ISO-8859-1} or
@code{ISO-2022-JP}.  This problem is likely to continue to exist until
only one character encoding comes out as the evolutionary winner, or
(more likely, at least to some extent) forever.
The first step to troubleshooting character encoding problems with
Libidn is to use the @samp{--debug} parameter to find out which
character set encoding @samp{idn} believes your locale uses.
@example
jas@@latte:~$ idn --debug --quiet ""
system locale uses charset `UTF-8'.
jas@@latte:~$
@end example
If it prints @code{ANSI_X3.4-1968} (i.e., @code{US-ASCII}), this
indicates you have not configured your locale properly.  To configure
the locale, you can, for example, use @samp{LANG=sv_SE.UTF-8; export
LANG} at a @code{/bin/sh} prompt, to set up your locale for a Swedish
environment using @code{UTF-8} as the encoding.
Sometimes @samp{idn} appears to be unable to translate from your system
locale into @code{UTF-8} (which is used internally), and you get an
error like the following:
@example
jas@@latte:~$ idn --quiet foo
idn: could not convert from ISO-8859-1 to UTF-8.
jas@@latte:~$
@end example
The simplest explanation is that you haven't installed the
@samp{iconv} conversion tools.  You can find them as a standalone
library in GNU Libiconv
(@uref{http://www.gnu.org/software/libiconv/}). On many GNU/Linux
systems, this library is part of the system, but you may have to
install additional packages (e.g., @samp{glibc-locale} for Debian) to
be able to use it.
Another explanation is that the error is correct and you are feeding
@samp{idn} invalid data. This can happen inadvertently if you are not
careful with the character set encoding you use. For example, if your
shell runs in an @code{ISO-8859-1} environment, and you invoke
@samp{idn} with the @samp{CHARSET} environment variable as follows,
you will feed it @code{ISO-8859-1} characters but force it to believe
they are @code{UTF-8}. Naturally this will lead to an error, unless
the byte sequences happen to be valid @code{UTF-8}. Note that even if
you don't get an error, the output may be incorrect in this situation,
because @code{ISO-8859-1} and @code{UTF-8} do not in general encode
the same characters as the same byte sequences.
@example
jas@@latte:~$ idn --quiet --debug ""
system locale uses charset `ISO-8859-1'.
jas@@latte:~$ CHARSET=UTF-8 idn --quiet --debug r@"aksm@"org@aa{}s
system locale uses charset `UTF-8'.
input[0] = U+0072
input[1] = U+4af3
input[2] = U+006d
input[3] = U+1b29e5
input[4] = U+0073
output[0] = U+0078
output[1] = U+006e
output[2] = U+002d
output[3] = U+002d
output[4] = U+0072
output[5] = U+006d
output[6] = U+0073
output[7] = U+002d
output[8] = U+0068
output[9] = U+0069
output[10] = U+0036
output[11] = U+0064
output[12] = U+0035
output[13] = U+0039
output[14] = U+0037
output[15] = U+0035
output[16] = U+0035
output[17] = U+0032
output[18] = U+0061
xn--rms-hi6d597552a
jas@@latte:~$
@end example
The moral here is to forget about @samp{CHARSET} (configure your
locales properly instead) unless you know what you are doing, and if
you want to use it, do it carefully, after verifying with
@samp{--debug} that you get the desired results.
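The mismatch can be reproduced outside of @samp{idn}. The following is
a minimal sketch in Python, not part of Libidn; the helper name
@code{decode_as_utf8} is made up for illustration. Python's decoder is
strict, so it rejects the bytes where the @samp{idn} run above produced
bogus code points instead. The second case shows
@code{ISO-8859-1} bytes that happen to be valid @code{UTF-8}: no error
is raised, but the text silently changes.

@example
```python
# Written with escapes so that this example itself stays pure ASCII.
word = "r\u00e4ksm\u00f6rg\u00e5s"

# The bytes an ISO-8859-1 shell actually hands to the program.
latin1_bytes = word.encode("iso-8859-1")


def decode_as_utf8(data):
    """Return the decoded text, or None when the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Case 1: the ISO-8859-1 bytes are not valid UTF-8, so decoding fails.
rejected = decode_as_utf8(latin1_bytes)

# Case 2: two ISO-8859-1 characters (U+00C3, U+00A4) whose bytes happen
# to form one valid UTF-8 sequence. No error, but the text is different.
lucky_bytes = "\u00c3\u00a4".encode("iso-8859-1")
silently_wrong = decode_as_utf8(lucky_bytes)

print(rejected)
print(len(silently_wrong))
```
@end example

Here @code{rejected} is @code{None}, while @code{silently_wrong} is the
single character U+00E4 rather than the two characters that were typed.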
@node Emacs API
@chapter Emacs API
Included in Libidn are @file{punycode.el} and @file{idna.el}, which
provide an Emacs Lisp API to (a limited set of) the Libidn API. This
chapter describes the API. Currently the IDNA API always sets the
@code{UseSTD3ASCIIRules} flag and clears the @code{AllowUnassigned}
flag; in the future there may be functionality to specify these flags
via the API.
@section Punycode Emacs API
@defvar punycode-program
Name of the GNU Libidn @file{idn} application. The default is
@samp{idn}. This variable can be customized.
@end defvar
@defvar punycode-environment
List of environment variable definitions prepended to
@samp{process-environment}. The default is @samp{("CHARSET=UTF-8")}.
This variable can be customized.
@end defvar
@defvar punycode-encode-parameters
List of parameters passed to @var{punycode-program} to invoke punycode
encoding mode. The default is @samp{("--quiet" "--punycode-encode")}.
This variable can be customized.
@end defvar
@defvar punycode-decode-parameters
List of parameters passed to @var{punycode-program} to invoke punycode
decoding mode. The default is @samp{("--quiet" "--punycode-decode")}.
This variable can be customized.
@end defvar
@defun punycode-encode string
Returns a Punycode encoding of the @var{string}, after converting the
input into UTF-8.
@end defun
@defun punycode-decode string
Returns a possibly multibyte string that is the decoding of
@var{string}, which must be a Punycode-encoded string.
@end defun
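To check the values these functions should return, it can help to
compare against an independent Punycode implementation. The sketch
below uses the @code{punycode} codec from Python's standard library as
a stand-in; it is not Libidn and does not call @file{punycode.el}.

@example
```python
# Written with escapes so that this example itself stays pure ASCII.
text = "r\u00e4ksm\u00f6rg\u00e5s"

# Encoding yields ASCII-only bytes; note there is no "xn--" prefix,
# because that prefix belongs to IDNA, not to Punycode itself.
encoded = text.encode("punycode")

# Decoding restores the original string.
decoded = encoded.decode("punycode")

print(encoded.decode("ascii"))
print(decoded == text)
```
@end example

The encoded form is @samp{rksmrgs-5wao1o}, which is the part after
@code{xn--} in the IDNA examples elsewhere in this manual.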
@section IDNA Emacs API
@defvar idna-program
Name of the GNU Libidn @file{idn} application. The default is
@samp{idn}. This variable can be customized.
@end defvar
@defvar idna-environment
List of environment variable definitions prepended to
@samp{process-environment}. The default is @samp{("CHARSET=UTF-8")}.
This variable can be customized.
@end defvar
@defvar idna-to-ascii-parameters
List of parameters passed to @var{idna-program} to invoke IDNA ToASCII
mode. The default is @samp{("--quiet" "--idna-to-ascii"
"--usestd3asciirules")}. This variable can be customized.
@end defvar
@defvar idna-to-unicode-parameters
List of parameters passed to @var{idna-program} to invoke IDNA ToUnicode mode.
The default is @samp{("--quiet" "--idna-to-unicode"
"--usestd3asciirules")}. This variable can be customized.
@end defvar
@defun idna-to-ascii string
Returns an ASCII Compatible Encoding (ACE) of the string computed by
the IDNA ToASCII operation on the input @var{string}, after converting
the input to UTF-8.
@end defun
@defun idna-to-unicode string
Returns a possibly multibyte string which is the output of the IDNA
ToUnicode operation computed on the input @var{string}.
@end defun
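The variables above describe an external process invocation: a program
name, extra environment definitions, and a parameter list. The
following Python sketch mirrors that arrangement for readers who want
to reproduce it outside Emacs. It is an illustration only, not a
transcription of @file{idna.el}; the names @code{build_invocation} and
@code{idna_to_ascii} are hypothetical, and feeding the string to
@samp{idn} on standard input as UTF-8 is an assumption of this sketch.

@example
```python
import os
import subprocess

# Defaults copied from the variable descriptions above.
IDNA_PROGRAM = "idn"
IDNA_ENVIRONMENT = ["CHARSET=UTF-8"]
IDNA_TO_ASCII_PARAMETERS = ["--quiet", "--idna-to-ascii", "--usestd3asciirules"]


def build_invocation(parameters=IDNA_TO_ASCII_PARAMETERS):
    """Return (argv, env) for one run of the idn program."""
    env = dict(os.environ)
    # The extra definitions take precedence over inherited values.
    for definition in IDNA_ENVIRONMENT:
        name, _, value = definition.partition("=")
        env[name] = value
    argv = [IDNA_PROGRAM] + list(parameters)
    return argv, env


def idna_to_ascii(string):
    """Run idn on one string; requires the idn program to be installed."""
    argv, env = build_invocation()
    # Assumption: the input is sent on standard input, encoded as UTF-8.
    data = (string + "\n").encode("utf-8")
    result = subprocess.run(argv, env=env, input=data,
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8").strip()
```
@end example

Only @code{idna_to_ascii} starts a process; @code{build_invocation} can
be inspected on its own to see which command line and environment would
be used.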
@node Java API
@chapter Java API
Libidn has been ported to the Java programming language, and as a
consequence most of the API is available to native Java applications.
This chapter contains notes on this support; complete documentation is
pending.
The Java library, if Libidn has been built with Java support
(@pxref{Downloading and Installing}), will be placed in
@file{java/libidn-@value{VERSION}.jar}. The source code is located in
@file{java/gnu/inet/encoding/}.
@section Overview
This package provides a Java implementation of the Internationalized
Domain Names in Applications (IDNA) standard. It is written entirely
in Java and does not require any additional libraries to be set up.
The @code{gnu.inet.encoding.IDNA} class offers two public functions,
@code{toASCII} and @code{toUnicode}, which can be used as follows:
@example
gnu.inet.encoding.IDNA.toASCII("bl@"ods.z@"ug");
gnu.inet.encoding.IDNA.toUnicode("xn--blds-6qa.xn--zg-xka");
@end example
@section Miscellaneous Programs
The @file{misc/} directory contains several programs that are related
to the Java part of GNU Libidn, but that don't need to be included in
the main source tree.
@subsection GenerateRFC3454
This program parses RFC3454 and creates the RFC3454.java program that
is required during the StringPrep phase.
The RFC can be found at various locations, for example at
@url{http://www.ietf.org/rfc/rfc3454.txt}.
Invoke the program as follows:
@example
$ java GenerateRFC3454
Creating RFC3454.java... Ok.
@end example
@subsection GenerateNFKC
The GenerateNFKC program parses the Unicode character database file
and generates all the tables required for NFKC. This program requires
the two files UnicodeData.txt and CompositionExclusions.txt of version
3.2 of the Unicode files. Note that RFC3454 (Stringprep) defines that
Unicode version 3.2 is to be used, not the latest version.
The Unicode data files can be found at
@url{http://www.unicode.org/Public/}.
Invoke the program as follows:
@example
$ java GenerateNFKC
Creating CombiningClass.java... Ok.
Creating DecompositionKeys.java... Ok.
Creating DecompositionMappings.java... Ok.
Creating Composition.java... Ok.
@end example
@subsection TestIDNA
The TestIDNA program lets you test the IDNA implementation manually
or against Simon Josefsson's test vectors.
The test vectors can be found at the Libidn homepage,
@url{http://www.gnu.org/software/libidn/}.
To test the transformation manually, use:
@example
$ java -cp .:../libidn.jar TestIDNA -a <string to test>
Input: <string to test>
Output: <toASCII(string to test)>
$ java -cp .:../libidn.jar TestIDNA -u <string to test>
Input: <string to test>
Output: <toUnicode(string to test)>
@end example
To test against draft-josefsson-idn-test-vectors.html, use:
@example
$ java -cp .:../libidn.jar TestIDNA -t
No errors detected!
@end example
@subsection TestNFKC
The TestNFKC program lets you test the NFKC implementation manually
or against the NormalizationTest.txt file from the Unicode data files.
To test the normalization manually, use:
@example
$ java -cp .:../libidn.jar TestNFKC <string to test>
Input: <string to test>
Output: <nfkc version of the string to test>
@end example
To test against NormalizationTest.txt:
@example
$ java -cp .:../libidn.jar TestNFKC
No errors detected!
@end example
@section Possible Problems
@table @asis
@item Beware of bugs
This Java API needs a lot more testing, especially with ``exotic''
character sets. While it works for me, it may not work for you.
@item Encoding of your Java sources
If you are using non-ASCII characters in your Java source code, make
sure @samp{javac} compiles your programs with the correct encoding.
If necessary, specify the encoding using the @samp{-encoding}
parameter.
@item Java Unicode handling
Java 1.4 only handles 16-bit Unicode code points (i.e., characters in
the Basic Multilingual Plane), so this implementation ignores all
references to so-called Supplementary Characters (U+10000 to
U+10FFFF). Starting from Java 1.5, these characters will also be
supported by Java, but this will require changes to this library. See
also the next section.
@end table
@section A Note on Java and Unicode
This library uses Java's built-in @code{char} datatype. Up to Java
1.4, this datatype only supports 16-bit Unicode code points, that is,
characters in the Basic Multilingual Plane. For this reason, this
library doesn't work for Supplementary Characters (i.e., characters
from U+10000 to U+10FFFF). All references to such characters are
silently ignored.
Starting from Java 1.5, Supplementary Characters will also be
supported. However, this will require changes in the present version
of the library. Java 1.5 is currently in beta status.
For more information refer to the documentation of java.lang.Character
in the JDK API.
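The reason a 16-bit datatype cannot hold a Supplementary Character is
that such a character needs two 16-bit code units (a surrogate pair) in
UTF-16. The Python sketch below illustrates the arithmetic; it is not
part of the Java library, and the helper names @code{utf16_units} and
@code{surrogate_pair} are made up for this example.

@example
```python
def utf16_units(ch):
    """Number of 16-bit UTF-16 code units needed for one character."""
    return len(ch.encode("utf-16-be")) // 2


def surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "\u00e4"            # inside the Basic Multilingual Plane
supplementary = "\U00010000"   # first Supplementary Character

print(utf16_units(bmp_char))
print(utf16_units(supplementary))
print([hex(unit) for unit in surrogate_pair(0x10000)])
```
@end example

A character in the Basic Multilingual Plane fits in one unit, while
U+10000 is represented by the two units 0xD800 and 0xDC00.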
@node C# API
@chapter C# API
The Libidn library has been ported to the C# language. The port
resides in the top-level @file{csharp/} directory. Currently, no
further documentation about the implementation or the API is
available. However, the C# port was based on the Java port, and the
API is exactly the same as in the Java version. The help files for
the Java API may thus be useful.
@c **********************************************************
@c ******************* Acknowledgements *******************
@c **********************************************************
@node Acknowledgements
@chapter Acknowledgements
The punycode implementation was taken from the IETF IDN Punycode
specification, by Adam M. Costello. The TLD code was contributed by
Thomas Jacob. The Java implementation was contributed by Oliver Hitz.
The C# implementation was contributed by Alexander Gnauck. The
Unicode tables were provided by Unicode, Inc. Some functions for
dealing with Unicode (see nfkc.c and toutf8.c) were borrowed from
GLib, downloaded from @url{http://www.gtk.org/}. The manual borrowed
text from Libgcrypt by Werner Koch.
Inspiration for many things that, consciously or not, have gone into
this package is due to a number of free software packages that the
author has been exposed to. The author wishes to acknowledge the free
software community in general, for giving an example of what role
software development can play in modern society.
Several people reported bugs, sent patches or suggested improvements,
see the file THANKS in the top-level directory of the source code.
@c **********************************************************
@c ************************ History ***********************
@c **********************************************************
@node History
@chapter History
The complete history of user visible changes is stored in the file
@file{NEWS} in the top-level directory of the source code tree. The
complete history of modifications to each file is stored in the file
@file{ChangeLog} in the same directory. This chapter contains a
condensed version of that information, in the form of ``milestones''
for the project.
@table @asis
@item Stringprep implementation.
Version 0.0.0 released on 2002-11-05.
@item IDNA and Punycode implementations, part of the GNU project.
Version 0.1.0 released on 2003-01-05.
@item Uses official IDNA ACE prefix @code{xn--}.
Version 0.1.7 released on 2003-02-12.
@item Command line interface.
Version 0.1.11 released on 2003-02-26.
@item GNU Libc add-on proposed.
Version 0.1.12 released on 2003-03-06.
@item Interoperability testing during IDNConnect.
Version 0.3.1 released on 2003-10-02.
@item TLD restriction testing.
Version 0.4.0 released on 2004-02-28.
@item GNU Libc add-on integrated.
Version 0.4.1 released on 2004-03-08.
@item Native Java implementation.
Version 0.4.2-0.4.9 released between 2004-03-20 and 2004-06-11.
@item PR-29 functions for ``problem sequences''.
Version 0.5.0 released on 2004-06-26.
@item Many small portability fixes and wider use.
Version 0.5.1 through 0.5.20, released between 2004-07-09 and
2005-10-23.
@item Native C# implementation.
Version 0.6.0 released on 2005-12-03.
@item Windows support through cross-compilation.
Version 0.6.1 released on 2006-01-20.
@item Library declared stable by releasing v1.0.
Version 1.0 released on 2007-07-31.
@end table
@node PR29 discussion
@appendix PR29 discussion
If you wish to experiment with a modified Unicode NFKC implementation
according to the PR29 proposal, you may find the following bug report
useful. However, I have not verified that the suggested modifications
are correct. For reference, I'm including my response to the report
as well.
@verbatim
From: Rick McGowan <rick@unicode.org>
Subject: Possible bug and status of PR 29 change(s)
To: bug-libidn@gnu.org
Date: Wed, 27 Oct 2004 14:49:17 -0700
Hello. On behalf of the Unicode Consortium editorial committee, I would
like to find out more information about the PR 29 fixes, if any, and
functions in Libidn. Your implementation was listed in the text of PR29 as
needing investigation, so I am following up on several implementations.
The UTC has accepted the proposed fix to D2 as outlined in PR29, and a new
draft of UAX #15 has been issued.
I have looked at Libidn 0.5.8 (today), and there may still be a possible
bug in NFKC.java and nfkc.c.
------------------------------------------------------
1. In NFKC.java, this line in canonicalOrdering():
if (i > 0 && (last_cc == 0 || last_cc != cc)) {
should perhaps be changed to:
if (i > 0 && (last_cc == 0 || last_cc < cc)) {
but I'm not sure of the sense of this comparison.
------------------------------------------------------
2. In nfkc.c, function _g_utf8_normalize_wc() has this code:
if (i > 0 &&
(last_cc == 0 || last_cc != cc) &&
combine (wc_buffer[last_start], wc_buffer[i],
&wc_buffer[last_start]))
{
This appears to have the same bug as the current Python implementation (in
Python 2.3.4). The code should be checking, as per new rule D2 UAX #15
update, that the next combining character is the same or HIGHER than the
current one. It now checks to see if it's non-zero and not equal.
The above line(s) should perhaps be changed to:
if (i > 0 &&
(last_cc == 0 || last_cc < cc) &&
combine (wc_buffer[last_start], wc_buffer[i],
&wc_buffer[last_start]))
{
but I'm not sure of the sense of the comparison (< or > or <=?) here.
In the text of PR29, I will be marking Libidn as "needs change" and adding
the version number that I checked. If any further change is made, please
let me know the release version, and I'll update again.
Regards,
Rick McGowan
@end verbatim
@verbatim
From: Simon Josefsson <jas@extundo.com>
Subject: Re: Possible bug and status of PR 29 change(s)
To: Rick McGowan <rick@unicode.org>
Cc: bug-libidn@gnu.org
Date: Thu, 28 Oct 2004 09:47:47 +0200
Rick McGowan <rick@unicode.org> writes:
> Hello. On behalf of the Unicode Consortium editorial committee, I would
> like to find out more information about the PR 29 fixes, if any, and
> functions in Libidn. Your implementation was listed in the text of PR29 as
> needing investigation, so I am following up on several implementations.
>
> The UTC has accepted the proposed fix to D2 as outlined in PR29, and a new
> draft of UAX #15 has been issued.
>
> I have looked at Libidn 0.5.8 (today), and there may still be a possible
> bug in NFKC.java and nfkc.c.
Hello Rick.
I believe the current behavior is intentional. Libidn do not aim to
implement latest-and-greatest NFKC, it aim to implement the NFKC
functionality required for StringPrep and IDN. As you may know,
StringPrep/IDN reference Unicode 3.2.0, and explicitly says any later
changes (which I consider PR29 as) do not apply.
In fact, I believe that would I incorporate the changes suggested in
PR29, I would in fact be violating the IDN specifications.
Thanks for looking into the code and finding the place where the
change could be made. I'll see if I can mention this in the manual
somewhere, for technically interested readers.
Regards,
Simon
@end verbatim
@node On Label Separators
@appendix On Label Separators
Some strings contain characters whose NFKC normalized form contains
the ASCII dot (0x2E, ``.''). Examples of these characters are U+2024
(ONE DOT LEADER) and U+248C (DIGIT FIVE FULL STOP). The strings have
the interesting property that their IDNA ToASCII output will contain
embedded dots. For example:
@example
ToASCII (hi U+248C com) = hi5.com
ToASCII (r@"aksm@"org@aa{}s U+2024 com) = xn--rksmrgs.com-l8as9u
@end example
These examples demonstrate the two general cases. In the first, the
ASCII dot is part of an output that does not begin with the IDN prefix
@code{xn--}. In the second, the dot is part of an output that is
prefixed with @code{xn--}.
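The normalization behind these examples can be observed with any NFKC
implementation. The sketch below uses Python's standard
@code{unicodedata} module rather than Libidn. That module follows a
newer Unicode version than the 3.2 tables Libidn uses, but the mappings
of these two characters are the same.

@example
```python
import unicodedata

# U+2024 ONE DOT LEADER normalizes to a plain ASCII dot.
one_dot_leader = unicodedata.normalize("NFKC", "\u2024")

# U+248C DIGIT FIVE FULL STOP normalizes to the digit followed by a dot.
digit_five_full_stop = unicodedata.normalize("NFKC", "\u248c")

# A whole label containing U+248C therefore gains an embedded dot.
label = unicodedata.normalize("NFKC", "hi\u248ccom")

print(repr(one_dot_leader), repr(digit_five_full_stop), repr(label))
```
@end example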
The input strings are, from the DNS point of view, a single label.
The IDNA algorithm translates one label at a time. Thus, the output is
expected to be only one label. What is important here is to make sure
the DNS resolver receives the correct query. The DNS protocol does
not use the dot to delimit labels on the wire, rather it uses
length-value pairs. Thus the correct query would be for
@code{@{7@}hi5.com} and @code{@{22@}xn--rksmrgs.com-l8as9u}
respectively.
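The length-value notation can be made concrete with a few lines of
Python. This is a sketch for illustration, not a DNS client: the
function name @code{wire_labels} is hypothetical, and the terminating
zero-length root label of a real query name is left out, as it is in
the notation used in this appendix.

@example
```python
def wire_labels(labels):
    """Encode labels as length-value pairs, as they appear on the wire."""
    out = bytearray()
    for label in labels:
        data = label.encode("ascii")
        if len(data) > 63:
            raise ValueError("a DNS label is at most 63 octets")
        out.append(len(data))   # one length octet ...
        out.extend(data)        # ... followed by the label itself
    return bytes(out)


# One label that happens to contain a dot.
single = wire_labels(["hi5.com"])

# Two labels, which is what splitting on the dot would produce.
split = wire_labels(["hi5", "com"])

print(single)
print(split)
```
@end example

The two byte strings differ, so a resolver given the single label
queries a different name than one given the two labels.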
Some implementations @footnote{Notably Microsoft's Internet Explorer
and Mozilla's Firefox, but not Apple's Safari.} have decided that
these input strings are potentially confusing for the user. The
string @code{hi U+248C com} looks like @code{hi5.com} on systems that
support Unicode properly. These implementations do not follow RFC
3490. They yield:
@example
ToASCII (hi U+248C com) = hi5.com
ToASCII (r@"aksm@"org@aa{}s U+2024 com) = xn--rksmrgs-5wao1o.com
@end example
The DNS queries they perform are @code{@{3@}hi5@{3@}com} and
@code{@{18@}xn--rksmrgs-5wao1o@{3@}com} respectively. Arguably, this
leads to a better user experience, and suggests that the IDNA
specification is sub-optimal in this area.
@section Recommended Workaround
It has been suggested to normalize the entire input string using NFKC
before passing it to IDNA ToASCII. You may use
@code{stringprep_utf8_nfkc_normalize} or
@code{stringprep_ucs4_nfkc_normalize}. This appears to lead to
behaviour similar to IE/Firefox, which would avoid the problem, but
this needs to be confirmed. Feel free to discuss the issue with us.
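The effect of normalizing first can be tried out with the following
Python sketch. It relies on Python's standard @code{unicodedata} module
and built-in @code{idna} codec, which are independent implementations
and not Libidn, so it only suggests what the C functions named above
would produce. The function name @code{to_ascii_after_nfkc} is made up
for this example.

@example
```python
import unicodedata


def to_ascii_after_nfkc(name):
    """Normalize the whole name with NFKC, then apply ToASCII per label."""
    normalized = unicodedata.normalize("NFKC", name)
    # After normalization the dots are ordinary ASCII dots, so the
    # codec splits the name into labels at those positions.
    return normalized.encode("idna").decode("ascii")


# Written with escapes so that this example itself stays pure ASCII.
first = to_ascii_after_nfkc("hi\u248ccom")
second = to_ascii_after_nfkc("r\u00e4ksm\u00f6rg\u00e5s\u2024com")

print(first)
print(second)
```
@end example

Under these assumptions the results are @samp{hi5.com} and
@samp{xn--rksmrgs-5wao1o.com}, each consisting of two labels.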
Alternative workarounds are being considered. Eventually Libidn may
implement a new flag to the @code{idna_*} functions that implements a
recommended way to work around this problem.
@node Copying Information
@appendix Copying Information
@menu
* GNU Free Documentation License:: License for copying this manual.
* GNU LGPL:: License for copying the library.
* GNU GPL:: License for copying the programs.
@end menu
@node GNU Free Documentation License
@appendixsec GNU Free Documentation License
@cindex FDL, GNU Free Documentation License
@include fdl-1.3.texi
@node GNU LGPL
@appendixsec GNU Lesser General Public License
@cindex LGPL, GNU Lesser General Public License
@cindex License, GNU LGPL
@include lgpl-2.1.texi
@node GNU GPL
@appendixsec GNU General Public License
@cindex GPL, GNU General Public License
@cindex License, GNU GPL
@include gpl-3.0.texi
@node Function and Variable Index
@unnumbered Function and Variable Index
@printindex fn
@node Concept Index
@unnumbered Concept Index
@printindex cp
@bye
@c LocalWords: Kerberos Shishi getaddrinfo Slackware Cygwin WorkShop