Paul's Newbie XML FAQ

Last Updated: February, 2007

The following FAQ discusses everything I had to figure out as I was learning XML. I hope you find this list useful! If you do, please let me know.

PART 1: Basic Questions

Q: Where can I learn about XML?

Q: I'm getting lost. Can you give me a quick glossary?

Q: DTD, W3C Schemas - what's the deal? Which one should I use?

Q: Why do I need a DTD or a schema?

Q: What are the performance costs of using XML for my data? With or without validation?

Q: What do you mean by 'valid'? How can a schema tell if my XML document is valid?

Q: What's the deal: Do attributes have to have values? Must they always be surrounded by double-quotes?

Q: I keep getting this stupid error in my XML validator: Invalid xml declaration. Source: '<?xml version="1.0"?>' Line: 1, Pos: 22. But I know that <?xml version="1.0"?> is correct. What's the deal?

Q: The line numbers and positions in my error messages are all whacky. Not only does my XML look fine where it's indicated, but also the character positions are in strange places. What's the deal?

Q: What's a validator? Where can I get a free one?

Q: Do you have any information on parser performance?

Q: I see all of these "http://server/namespace" URLs in my documents, but when I go to the URL there's nothing there! Why not?

Q: Where can I go to get more information on XML namespaces?

PART 2: DTDs

Q: What's a 10-second summary of DTDs?

Q: How do I create a DTD where sub-elements can occur in any order?

Q: But doesn't that get kind of crazy when I have more than two sub-elements?

Q: What's the deal with these "Formal Public Identifiers" (FPIs)? You know, those things used in the <!DOCTYPE PUBLIC> declaration?

Q: But, shouldn't I register my Formal Public Identifier, somehow? How do I do that?

PART 3: W3C Schemas

Q: What's a 10-second summary of W3C Schemas?

Q: Are there some basic rules to follow when writing W3C schemas? You know, to get me started on the right foot?

Q: Why can't I put an <element> directly inside of another <element> or <complexType> tag in my W3C Schema?

Q: I've declared an <attribute> that I want to reuse with <attribute ref=""> but it's not being recognized? What's the problem?

Q: Why is it allowing any child <element> or attribute to be specified? OR, why does it seem to be ignoring my schema?

Q: The validator says that the <choice> (or <sequence> or <all>) tag can not have the minOccurs tag. What's up with that?

Q: How can I specify that the child elements are allowed to occur in any order?

Q: Why is it that my <all minOccurs="0"> tag is actually requiring all of the child <element>s instead of allowing them to be optional?

Q: It keeps saying "Child element is not expected at this point" but I look at my schema and the tag is definitely there. What's the deal?

Q: How do I specify a document with infinitely nested elements? You know, like this:

Q: How would I allow one level of nesting, but not two?

Q: How would I implement an XML document which allows for infinite number of nested children, but where the root is a different <element> ? You know, like this:

Q: How can I have an element which has different children, depending on who its parent is?

Q: Suppose I have a recursively nested element. How can I specify that it should have certain children based on who its parent (or recursive grandparent) is?

Q: How do I specify an attribute that has an enumerated list of possible values? This used to be so easy in the old DTD format!

Q: How can I specify that an attribute can have a list of white-space separated tokens?

Q: I keep getting this error message: "Unique Particle Attribution". During validation against this schema, it says "ambiguity would be created for those two particles". What is it trying to tell me?

Q: Hey, I just realized that my document root element is misspelled. And yet my document is supposedly "valid". How could that be?

Q: When I run my validity check with the W3 validator, I get these strange warning messages: "Warning: allowing {**mynamespace**}:myattr as child because it matched wildcard(##any)" Why?

PART 4: The Xerces SAX Parsers

Q: Why can't my compiler seem to find the most basic SAX classes, like DefaultHandler? I know I'm including the correct include files.

Q: How do I remove the duplicate attributes reported by the xerces SAX2 parser?

Q: Why, when I use Xerces C++, does my program crash on exit?

Q: Do you have an example of parsing an XML document from memory?

Q: How can I get my Xerces SAX-1 Parser to recognize namespaces or my W3C XML Schema? It seems like it only understands DTDs.

PART 1: Basic Questions

Q: Where can I learn about XML?

Buy a book (I like "Beginning XML, 3rd Edition" from Wrox / Wiley Publishing), or go to: http://www.xmlfiles.com/ .

I also like http://www.xml.com/pub/a/2001/06/06/schemasimple.html and http://www.rpbourret.com/xml/NamespacesFAQ.htm .

Do not just read the XML standards. Forget it! Life is too short to try and decipher them.

Q: I'm getting lost. Can you give me a quick glossary?

Q: DTD, W3C Schemas - what's the deal? Which one should I use?

Originally, I was very strongly recommending W3C schemas, but now I'm not so sure. I think there are still lots of cases where simple DTDs are enough. And further, there are probably lots of cases where you want to do all of your validation inside your own program.

All of these things, DTDs, Schemas, and RELAX-NG are methods for determining if your document is "valid" - meaning that it has the right <element>s, in the right order, with the right attributes, and the correct types of data in them.

DTDs are much simpler to understand and learn. It is easier to create a simple validator with a DTD and it will check most of what's most important in your document.

W3C schemas are vastly more flexible and are your only choice if you want validation with namespaces and data-type checking. The price is that creating the schema is more difficult, and it will take (somewhat) more time to learn them and to test your schemas. Further, DTDs are anywhere from 13-28% faster than W3C Schemas (see timing below).

Fortunately, both DTDs and Schemas are supported by the Xerces XML parser (which, as far as I can tell, is becoming the de facto standard XML parsing library).

However, no matter which schema language you end up using, you still have to use the DTD format for specifying entities (you know, those &amp; substitutions). If you need entity substitutions other than the defaults ( &amp; &lt; &gt; &apos; &quot; and the numeric forms &#nn; or &#xnn;), you will need to specify a DTD somewhere.
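Here is a small sketch of that entity rule. It uses Python's standard-library ElementTree rather than Xerces (purely my choice for a short, runnable example), and the entity name "co" and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The five predefined entities and numeric character references
# need no DTD at all.
builtin = ET.fromstring("<r>&amp; &lt; &gt; &apos; &quot; &#65; &#x42;</r>")
builtin_text = builtin.text

# Any other entity must be declared in a DTD. Here the (hypothetical)
# entity "co" is declared in an internal DTD subset.
declared = ET.fromstring(
    '<!DOCTYPE r [ <!ENTITY co "Paul\'s Company"> ]>'
    "<r>&co;</r>"
)
declared_text = declared.text

# Without a declaration, the same reference is an error.
try:
    ET.fromstring("<r>&co;</r>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(builtin_text)
print(declared_text)
print(undeclared_ok)
```

The DTD here is only there to carry the entity declaration; nothing in it describes or validates the document's elements.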

Q: Why do I need a DTD or a schema?

You don't. Who said you did? Just make sure your XML is well formed, and then you can make sure it is valid inside your own application's source code. And then, of course, you'll need to document your XML format for others who might want to use it.

If, however, you want to lean on a schema to let the XML parser check its validity for you, and if you want to use the schema as a way of communicating your XML format to others, well, that's okay too.
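To make the "do it in your own code" option concrete, here is a minimal sketch in Python (standard library only). The <order> format, its "id" attribute, and the "qty" rule are all invented for the example; the point is only the two steps: let the parser check well-formedness, then apply your own rules.

```python
import xml.etree.ElementTree as ET

def check_order(xml_text):
    """Return a list of problems found in a (hypothetical) <order> document."""
    try:
        root = ET.fromstring(xml_text)       # step 1: well-formedness
    except ET.ParseError as err:
        return [f"not well formed: {err}"]

    problems = []                            # step 2: our own validity rules
    if root.tag != "order":
        problems.append(f"unexpected root element <{root.tag}>")
    if "id" not in root.attrib:
        problems.append("missing required attribute 'id'")
    for item in root.findall("item"):
        qty = item.get("qty", "")
        if not qty.isdigit():
            problems.append(f"item qty is not a whole number: {qty!r}")
    return problems

good = check_order('<order id="7"><item qty="2"/></order>')
bad = check_order('<order><item qty="two"/></order>')
broken = check_order('<order id="7">')
```

Notice that every rule lives in code, so nobody else can read your format from it. That is the documentation burden mentioned above.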

Q: What are the performance costs of using XML for my data? With or without validation?

As far as I can tell (from researching the web and from my own tests; see the actual benchmark data below), parsing data with XML is about 2-3 times slower than parsing a simpler tagging scheme.

Further, turning on validation with a DTD makes parsing roughly another 70% slower, and validating with a W3C schema roughly another 135% slower, compared to plain well-formedness checking.

So, for example, suppose you have a simple tagging scheme that can parse 4 MB of data in about 0.60 seconds. If the data is in XML, a good parser will parse it in about 1.50 seconds. XML with DTD validation should come in around 2.5 seconds. XML with W3C Schema validation should be around 3.5 seconds.

But so what? The simple fact of the world today is that programmer time and effort is more expensive than machine power. Don't use these timing tests alone to decide if you're going to use XML or not. Consider that XML is a widely known standard, many programmers and consultants know how to use it, and that validation is a way of catching and eliminating errors in your data earlier than might be otherwise possible. It is very likely that, on balance, the performance hit for using XML will be a small price to pay for all of these other advantages.

Q: What do you mean by 'valid'? How can a schema tell if my XML document is valid?

It doesn't, not really. After all, your XML data may have a person's name, but there's no way it will be able to tell if the person actually existed, right?

"Well formed" just means that all of your begin-tags have end-tags, that your <element>s are properly nested, that you're not missing any angle-brackets, and all your attributes have values.

"Valid" (according to the W3C), means that your XML document has only the allowed elements, the allowed attributes, that no element has the wrong attributes, that the attribute values are (generally) correct, that the element content is (generally) correct, and that the correct elements exist within other elements.

Make sense?

Q: What's the deal: Do attributes have to have values? Must they always be surrounded by double-quotes?

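Yes, they must have values, and yes, the values must be quoted. Double quotes are the usual choice, but XML also accepts single quotes.

This is illegal XML:  <myname  IsStandard>
This is also illegal:  <myname  IsStandard=true>

It MUST be:  <myname  IsStandard="true">  (or, with single quotes:  <myname  IsStandard='true'> )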
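If you want to see a parser enforce this, here is a quick sketch using Python's standard-library ElementTree (my choice for the example; any conforming parser should agree). The elements are written as empty tags so that each sample is a complete document. The single-quoted form is included because XML accepts either quote character.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "no value":      "<myname IsStandard/>",
    "unquoted":      "<myname IsStandard=true/>",
    "double quotes": '<myname IsStandard="true"/>',
    "single quotes": "<myname IsStandard='true'/>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
print(results)
```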

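Q: I keep getting this stupid error in my XML validator:

Invalid xml declaration. Source: '<?xml version="1.0"?>' Line: 1, Pos: 22

I know that <?xml version="1.0"?> is correct. What's the deal??

It is correct at the top of an XML document, but not at the top of a DTD. Take it out of your DTD. (If you really want a declaration at the top of an external DTD, it has to be a "text declaration", which must include an encoding. For example: <?xml version="1.0" encoding="UTF-8"?> .)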

Q: The line numbers and positions in my error messages are all whacky. Not only does my XML look fine where it's indicated, but also the character positions are in strange places. What's the deal?

Maybe you are using an external DTD, and there are errors in this external DTD. Check the line numbers and character positions inside the external DTD to see if they make more sense there. I've found this to be a problem with "Cooktop" and "Topology Schematron Validator", although it may occur with anything that uses the Microsoft XML validator 4.0 (MSXML 4.0).

Q: What's a validator? Where can I get a free one?

A validator is something that will check to see if your XML document is both "well-formed" and valid (according to your schema or DTD). Validators are built into most XML parsers, but there are some freely available tools.

1. Online: http://www.validome.org/xml/ . Has excellent error reporting and handles W3C schemas, no problem. The only issue is that your schema (or DTD file, if you're using an external DTD) must be available on-line through some URL. This means on a web-site somewhere so that Validome can access it across the net.

But once your schema is available online, you can upload your XML document to Validome, and it will check it for validity.

2. http://www.w3.org/2001/03/webdata/xsv also seems to work well and is the one I use the most often. The error reporting can be cryptic, but it does allow you to show warnings (including when it is using lax evaluation and when it lets attributes go just because it can). This is often the only way to tell if it considers your document to be valid because it satisfies all of the constraints, or if it is just skipping things.

3. http://www.topologi.com/products/validator/index.html for the "Topologi Schematron Validator". Unfortunately, the error reporting is not that great (see previous two questions), but it does run directly on your computer without needing an internet connection.

Not recommended: Cooktop. 1) It doesn't handle W3C schemas, 2) it rewrites your XML input files (making them "prettier", without asking you first), and 3) it causes crashes in Windows Explorer on Windows 2000.

Q: Do you have any information on parser performance?

Here are the tests I ran on evaluating the Xerces C++ parser performance. I would expect that Java performance would be slower in every case.

Test parameters:

  1. 3728 bytes XML input file (fair amount of nesting)
  2. When validation is on, all checking is strict (i.e. not lax)
  3. Parsed the file 1000 times = 3,728,000 total bytes processed
  4. The XML file is loaded into memory before it is parsed, so no file-io costs (except, perhaps, for the I/O costs of loading the schema) are considered.
  5. The Default handler is empty (i.e. stubs defined but no actual processing code in the stubs)
  6. MS Windows 2000 Professional, SP 4
  7. Laptop with 256 MB of RAM and a 1.328 GHz CPU
  8. Ran each test 5 times and then took the fastest result. Generally, times would vary by less than 5% across all 5 tests.
  9. Schema files accessed from local hard drive (i.e. not from over the internet)
  10. From SAX2Count: 42 elems, 58 attrs, 556 spaces, 975 chars

Xerces C++ Parser Performance:

Time(sec) Schema Parser + Options
18.48     XSD    SAX2, Validating, Namespaces, No Schema Caching
10.65     DTD    SAX2, Validating, Namespaces, No Schema Caching
 3.37     XSD    SAX2, Validating, Namespaces, Schema Caching
 2.94     DTD    SAX2, Validating, Namespaces, Schema Caching
 2.68     XSD    SAX2, No-Validating, Namespaces, Schema Caching
 2.72     DTD    SAX2, No-Validating, Namespaces, Schema Caching
 1.95     XSD    SAX2, No-Validating, No-Namespaces, Schema Caching
 1.84     DTD    SAX2, No-Validating, No-Namespaces, Schema Caching
 1.49     XML    SAX2, (n/a)

18.97     XSD    SAX1, Validating, Namespaces, No Schema Caching
10.03     DTD    SAX1, Validating, Namespaces, No Schema Caching
 3.29     XSD    SAX1, Validating, Namespaces, Schema Caching
 2.36     DTD    SAX1, Validating, Namespaces, Schema Caching
 2.64     XSD    SAX1, No-Validating, Namespaces, Schema Caching
 2.10     DTD    SAX1, No-Validating, Namespaces, Schema Caching
 2.63     XSD    SAX1, No-Validating, No-Namespaces, Schema Caching
 1.84     DTD    SAX1, No-Validating, No-Namespaces, Schema Caching
 1.40     XML    SAX1, (n/a)

 0.56     TOK    (n/a)

Note 1: "XML" refers to an XML Well-formedness checking scanner only. This was implemented by setting the following option:

SAX2:   parser->setProperty(XMLUni::fgXercesScannerName,
                            L"WFXMLScanner");

SAX1:   parser->useScanner(L"WFXMLScanner");

Note 2: "TOK" refers to a simple token extractor which I wrote myself in C++. The token extractor extracts all punctuation, quoted strings, and words as tokens while excluding all whitespace. Note that it does *not* perform any well-formedness checking. In fact, it does no checking whatsoever (other than to ensure that quoted strings are terminated). This test was run extracting all of the tokens from the same XML file that was being used for all of the other tests.

The purpose of including the "TOK" test was to demonstrate the overhead of using the XML infrastructure as compared to simpler data parsing methods (for example, tab-delimited files).
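For anyone who wants to check the percentages quoted earlier in this FAQ against the table, here is the arithmetic as a short Python sketch. The numbers are copied from the "Validating, Namespaces, Schema Caching" rows and the "XML" rows above. Which rows the quoted figures were originally taken from is my assumption; these are simply the rows that reproduce them.

```python
# Times in seconds, copied from the table above (schema caching on).
sax2 = {"xml": 1.49, "dtd": 2.94, "xsd": 3.37}
sax1 = {"xml": 1.40, "dtd": 2.36, "xsd": 3.29}

def pct_faster(fast, slow):
    """How much less time `fast` takes than `slow`, as a percentage."""
    return round((slow - fast) / slow * 100)

def pct_slower(slow, base):
    """How much more time `slow` takes than `base`, as a percentage."""
    return round((slow - base) / base * 100)

# DTD validation compared to W3C Schema validation (SAX2, then SAX1).
dtd_vs_xsd = (pct_faster(sax2["dtd"], sax2["xsd"]),
              pct_faster(sax1["dtd"], sax1["xsd"]))

# Cost of validation on top of well-formedness checking (SAX1 rows).
dtd_cost = pct_slower(sax1["dtd"], sax1["xml"])
xsd_cost = pct_slower(sax1["xsd"], sax1["xml"])

print(dtd_vs_xsd, dtd_cost, xsd_cost)
```

The SAX2 rows give a larger validation cost than the SAX1 rows, so treat all of these as rough guides, not constants.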

Q: I see all of these "http://server/namespace" URLs in my documents, but when I go to the URL there's nothing there! Why not?

Because these are not web pages you are meant to visit. They are namespace names: they look like URLs, but they are used only as identifiers, and nothing has to exist at that address. The only thing that a namespace name has to be is *unique*, and so people use their web site addresses to prefix their namespaces to guarantee that their names will be unique across the web.
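A quick way to convince yourself that a namespace is only a string: parse a namespaced document and look at the element names the parser hands back. This sketch uses Python's standard-library ElementTree, which writes names as {namespace}localname. The document and the prefixes "t" and "x" are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<t:system xmlns:t="http://your_namespace_here">
  <t:cluster/>
</t:system>"""

root = ET.fromstring(doc)      # parsing does not fetch the "URL"
root_tag = root.tag
child_tag = root[0].tag

# A different prefix bound to the same namespace name gives the same name.
same = ET.fromstring('<x:system xmlns:x="http://your_namespace_here"/>')
same_name = (same.tag == root_tag)

# A namespace name that differs by one character is a different namespace.
other = ET.fromstring('<t:system xmlns:t="http://your_namespace_here/"/>')
other_name = (other.tag == root_tag)

print(root_tag, child_tag, same_name, other_name)
```

The comparison is character by character, which is why a stray space or trailing slash in a namespace declaration causes so much grief.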

Q: Where can I go to get more information on XML namespaces?

Here: http://www.rpbourret.com/xml/NamespacesFAQ.htm

This is the best FAQ on XML namespaces.

PART 2: DTDs

Q: What's a 10-second summary of DTDs?

DTDs (Document Type Definitions) check your document for "validity" (see the question on 'valid' above). A DTD contains the following pieces:

  • <!DOCTYPE> - Specifies the actual DTD (embedded), or where it can be found.
  • <!ELEMENT> - Specifies what elements are allowed in your XML file, and what those elements are allowed to contain.
  • <!ATTLIST> - Specifies the list of attributes which an element can contain.
  • <!ENTITY> - Specifies what should be substituted whenever an &entity; occurs.

    DTDs are simple and easy to understand, but also limited in what they can and cannot specify as valid. Probably you should be using schemas instead.

    Q: How do I create a DTD where sub-elements can occur in any order?

    You have to specify all the different possible orders. For example:

    <!ELEMENT myelem ((subelem1?, subelem2?) | (subelem2?, subelem1?)) >

    Q: But doesn't that get kind of crazy when I have more than two sub-elements?

    Yup. If this is really a problem for you, you'll need to use W3C Schemas instead of DTDs.
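To put a number on "kind of crazy": spelling out every order means one alternative per permutation, so n sub-elements need n! alternatives. The Python sketch below generates the naive content model and counts the alternatives. The helper name any_order_model is made up. The output is meant as an illustration of the growth, not as a DTD to paste in: as far as I understand the XML rules on deterministic content models, a strict parser may object to a model like this once several alternatives start with the same element.

```python
from itertools import permutations
from math import factorial

def any_order_model(names):
    """Spell out every ordering of `names` as one DTD content model."""
    orders = ["(" + ", ".join(p) + ")" for p in permutations(names)]
    return "(" + " | ".join(orders) + ")"

two = any_order_model(["subelem1", "subelem2"])
three = any_order_model(["a", "b", "c"])

# Number of alternatives needed for 2 to 6 sub-elements.
counts = {n: factorial(n) for n in range(2, 7)}

print(two)
print(three.count("|") + 1)
print(counts)
```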

    Q: What's the deal with these "Formal Public Identifiers" (FPIs)? You know, those things used in the <!DOCTYPE PUBLIC> declaration:

     <!DOCTYPE myelement PUBLIC "-//Paul's Company//My Great Data//EN//1.0" "myelement.dtd" >

    Formal Public Identifiers are an SGML throwback. Basically, I don't think anyone could agree what to put there, so they defaulted to what SGML specified. Notice how all public identifiers start with "-//" ? This means that they are all *unregistered*.

    If you want to make something up and invent your own FPI, go ahead. The basic format is:

      -//Organization Name//Data Classification Name//Language//Version

    But really, just use <!DOCTYPE SYSTEM> for all of your external DTDs.
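If you ever need to pick an FPI apart, the layout shown above splits cleanly on "//". This is a small Python sketch under that assumption. The function name and the dictionary keys are mine, and it only handles identifiers with exactly the five parts shown. The "registered" flag reflects the SGML convention that registered identifiers start with "+//" instead of "-//".

```python
def split_fpi(fpi):
    """Split an FPI laid out as in the format above into named parts."""
    prefix, owner, classification, language, version = fpi.split("//")
    return {
        "registered": prefix == "+",   # "-" means unregistered
        "owner": owner,
        "classification": classification,
        "language": language,
        "version": version,
    }

parts = split_fpi("-//Paul's Company//My Great Data//EN//1.0")
print(parts)
```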

    Q: But, shouldn't I register my Formal Public Identifier, somehow? How do I do that?

    Don't bother. As far as I can tell, there is no way to register your FPI. ANSI was supposed to do this, but it doesn't look like it has happened yet (or will ever happen).

    PART 3: W3C Schemas

    Q: What's a 10-second summary of W3C Schemas?

    Like DTDs, W3C Schemas (or just "schemas") check your XML document for "validity" (see answer above). The following is a quick summary of what a schema can contain:

    <schema> - The root element of the schema. Also specifies the namespace for your XML document (the targetNamespace).

    <element> - Specify the elements allowed in your document.

    <group> - Creates a group of elements which you can reuse elsewhere in your schema. Your <group> element must contain a content model, such as <all>, <choice>, or <sequence>, as a direct child. You can then use your <group> inside of other elements' <complexType> tags.

    <attributeGroup> - Specifies a group of attributes which can be reused elsewhere in multiple different elements within the schema. This is the preferred method for reusing attributes (even if the group contains only a single attribute).

    Specifying complex data values (those which contain nested <element>s):

    <complexType> - Describes the content that your elements are allowed to have, but only when they contain other elements. For example, what nested elements are allowed, in what order they are allowed to be, are they optional, etc. Also, for some strange reason, the <complexType> also specifies the attributes that are allowed on the <element>s which use it.

    <sequence> - Within a <complexType>, allows for an ordered sequence of sub-elements.

    <choice> - Within a <complexType>, allows for any of an (unordered) set of selected sub-elements to occur, possibly multiple times.

    <all> - Within a <complexType>, allows for any (or all) of the sub-elements to occur. The difference between <all> and <choice> is that each different sub-element in <all> is allowed to only occur, at most, one time.

    <any> - Allows your XML element to contain any type of sub-element from specified namespaces. Is used within a <complexType>.

    <anyAttribute> - Allows for an element to contain any type of attribute, from specified namespaces. Used within a <complexType> element.

    <attribute> - Specifies what attributes your elements are allowed to have. Used within a <complexType>.

    Specifying simple data values (those which do not contain nested <element>s):

    <simpleType> - Describes the content that either your elements or your attributes are allowed to have, but only if they do *not* contain other elements. For example, are they numbers or text, what kinds of numbers or text, what values are allowed, etc.

    <restriction> - Specifies what types of values the data within an attribute or an element is allowed to have. This can include a range of values, a list of values, the length of strings, or an enumerated list of possible values.

    <list> - Allows a <simpleType> to contain a list of data values.

    Okay, so maybe that was more like a 10-minute summary, rather than a 10-second summary.

    W3C schemas are better than DTDs because they are aware of namespaces, and they are more powerful. They also seem to be pretty well supported by parsers and editors. However, they are more complex and will require a (somewhat) steeper learning curve.

    Further, W3C schemas are themselves XML documents. Isn't that cool? You use an XML document to describe how your own XML document is valid? (programmers love that sort of thing)

    Q: Are there some basic rules to follow when writing W3C schemas? You know, to get me started on the right foot?

    Sure, try these on for size:

    1. Use the following schema format:

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
    

    2. In your XML document, always use the following (general) format for your root tag:

    <myRootTag xmlns="http://your_namespace_here"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://your_namespace_here your_schema_doc.xsd">
    

    3. Always use elementFormDefault="qualified" in your <schema> definition. This will make elements declared inside of other elements work better. See http://www.xml.com/pub/a/2001/06/06/schemasimple.html?page=2 for more details.

    Alternatively, if you can't use elementFormDefault="qualified" (meaning, you want to allow <elements> in your schema which don't have a namespace specified), then don't have any <element>s defined inside of other <element>s. Instead define them all as children of <schema> and then use <element ref=""> to refer to nested ones.

    4. Make sure your <element> tags have <complexType> as a direct child (for the cases where you want to specify the child elements).

    5. Never use <element name="blah"/>. Always specify a type for your <element>s, either with type="", or with a <simpleType> or with a <complexType>. If you need to refer to another element, use <element ref="t:other"/> .

    6. Whenever you want to refer to anything specified elsewhere in your schema (for example <group>, <attributeGroup>, or <element>, using the "ref=" attribute), be sure to include the namespace prefix for your target namespace. For example:

          <group ref="t:outerClusters">
    

    7. Never put <attribute> as a child of <schema> (i.e. a global attribute). Always enclose it in an <attributeGroup>. This will make the name-space prefixing work out the way it should (otherwise you may have to prefix all your attributes).

    8. Make sure that all of your <group> tags have a content model (either: <all>, <sequence>, or <choice>) as a child element. Not <complexType> or <element>.

    9. When referring to a <group>, don't forget to include minOccurs and/or maxOccurs if necessary (since these can not be specified within the <group>, they must be specified when referring to the <group>).
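One detail of rule 2 that is easy to miss: the value of xsi:schemaLocation is a whitespace-separated list of pairs, each pair being a namespace followed by the location of the schema for it. The Python sketch below (standard library only; it reads the attribute but does not validate anything) pulls the pairs out of the root tag from rule 2 and checks that one of them names the root element's own namespace.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

doc = """<myRootTag xmlns="http://your_namespace_here"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://your_namespace_here your_schema_doc.xsd"/>"""

root = ET.fromstring(doc)

# Split the attribute into (namespace, location) pairs.
tokens = root.get(f"{{{XSI}}}schemaLocation").split()
pairs = list(zip(tokens[0::2], tokens[1::2]))

# ElementTree names look like "{namespace}localname".
root_namespace = root.tag[1:].split("}")[0]
hint_matches_root = any(ns == root_namespace for ns, _ in pairs)

print(pairs, hint_matches_root)
```

If the namespace in the pair, the xmlns on the root tag, and the targetNamespace in the schema are not the same string, character for character, the validator has nothing to match up.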

    Q: Why can't I put an <element> directly inside of another <element> or <complexType> tag in my W3C Schema?

    Because you must specify one of <choice>, <sequence>, or <all> inside your <complexType>. These are the "content models", which specify how the children are supposed to be chosen.

    - - - - - - illegal example: - - - - - -

    <element name="myelement">
      <element name="A" type="string"/>
      <element name="B" type="string"/>
      <element name="C" type="string"/>
    </element>

    - - - - - - another illegal example: - - - - - -

    <element name="myelement">
      <complexType>
        <element name="A" type="string"/>
        <element name="B" type="string"/>
        <element name="C" type="string"/>
      </complexType>
    </element>

    - - - - - - a legal example: - - - - - -

    <element name="myelement">
      <complexType>
        <all>
          <element name="A" type="string"/>
          <element name="B" type="string"/>
          <element name="C" type="string"/>
        </all>
      </complexType>
    </element>

    Q: I've declared an <attribute> that I want to reuse with <attribute ref=""> but it's not being recognized? What's the problem?

    First, read the page: http://www.xml.com/pub/a/2001/06/06/schemasimple.html#avoid_complex (near the bottom) which explains the problem in more detail.

    Basically, global attributes (attributes defined as a direct child of the <schema> element in your schema document) don't work in the ways that you expect, with namespaces. So, surround your attribute with an "attributeGroup" and then reference the attributeGroup instead. Like this:

    <schema ... >
      <attributeGroup name="myattributegroup">
          <attribute ... > blah, blah, blah </attribute>
      </attributeGroup>
    </schema>
    

    And then reference the attribute with <attributeGroup ref="t:myattributegroup"> whenever you need the attribute. This seems to work much better, and doesn't have any funky name space requirements inside your XML document.

    Q: Why is it allowing any child <element> or attribute to be specified? OR, why does it seem to be ignoring my schema?

    Do you have any elements specified like this: <element name="myname"/> ?

    Elements declared without a type (like above) will allow any attribute and any child <element>s at all. *Always* specify a type for your elements. It can be either with type="string" (or similar), or using a nested <complexType> or a nested <simpleType> element.

    Q: The validator says that the <choice> (or <sequence> or <all>) tag can not have the minOccurs tag. What's up with that?

    Is your <choice> or <sequence> the child of a group tag? If so, then this is correct. For example, the following is not allowed:

    - - - - - - non-working example: - - - - - -

    <group name="mygroup">
      <choice minOccurs="0" maxOccurs="unbounded">
        . . .
      </choice>
    </group>

    later:

    <group ref="t:mygroup">

    Instead, do the following:

    - - - - - - working example: - - - - - -

    <group name="mygroup">
      <choice>
        . . .
      </choice>
    </group>

    later:

    <group ref="t:mygroup" minOccurs="0" maxOccurs="unbounded">

    Q: How can I specify that the child elements are allowed to occur in any order?

    Use the <all> tag in your W3C schema. For example:

      <element name="myelement">
        <complexType>
          <all>
            <element name="A" type="string"/>
            <element name="B" type="string"/>
            <element name="C" type="string"/>
          </all>
        </complexType>
      </element>

    In the above example, you can specify the elements <A>, <B>, and <C> in any order within the <myelement> tag. But note, they must all occur. See below when you also want to make them optional.

    Q: Why is it that my <all minOccurs="0"> tag is actually requiring all of the child <element>s instead of allowing them to be optional?

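    The minOccurs attribute should be placed on the sub-elements, and not on the <all> tag itself. On <all>, minOccurs="0" only makes the group as a whole optional; it does not make each child optional. For example: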

    - - - - - - non-working example: - - - - - -

      <all minOccurs="0" maxOccurs="1">
        <element name="description" type="string"/>
        <element name="args" type="string"/>
        <element name="implementation" type="string"/>
        <element name="defaultval" type="string"/>
      </all>
    

    - - - - - - working example: - - - - - -

      <all>
        <element name="description" type="string" minOccurs="0" maxOccurs="1"/>
        <element name="args" type="string" minOccurs="0" maxOccurs="1"/>
        <element name="implementation" type="string" minOccurs="0" maxOccurs="1"/>
        <element name="defaultval" type="string" minOccurs="0" maxOccurs="1"/>
      </all>

    Q: It keeps saying "Child element is not expected at this point" but I look at my schema and the tag is definitely there. What's the deal?

    Is it possible that you're missing a maxOccurs somewhere? It may be that the validator is not complaining about a specific <element>, but rather about the quantity of <element>s allowed (or, at least, this happened to me a lot).

    Q: How do I specify a document with infinitely nested elements? You know, like this:

    <cluster . . . >
      <A/>
      <B/>
      <cluster>
        <A/>
        <B/>
        <cluster>
          <A/>
          <B/>
        </cluster>
      </cluster>
    </cluster>

    This seems to work:

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
      <element name="cluster">
        <complexType>
          <choice minOccurs="0" maxOccurs="unbounded">
            <element name="A" type="string"/>
            <element name="B" type="string"/>
            <element ref="t:cluster"/>
          </choice>
        </complexType>
      </element>
    </schema>

    Q: How would I allow one level of nesting, but not two?

    Try something like this (notice how the innermost "cluster" element has its type fully defined):

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
      <element name="cluster">
        <complexType>
          <choice minOccurs="0" maxOccurs="unbounded">
            <element name="A" type="string"/>
            <element name="B" type="string"/>
            <element name="cluster">
              <complexType>
                <choice minOccurs="0" maxOccurs="unbounded">
                  <element name="A" type="string"/>
                  <element name="B" type="string"/>
                </choice>
              </complexType>
            </element>
          </choice>
        </complexType>
      </element>
    </schema>

    - - - - - - This schema will allow: - - - - -

    <cluster . . . >
      <A/>
      <B/>
      <cluster>
        <A/>
        <B/>
      </cluster>
    </cluster>

    - - - - - - but it will *not* allow: - - - - -

    <cluster . . . >
      <A/>
      <B/>
      <cluster>
        <A/>
        <B/>
        <cluster>
          <A/>
          <B/>
        </cluster>
      </cluster>
    </cluster>
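If you want to double-check documents against the "one level but not two" rule without running a schema validator, the rule is easy to express directly in code. This is a sketch in Python (standard library only). It is not a schema validator; it only measures how deeply <cluster> elements are nested. The namespace declarations are left off the two sample documents to keep them short.

```python
import xml.etree.ElementTree as ET

def cluster_depth(elem):
    """Number of <cluster> levels nested below `elem` (0 if none)."""
    depths = [1 + cluster_depth(child)
              for child in elem if child.tag == "cluster"]
    return max(depths, default=0)

# The two sample documents shown above.
allowed = ET.fromstring(
    "<cluster><A/><B/><cluster><A/><B/></cluster></cluster>")
rejected = ET.fromstring(
    "<cluster><A/><B/><cluster><A/><B/>"
    "<cluster><A/><B/></cluster></cluster></cluster>")

allowed_depth = cluster_depth(allowed)
rejected_depth = cluster_depth(rejected)
print(allowed_depth <= 1, rejected_depth <= 1)
```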

    Q: How would I implement an XML document which allows for an infinite number of nested children, but where the root is a different <element>? You know, like this:

    <system . . .>
      <A/>
      <B/>
      <cluster>
        <A/>
        <B/>
        <cluster>
          <A/>
          <B/>
        </cluster>
      </cluster>
    </system>

    Try using groups. Like so:

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
      <group name="mygroup">
        <choice>
          <element name="A" type="string"/>
          <element name="B" type="string"/>
          <element name="cluster">
            <complexType>
              <group ref="t:mygroup" minOccurs="0" maxOccurs="unbounded"/>
            </complexType>
          </element>
        </choice>
      </group>
    

      <element name="system">
        <complexType>
          <group ref="t:mygroup" minOccurs="0" maxOccurs="unbounded"/>
        </complexType>
      </element>
    </schema>

    Q: How can I have an element which has different children, depending on who its parent is?

    For example, in the following, <cluster> is allowed to have children <A>, <B>, and <C>, when it occurs inside of <system>, but only <X> and <Y> when it occurs inside of <C>. Like this:

    <system . . . >
      <cluster>
        <A/>
        <B/>
        <C>
          <cluster>
            <X/>
            <Y/>
          </cluster>
        </C>
      </cluster>
    </system>

    In order to implement this type of XML format, you will need to specify a nested <element> for cluster inside of the <system>/<C> element. Like this:

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
      <element name="system">
        <complexType>
          <sequence>
            <element name="cluster">
              <complexType>
                <choice minOccurs="0" maxOccurs="unbounded">
                  <element name="A" type="string"/>
                  <element name="B" type="string"/>
                  <element name="C">
                    <complexType>
                      <sequence>
                        <element name="cluster">
                          <complexType>
                            <choice minOccurs="0" maxOccurs="unbounded">
                              <element name="X" type="string"/>
                              <element name="Y" type="string"/>
                            </choice>
                          </complexType>
                        </element>
                      </sequence>
                    </complexType>
                  </element>
                </choice>
              </complexType>
            </element>
          </sequence>
        </complexType>
      </element>
    </schema>

    Q: Suppose I have a recursively nested element. How can I specify that it should have certain children based on who its parent (or recursive grandparent) is?

    For example, suppose I have a <cluster> element which allows <A>, <B>, and <C> if it is underneath a <system> element, but only <X> and <Y> if it is underneath a <C> element?

    This situation looks like this:

    <system xmlns="http://your_namespace_here"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://your_namespace_here cluster.xsd">
      <cluster>
        <A/>
        <B/>
        <cluster>
          <A/>
          <B/>
          <C>
            <cluster>
              <X/>
              <Y/>
              <cluster>
                <X/>
                <Y/>
              </cluster>
            </cluster>
          </C>
        </cluster>
      </cluster>
    </system>

    We can handle this with the <group> tag. Here's a schema that does it:

    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">
      <group name="outerClusters">
        <choice>
          <element name="A" type="string"/>
          <element name="B" type="string"/>
          <element name="C">
            <complexType>
              <group ref="t:innerClusters" 
                     minOccurs="0" maxOccurs="unbounded"/>
            </complexType>
          </element>
          <element name="cluster">
            <complexType>
              <group ref="t:outerClusters" 
                     minOccurs="0" maxOccurs="unbounded"/>
            </complexType>
          </element>
        </choice>
      </group>

      <group name="innerClusters">
        <choice>
          <element name="X" type="string"/>
          <element name="Y" type="string"/>
          <element name="cluster">
            <complexType>
              <group ref="t:innerClusters"
                     minOccurs="0" maxOccurs="unbounded"/>
            </complexType>
          </element>
        </choice>
      </group>

      <element name="system">
        <complexType>
          <group ref="t:outerClusters" minOccurs="0" maxOccurs="unbounded"/>
        </complexType>
      </element>
    </schema>

    Q: How do I specify an attribute that has an enumerated list of possible values? This used to be so easy in the old DTD format!

    Yes, I know. It's more complicated now. Here's the method:

    Use the <simpleType> tag with the <enumeration> tag in your W3C schema. For example:

        <attribute name="visibleTo" default="class">
          <simpleType>
            <restriction base="string">
              <enumeration value="user"/>
              <enumeration value="package"/>
              <enumeration value="class"/>
              <enumeration value="group"/>
            </restriction>
          </simpleType>
        </attribute>

    On the plus side, you could enclose this attribute in an <attributeGroup> and then re-use it in many different <element>s. Or, you could declare the simple type as a "global" type (i.e. right underneath the <schema> tag) and then refer to it with "t:typename" whenever you need it.
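
    Here is a minimal sketch of both ideas combined. The type name visibilityType, the group name visibilityAttrs, and the two elements <handler> and <method> are made up for illustration; substitute your own names and namespace:

    ```xml
    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:t="http://your_namespace_here"
            elementFormDefault="qualified">

      <!-- Global simple type: declared once, right underneath <schema> -->
      <simpleType name="visibilityType">
        <restriction base="string">
          <enumeration value="user"/>
          <enumeration value="package"/>
          <enumeration value="class"/>
          <enumeration value="group"/>
        </restriction>
      </simpleType>

      <!-- Attribute group: wraps the attribute so it can be re-used -->
      <attributeGroup name="visibilityAttrs">
        <attribute name="visibleTo" type="t:visibilityType" default="class"/>
      </attributeGroup>

      <!-- Two hypothetical elements sharing the same attribute -->
      <element name="handler">
        <complexType>
          <attributeGroup ref="t:visibilityAttrs"/>
        </complexType>
      </element>

      <element name="method">
        <complexType>
          <attributeGroup ref="t:visibilityAttrs"/>
        </complexType>
      </element>
    </schema>
    ```

    If the list of allowed values ever changes, you only have to edit the one global type.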

    Q: How can I specify that an attribute can have a list of white-space separated tokens?

    That's easy. Use:

    <attribute name="name" type="NMTOKENS"/>

    (note that NMTOKENS, the type specified in the old DTD format, is still available in the W3C schema format)
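
    As a small illustration, assuming a hypothetical <widget> element (the element name and the token values are made up), the declaration and a matching piece of an instance document might look like this:

    ```xml
    <!-- In the schema: a hypothetical element carrying the NMTOKENS attribute -->
    <element name="widget">
      <complexType>
        <attribute name="name" type="NMTOKENS"/>
      </complexType>
    </element>

    <!-- In the instance document: three tokens separated by white space -->
    <widget name="alpha beta gamma"/>
    ```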

    Q: I keep getting this error message: "Unique Particle Attribution". During validation against this schema, it says "ambiguity would be created for those two particles". What is it trying to tell me?

    This happens when the validator would need some fancy look-ahead (or would need to look down the tree of <element>s) in order to decide which declaration in the schema applies to the element it has just reached.

    Example 1: Suppose I wanted to parse the following situation:

      <testcollection>
        <test>
          <A/>  <!-- cannot include both <A> and <B> -->
          <A/>
        </test>
        <test>
          <B/>  <!-- cannot include both <A> and <B> -->
          <B/>
        </test>
      </testcollection>

    A poor way to implement this situation might be like this:

      <element name="testcollection">
        <complexType>
          <choice>
            <element name="test">
              <complexType>
                <sequence>
                  <element name="A" minOccurs="1" maxOccurs="unbounded"
                           type="string"/>
                </sequence>
              </complexType>
            </element>
            <element name="test">
              <complexType>
                <sequence>
                  <element name="B" minOccurs="1" maxOccurs="unbounded"
                           type="string"/>
                </sequence>
              </complexType>
            </element>
          </choice>
        </complexType>
      </element>
    

    Note that the <choice> tag in the above schema has two nested <test> elements. Which one should it choose when it comes to a <test> tag?

    The problem is that the parser won't be able to tell if it should choose the first one or the second one until it analyzes the child <element>s. But parsers do not have this type of look-ahead, so this is flagged as an error in the schema document.

    A better schema for this situation would have been:

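      <element name="testcollection">
        <complexType>
          <sequence>
            <!-- one <test> declaration; the choice moves inside it -->
            <element name="test" maxOccurs="unbounded">
              <complexType>
                <choice>
                  <!-- either one or more <A>'s, or one or more <B>'s -->
                  <element name="A" minOccurs="1" maxOccurs="unbounded"
                           type="string"/>
                  <element name="B" minOccurs="1" maxOccurs="unbounded"
                           type="string"/>
                </choice>
              </complexType>
            </element>
          </sequence>
        </complexType>
      </element>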

    Now, suppose I had the more complicated situations of recursive groups:

    - - - - - -  first situation:  - - - - - - -

    <group>
      <A/>  <!-- Note: only <A>'s allowed from here on out -->
      <group>
        <A/>
      </group>
    </group>

    - - - - - - second situation: - - - - - - -

    <group>
      <B/>  <!-- Note: only <B>'s allowed from here on out -->
      <group>
        <B/>
      </group>
    </group>

    The idea is that once a group contains an <A>, it and all of its nested groups may contain only <A>'s from then on. Likewise, once it contains a <B>, only <B>'s are allowed.

    Initially, I tried to implement this as follows:

    - - - - - -  non-working schema attempt:  - - - - - - -

    <group name="AGroups">
      <choice>
        <element name="A" type="string"/>
        <element name="group">
          <complexType>
            <group ref="w:AGroups" minOccurs="0" maxOccurs="unbounded"/>
          </complexType>
        </element>
      </choice>
    </group>

    <group name="BGroups">
      <choice>
        <element name="B" type="string"/>
        <element name="group">
          <complexType>
            <group ref="w:BGroups" minOccurs="0" maxOccurs="unbounded"/>
          </complexType>
        </element>
      </choice>
    </group>

    <element name="group">
      <complexType>
        <choice>
          <group ref="w:AGroups" minOccurs="0" maxOccurs="unbounded"/>
          <group ref="w:BGroups" minOccurs="0" maxOccurs="unbounded"/>
        </choice>
      </complexType>
    </element>

    Okay, fine and dandy. But what happens in the following situation:

    - - - - - -  troublesome situation:  - - - - - - -

    <group>
      <group>
        <group>
          <B/>
        </group>
      </group>
    </group>

    When the XML parser reaches the outer-most <group> tag, where does it go? Does it process AGroups or BGroups? There is no way to tell. In fact, the parser has to go *way* down the tree in order to figure out that the inner-most group is a BGroup, and that therefore all of its parents need to be BGroups as well.

    Parsers are not this smart, so basically, this situation is simply not allowed.

    Instead, I had to change my XML format to be something like this:

    - - - - - -  first situation:  - - - - - - -

    <AGroup>
      <A/>
      <AGroup>
        <A/>
      </AGroup>
    </AGroup>

    - - - - - - second situation: - - - - - - -

    <BGroup>
      <B/>
      <BGroup>
        <B/>
      </BGroup>
    </BGroup>

    And now the schema is easy to write and works fine (this is left as an exercise for the student).
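
    For the impatient, here is one possible solution to the exercise. It is only a sketch: it assumes the same placeholder namespace and the w: prefix used in the attempt above, and it allows an <AGroup> or <BGroup> to be empty, which you may or may not want:

    ```xml
    <schema xmlns="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://your_namespace_here"
            xmlns:w="http://your_namespace_here"
            elementFormDefault="qualified">

      <!-- An <AGroup> holds any mix of <A>'s and nested <AGroup>'s -->
      <element name="AGroup">
        <complexType>
          <choice minOccurs="0" maxOccurs="unbounded">
            <element name="A" type="string"/>
            <element ref="w:AGroup"/>
          </choice>
        </complexType>
      </element>

      <!-- A <BGroup> holds any mix of <B>'s and nested <BGroup>'s -->
      <element name="BGroup">
        <complexType>
          <choice minOccurs="0" maxOccurs="unbounded">
            <element name="B" type="string"/>
            <element ref="w:BGroup"/>
          </choice>
        </complexType>
      </element>
    </schema>
    ```

    Because the two kinds of group now have different element names, the parser always knows which declaration applies as soon as it sees the start tag, so there is nothing ambiguous left.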

    Q: Hey, I just realized that my document root element is misspelled. And yet my document is supposedly "valid". How could that be?

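    (*sigh*) Yes, I know. Very strange and frustrating. A W3C Schema has no way to say which element is the document root: any globally declared <element> can be the root. On top of that, some validators fall back to "lax" validation when the root element has no declaration at all, instead of reporting an error. Why is that? I don't know. Maybe it has something to do with XHTML.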

    For example, the following is considered to be valid:

    - - - - - -  XML Document: - - - - -

    <zzz . . . >
      <A x="test"/>
    </zzz>

    - - - - - - XML Schema: - - - - -

    <schema . . . >
      <element name="A">
        <complexType>
          <attribute name="x" type="NMTOKEN" use="required"/>
        </complexType>
      </element>
    </schema>

    I think that the W3C intends for your application to check the document root. If you want to work around it in the meantime, use the http://www.w3.org/2001/03/webdata/xsv validator and look for the following message: "No declaration for document root found, validation was lax".

    Q: When I run my validity check with the W3 validator, I get these strange warning messages:

    Warning: allowing {**mynamespace**}:myattr as child because it matched wildcard(##any)

    Why?

    I think that this happens when you define an empty element, like this:

    <element name="test"/>

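    It seems to allow wildcard attributes in this situation. That makes sense: with no type specified, the element defaults to anyType, which accepts any attributes (and any children).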

    This happened to me when I was using <element name="test"/> rather than <element ref="t:test"/>. For example:

    - - - - - -  Sample non-working XML document:  - - - - - -

    <A . . . myattr="good">
      <members>
        <A junk="bad">  <!-- This line is valid with no errors -->
          <members>
            <B/>
          </members>
        </A>
        <B/>
      </members>
    </A>

    - - - - - - The incorrect schema which creates this situation: - - - - - -

    <schema . . . >
      <element name="A">
        <complexType>
          <sequence>
            <element name="members" minOccurs="0">
              <complexType>
                <group ref="t:groupTest" minOccurs="0" maxOccurs="unbounded"/>
              </complexType>
            </element>
          </sequence>
          <attribute name="myattr" type="string" use="required"/>
        </complexType>
      </element>

      <group name="groupTest">
        <choice>
          <element name="A"/>  <!-- This should be a ref="" -->
          <element name="B"/>
        </choice>
      </group>
    </schema>

    Changing the line identified above from:

      <element name="A"/>

    to:

      <element ref="t:A"/>

    fixes the problem.

    PART 4: The Xerces SAX Parsers

    Q: Why can't my compiler seem to find the most basic SAX classes, like DefaultHandler? I know I'm including the correct include files.

    This is because you forgot "XERCES_CPP_NAMESPACE_USE" in your file. For example:

    #include "xercesc/sax2/DefaultHandler.hpp"

    // Expands to a using-directive for the Xerces namespace
    XERCES_CPP_NAMESPACE_USE

    class MySAX2Handler : public DefaultHandler {
    public:
        MySAX2Handler();   // no "MySAX2Handler::" qualifier inside the class body

        void startElement(
            const   XMLCh* const    uri,
            const   XMLCh* const    localname,
            const   XMLCh* const    qname,
            const   Attributes&     attrs
        );
    .
    .
    .
    

    Q: How do I remove the duplicate attributes reported by the xerces SAX2 parser?

    Very strange. For example, when I run "SAX2Print" (one of the example programs for using the SAX2 parser provided by Xerces), I get the following output:

        <handler name="resize" visibleTo="package" visibleTo="class">

    The first "visibleTo" attribute is the one I actually put on the <handler> element inside my .XML file. The second one is the default value specified for the <handler> element in the .XSD (W3C schema definition) file.

    As far as I can tell, these duplicate attributes happen when all of the following are true:

      1. You are using the Xerces SAX2 parser.
      2. Validation is ON.
      3. Your schema has attributes with default values specified.
      4. Your XML file specifies a non-default value.
      5. You are using a W3C schema file (it works fine with just an ordinary DTD).

    What happens is that xerces reports the value set on the attribute, AND the default from the schema.

    I think this is a bug in the Xerces parser. I don't know how to fix it, except to check for the duplicate in your own code.

    Sorry.

    Q: Why, when I use Xerces C++, does my program crash on exit?

    Are you calling: XMLPlatformUtils::Terminate(); ?

    This routine will close up the Xerces API. If you call Terminate() before all of your Xerces objects have been deleted (especially the ones allocated on the *stack* inside of main() ), then your program will crash on exit.

    The best solution, I think, is to put the stack-allocated objects in sub-routines. Or simply allocate them on the heap and then "delete" them before calling Terminate().

    Q: Do you have an example of parsing an XML document from memory?

    Sure, try this (Xerces C++, SAX2 parser; everything from the standard example program has been removed):

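    // *** Other include files go here
    #include <cstdio>
    #include <xercesc/framework/MemBufInputSource.hpp>

    int main(int argc, char* argv[])
    {
        char buf[100000];
        int nBuf = 0;
        int c;

        // *** Initialize Xerces here ***
        . . .

        // Read the whole file into buf, leaving room for the trailing '\0'
        FILE *fp = fopen("class.xml", "rb");
        if (fp == NULL)
            return 1;   // could not open the file
        while (nBuf < (int)sizeof(buf) - 1 && (c = fgetc(fp)) != EOF)
            buf[nBuf++] = (char)c;
        buf[nBuf] = '\0';
        fclose(fp);

        // The last argument is only an identifier for the buffer.
        // Note: L"..." assumes wchar_t is the same size as XMLCh (as on Windows).
        MemBufInputSource *pSource =
            new MemBufInputSource((XMLByte *)buf, nBuf, L"class.xml");

        // *** execute the parse routine here, as in parser->parse(*pSource)

        // *** Do other cleanup here

        delete pSource;

        // *** Terminate Xerces here
    }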

    Q: How can I get my Xerces SAX-1 Parser to recognize namespaces or my W3C XML Schema? It seems like it only understands DTDs.

    Try calling the SAXParser::useScanner() method, as follows:

      SAXParser* parser = new SAXParser();
      parser->setDoValidation(true);    // optional.
      parser->setDoNamespaces(true);    // optional

      MySAXHandler* docHandler = new MySAXHandler();
      ErrorHandler* errHandler = (ErrorHandler*) docHandler;
      parser->setDocumentHandler(docHandler);
      parser->setErrorHandler(errHandler);
      parser->useScanner(L"SGXMLScanner");