What a zinger of a title 😆. Yes, casual reader, this topic is probably as boring as it sounds. I take no offense in you leaving now. If you’re still reading, it’s because you also must accomplish the title’s task, and probably encountered one of the dozen issues I will describe. I am sorry, but continue reading, I hope this will help.
For the 2019 tax season, the Canada Revenue Agency (CRA) gave new instructions that institutions now need to send them an electronic report of all the T2202 forms they issued to students in 2019.
This electronic report is to be an XML file, meeting the specifications defined in their schema files, as mentioned in their guide and “schema” description pages (which are a strange combination of XML and comments.)
XML is an epic file format. It became so over-engineered and confusing that the software industry has been steering clear of it for a decade. Except government.
Also, the school’s system at Pacific Rim Early Childhood Institute Inc, who employs me part-time, is written in Python 2. If I weren’t messing around with this XML report, I would have been working on updating it to Python 3 😅.
Lastly, as if making XML and Python 2 play nice wasn’t going to be interesting enough, some good ol’ mistakes and lack of real documentation made this project a real doozy.
So if you have a similar task, I think reading over what I did, what things I learned, and what challenges I had to overcome, may be helpful. Let’s get started.
Ok, so I already had Python 2 set up on a server, with a site running on Django (but the web framework really didn’t play a part in any of this.)
I went to the CRA’s guide Filing Information Returns Electronically (T4/T5 and other types of returns) – How to file. Halfway down the page is a link to download the information schema to be used in 2020. I downloaded it to my local machine. (Spoiler! If you’re following along, don’t do that. At the time of writing, their file has a bunch of problems in it!)
I unzipped the file, and found it contained a handful of XSD files.
In case you’re rusty on what XSD files are, this tutorial may be helpful. But they define the structure (or schema) of the XML file you need to create.
Now that I had the XSD files, I didn’t want to have to read through them manually to determine the structure of the XML file I wanted (full disclosure: I have needed to anyway 😩). It’s best to instead have a program read through it all, and generate code you can use.
I searched around, and the tool I ended up using was GenerateDS. It’s a command line tool, written in Python, that will take the XSD files as input, and create Python classes as the output.
GenerateDS’s documentation mentions a few ways to download it, but the easiest I found was to use the Python package manager pip (I already had it installed, so may as well.)
I ran pip install generateDS on the server, and it took care of downloading generateDS and its dependencies, and added the executable generateDS onto the path so I could use it as a command from anywhere.
Given a valid XSD file, generateDS makes generating XML to meet that schema easy. You’re supposed to be able to just run a command like generateDS -o cra.py -s subcra.py schema.xsd and it will read schema.xsd and then write a bunch of Python classes to cra.py and subcra.py. All you need is a valid XSD file… It turned out just getting that wasn’t as easy as planned.
The first problem I faced in using these files was the occasional accented characters they contain. One of the big differences between Python 2 and Python 3 is support for these special characters (Python 2 can support it, like I blogged a while ago, but you need to explicitly enable it; and code that works with Python 2 and 3 can easily overlook this.)
When I ran the generateDS on the original XSD files, I got this error:
Traceback (most recent call last):
  File "/usr/local/bin/generateDS", line 11, in <module>
    sys.exit(main())
  File "/usr/local/bin/generateDS.py", line 8752, in main
    superModule=superModule)
  File "/usr/local/bin/generateDS.py", line 8166, in parseAndGenerate
    prefix, root, options, args, superModule)
  File "/usr/local/bin/generateDS.py", line 7917, in generate
    generateSimpleTypes(wrt, prefix, SimpleTypeDict, root)
  File "/usr/local/bin/generateDS.py", line 7846, in generateSimpleTypes
    writeEnumClass(simpleType)
  File "/usr/local/bin/generateDS.py", line 7815, in writeEnumClass
    output += docstring if docstring else ''
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 306: ordinal not in range(128)
I suspect that’s only a problem with generateDS when using Python 2 on XSD files with accented characters. So I manually scanned through them and removed all the accented characters from the descriptions. That removed that error.
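Scanning by hand works for a handful of files, but the same clean-up can be scripted. Here is a minimal sketch, written for Python 3 and assuming the XSD files are UTF-8 encoded (the 0xc3 byte in the error hints at that). The names to_ascii and asciify_file are hypothetical helpers, not part of generateDS.

```python
import unicodedata


def to_ascii(text):
    """Return text with accents stripped and other non-ASCII characters dropped."""
    # NFKD splits an accented letter into the base letter plus a combining mark...
    decomposed = unicodedata.normalize("NFKD", text)
    # ...and encoding with errors="ignore" then discards the non-ASCII marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def asciify_file(src_path, dst_path):
    """Write an ASCII-only copy of a UTF-8 file (assumption: input is UTF-8)."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = to_ascii(src.read())
    with open(dst_path, "w", encoding="ascii") as dst:
        dst.write(cleaned)


print(to_ascii("Numéro d'assurance sociale"))
```

Characters with no ASCII base letter (curly quotes, for instance) are simply dropped, so it is worth diffing the cleaned files against the originals before trusting them.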
The next problem I discovered was that I didn’t know which was the “main” XSD file. Like I mentioned before, I was given a zipped folder (xmlschm1-20-3.zip) containing a half dozen XSD files, but generateDS expects one: the one that actually describes the top-level element of the XML file you’re supposed to be generating. The other ones were all imported by that main one, and so only describe part of the schema, and nowhere did the CRA actually say which one to use (that’s like them giving you several pages of instructions, but in a random order so you don’t know where to start.)
At first I assumed it was the one with a name matching what I wanted to generate: t2202.xsd (for making a T2202 tax form). Using generateDS on it initially worked, but didn’t produce anything actually usable. I realized the file didn’t define any elements I was supposed to put in the XML file. It just defined types. So I hunted around for a file that actually defined a top-level element. Eventually, I found layout-topologie.xsd, which was it.
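That hunt can be automated: the “main” file is the one that declares an element directly under the schema root, rather than only types. A rough sketch using only Python 3’s standard library follows; global_elements and top_level_elements are hypothetical names, and the folder layout is an assumption.

```python
import glob
import os
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"


def global_elements(schema_root):
    """Return names of elements declared directly under <xsd:schema>."""
    # findall() with a bare tag only looks at direct children, so elements
    # nested inside complex types are not reported.
    return [el.get("name") for el in schema_root.findall(XSD_NS + "element")]


def top_level_elements(xsd_dir):
    """Map each XSD filename in xsd_dir to the global elements it declares."""
    found = {}
    for path in sorted(glob.glob(os.path.join(xsd_dir, "*.xsd"))):
        names = global_elements(ET.parse(path).getroot())
        if names:
            found[os.path.basename(path)] = names
    return found
```

Calling top_level_elements on the unzipped folder should list only the files that declare a top-level element, which narrows the candidates quickly.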
I went to the folder with the XSD files, and ran generateDS -o cra.py -s subcra.py layout-topologie.xsd. That’s where I got the next problem…
When I ran generateDS on layout-topologie.xsd, I got this error:
Traceback (most recent call last):
  File "/usr/local/bin/generateDS", line 11, in <module>
    sys.exit(main())
  File "/usr/local/bin/generateDS.py", line 8752, in main
    superModule=superModule)
  File "/usr/local/bin/generateDS.py", line 8145, in parseAndGenerate
    no_redefine_groups=noRedefineGroups,
  File "/usr/local/bin/process_includes.py", line 180, in process_include_files
    infile, outfile, inpath, options)
  File "/usr/local/bin/process_includes.py", line 493, in prep_schema_doc
    schema_ns_dict, rename_data, options)
  File "/usr/local/bin/process_includes.py", line 350, in collect_inserts
    rename_data, options)
  File "/usr/local/bin/process_includes.py", line 362, in collect_inserts_aux
    string_content = resolve_ref(child, params, options)
  File "/usr/local/bin/process_includes.py", line 317, in resolve_ref
    raise SchemaIOError(msg)
process_includes.SchemaIOError: Can't find file ../xmlschm1-20-3-ascii/t4.xsd referenced in ../xmlschm1-20-3-ascii/complex.xsd.
Exception SystemError: '../Objects/codeobject.c:64: bad argument to internal function' in <generator object at 0xb675761c> ignored
I looked inside the XSD file, and sure enough there was an import line referencing a file named t4.xsd, but the zip they provided didn’t have it. Some engineer at the CRA left it out 😒.
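Rather than discovering missing files one traceback at a time, the whole folder can be checked up front. The sketch below (Python 3, standard library only) lists every include or import whose target file is absent. The names referenced_files and missing_files are hypothetical, and it assumes all schemaLocation values are paths relative to the folder itself.

```python
import os
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"


def referenced_files(schema_root):
    """Return the schemaLocation of each include/import directly under the schema."""
    locations = []
    for tag in ("include", "import"):
        for ref in schema_root.findall(XSD_NS + tag):
            if ref.get("schemaLocation"):
                locations.append(ref.get("schemaLocation"))
    return locations


def missing_files(xsd_dir):
    """List (referencing file, missing file) pairs for a folder of XSD files."""
    missing = []
    for name in sorted(os.listdir(xsd_dir)):
        if not name.endswith(".xsd"):
            continue
        root = ET.parse(os.path.join(xsd_dir, name)).getroot()
        for location in referenced_files(root):
            if not os.path.exists(os.path.join(xsd_dir, location)):
                missing.append((name, location))
    return missing
```

An empty list from missing_files means every referenced schema is present, at least by filename.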
I double-checked I downloaded the right ZIP file, and I did. But I also noticed the previous year’s ZIP file, xmlschm1-19-1.zip, did contain those missing files. So I manually merged the two folders 😫 (when both folders had the same file, I kept the one from the newer zipped folder, xmlschm1-20-3).
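The manual merge can be scripted too. This is only a sketch (Python 3, standard library) that assumes both zips were unpacked into flat folders; merge_schema_folders and its folder arguments are hypothetical names.

```python
import os
import shutil


def merge_schema_folders(older_dir, newer_dir, merged_dir):
    """Copy files from both folders into merged_dir; the newer copy wins on a clash."""
    if not os.path.isdir(merged_dir):
        os.makedirs(merged_dir)
    # Copy the older folder first so files from the newer one overwrite duplicates.
    for source_dir in (older_dir, newer_dir):
        for name in sorted(os.listdir(source_dir)):
            source_path = os.path.join(source_dir, name)
            if os.path.isfile(source_path):
                shutil.copy(source_path, os.path.join(merged_dir, name))
    return sorted(os.listdir(merged_dir))
```

It returns the merged file list, which makes it easy to confirm that the previously missing files are now there.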
Here’s a copy of the working zip I created, in case you want it:
Then, with a complete schema, I re-ran generateDS -o cra.py -s subcra.py layout-topologie.xsd, and it worked 😍: cra.py and subcra.py were full of Python classes derived from the plethora of XSD files I had been given.
Before I started using the auto-generated code, I had to do a couple of things.
In subcra.py, in the line import ??? as supermod, change ??? to be whatever the module name is for where the files are located. I put them inside my project in libs/cra/, so I changed the line to import libs.cra.cra as supermod.
I also needed to make sure the folder that contains cra.py and subcra.py also had an __init__.py file, so that Python recognizes it as a Python package, and I can then use the classes from elsewhere in my code.
At one point I thought using generateDS would mean I wouldn’t need to read and understand the XSD files. That would have been great, but it didn’t turn out that way.
It seems the classes generated by generateDS allow you to make syntactically-correct XML, but they only do a bit of validation for you, and they’re not terrifically documented. So by themselves, it’s hard to know what arguments you can pass the classes’ constructor and methods. For that, you need to read the XSD files.
If you’re like me and rusty on the format of XSD files, it’s probably good to review the W3 XML Schema Tutorial. That’ll remind you about Elements, Simple Types, Complex Types, and all that.
So it’s helpful to open up the top-level XSD file, in this case layout-topologie.xsd. After the opening line that declares the file to be XML, and some comments, it contains this:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- @@@@@@ Include Related Schemas @@@@@@ 2019/May/7 Version# 1.19 (version #.yy)-->
  <xsd:include schemaLocation="simple.xsd"/>
  ...tons of other schemas...
  <xsd:include schemaLocation="t2202.xsd"/> <!-- Add T2202 May 2019 -->
  <xsd:include schemaLocation="frms.xsd"/>
  <!-- @@@@ Common Record Layout @@@@ -->
  <xsd:element name="Submission" type="ReturnType"/>
  <xsd:complexType name="ReturnType">
    <xsd:sequence>
      <xsd:element name="T619" type="TransmitterType"/>
      <xsd:element name="Return" type="ReturnChoiceType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="TransmitterType">
    <xsd:all>
      <xsd:element name="sbmt_ref_id" type="char8Type"/>
      <xsd:element name="rpt_tcd" type="otherDataType"/>
      <xsd:element name="trnmtr_nbr" type="transNbrType"/>
      <xsd:element name="trnmtr_tcd" type="indicator1-4Type" minOccurs="0"/>
      <xsd:element name="summ_cnt" type="int6Type"/>
      <xsd:element name="lang_cd" type="languageType"/>
      <xsd:element name="TRNMTR_NM" type="Line2Type"/>
      <xsd:element name="TRNMTR_ADDR" type="CanadaAddressType"/>
      <xsd:element name="CNTC" type="ContactType"/>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
So it first includes all the other XSD files, then finally mentions the top-level element: Submission, which it says is of type ReturnType. So with that we know the XML will need to look something like this:
<Submission> ... </Submission>
But what attributes can Submission have? And what’s the next element under it? We need to find where its type, ReturnType, is declared… It could be in any of those other XSD files, but, this time, it’s on the very next line.
<xsd:complexType name="ReturnType">
  <xsd:sequence>
    <xsd:element name="T619" type="TransmitterType"/>
    <xsd:element name="Return" type="ReturnChoiceType" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
ReturnType is a complex type, meaning it has sub-elements, named T619 and Return (in that sequence), of types TransmitterType and ReturnChoiceType, respectively. Oh, and there can be as many Return elements as we want. So we know a little bit more about the XML we want:
<Submission>
  <T619>...</T619>
  <Return>...</Return>
  ...
  <Return>...</Return>
</Submission>
And from there, we need to find where the types TransmitterType and ReturnChoiceType are declared. Once we find them, we’ll know which child elements we need to add, and can go find the types of those child elements, etc. (By the way, TransmitterType is in this same file, but ReturnChoiceType is elsewhere… I used grep ReturnChoiceType * from the command-line to find which file it was in.)
So that’s how it goes with understanding the XSD file. You could start building the XML from basic strings, based on what you’ve read in the XSD file. But there’s quite a bit of repetitive code you’ll need, which generateDS took care of for us. But for that, you’re going to need to figure out how the XSD corresponds to the Python Classes.
Simple Types map to Python built-in types (eg unicode strings, ints, floats, etc); whereas XML complex types map to Python classes (that were placed in cra.py by GenerateDS for you.) But which ones?
Each complex type declared in the XSD files corresponds to a Python Class by the same name. Eg, the top-level element was named Submission and it was of type ReturnType. So I searched through cra.py for class ReturnType, and sure enough there it was. Something like this:
class ReturnType(GeneratedsSuper):
    __hash__ = GeneratedsSuper.__hash__
    subclass = None
    superclass = None
    def __init__(self, T619=None, Return=None, gds_collector_=None, **kwargs_):
        ...
    # ...then a bunch of setter and getter methods
    # ...and some export methods (for making the XML)
    # ...and some build methods (for interpreting XML)
So in there we see what arguments we can pass to this class: T619, Return, and something called gds_collector_ (not sure what that one was for.)
So the Python code to generate that could look like
from libs.cra.cra import ReturnType
submission = ReturnType(T619=None, Return=None)
If you’re like me, at this point you’re dying to make at least the tiniest XML file to confirm you’re on the right track. So you can use that submission to generate a string of XML with the following
import StringIO  # Python 2 module; on Python 3 this lives in io
output = StringIO.StringIO()
submission.export(output, 0, name_='Submission')
generated_xml = output.getvalue()
Then send generated_xml out to the page, console, or to a file, whatever. It should produce a glorious (if nearly empty) Submission element.
To add the sub-elements, T619 and Return, you need to:
- find where those elements were declared in the XSD files (they’re part of the ReturnType complex type, located in the top-level XSD file)
- find their XSD types in the XSD files (they’re of type TransmitterType and ReturnChoiceType)
- find classes in cra.py with the same names as those complex types (TransmitterType and ReturnChoiceType)
Then use those classes in your code too. But what arguments do they take? Recurse (ie, start at step 1 again). Yes, it gets tedious, but you do eventually hit the bottom.
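To get an overview before recursing by hand, the element tree can be printed straight from the XSD. The following is a sketch only (Python 3, standard library), run against a cut-down stand-in schema rather than the real CRA files. The names index_complex_types and outline are hypothetical, and it ignores elements declared with ref= and anonymous types.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# A cut-down stand-in for the real schema, just to exercise the walk.
SAMPLE_XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Submission" type="ReturnType"/>
  <xsd:complexType name="ReturnType">
    <xsd:sequence>
      <xsd:element name="T619" type="TransmitterType"/>
      <xsd:element name="Return" type="ReturnChoiceType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="TransmitterType">
    <xsd:all>
      <xsd:element name="sbmt_ref_id" type="char8Type"/>
      <xsd:element name="lang_cd" type="languageType"/>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
"""


def index_complex_types(schema_root):
    """Map each named complex type to its (child element name, child type) pairs."""
    index = {}
    for ctype in schema_root.findall(XSD_NS + "complexType"):
        index[ctype.get("name")] = [
            (el.get("name"), el.get("type")) for el in ctype.iter(XSD_NS + "element")
        ]
    return index


def outline(type_name, index, depth=0, seen=()):
    """Return indented lines describing the element tree under type_name."""
    lines = []
    for child_name, child_type in index.get(type_name, []):
        lines.append("  " * depth + "%s (%s)" % (child_name, child_type))
        # Recurse into known complex types only, skipping types already on the path.
        if child_type in index and child_type not in seen:
            lines.extend(outline(child_type, index, depth + 1, seen + (type_name,)))
    return lines


index = index_complex_types(ET.fromstring(SAMPLE_XSD))
print("\n".join(outline("ReturnType", index)))
```

For the real schema, the index would have to be built from every included XSD file, since the types are spread across them. Types the index does not know about (like ReturnChoiceType in this sample) are printed but not expanded.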
Sometimes tag names differ from their type. In that case, use the Python class that corresponds to the XSD type; then set original_tagname_ to the name the tag is supposed to have; and extensiontype_ to the tag’s type. Eg
student_address_line1 = Character30TextType(
    valueOf_=student.street_address
)
student_address_line1.original_tagname_ = 'AddressLine1Text'
student_address_line1.extensiontype_ = 'Character30TextType'
It’s pretty tedious. I know.
Once I finally generated some XML, I saw the dollar amount fields were being formatted strangely: there were no numbers after the decimal place. Eg it showed “4615.” instead of “4615.00”.
Seeing how generateDS created usable Python code, I decided to try to hunt down the problem in the generated code and fix it myself.
The SchoolSession XML tag corresponded to the Python class SchoolSessionType. So I looked at its methods for generating (or “exporting”) an XML string. It had a method called export, which called exportChildren to generate the XML for the sub-tags. In that method, I saw it used self.gds_format_decimal to format the dollar amount. I found that method on GeneratedsSuper, and it looked like this:
def gds_format_decimal(self, input_data, input_name=''):
    return ('%0.10f' % input_data).rstrip('0')
(input_data is a Decimal.) So it’s formatting the number to have 10 decimal places (eg 4615.0000000000), but then removing all the 0 characters on the right side of the string (eg leaving 4615.).
So I modified the method to be this:
def gds_format_decimal(self, input_data, input_name=''):
    return '%0.2f' % input_data
That’s probably not generally good, as it forces ALL XML that has decimal numbers to only be shown up to 2 decimal places. But I’m just working with dollars and cents, not precise scientific measurements. This produced numbers with 2 decimal places, so this did the trick.
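To see the behaviour in isolation, here is a small standalone sketch (Python 3) that reproduces the original formatting and compares it with a two-decimal formatter built on the decimal module. The name format_money is hypothetical, and rounding half up is an assumption about what is wanted for dollar amounts, not something the CRA documentation confirmed.

```python
from decimal import Decimal, ROUND_HALF_UP


def original_format(value):
    """The generated method's logic: 10 decimal places, then strip trailing zeros."""
    return ("%0.10f" % value).rstrip("0")


def format_money(value):
    """Format an amount with exactly two decimal places, rounding half up."""
    # str() gives the short printed form of a float rather than its exact
    # binary expansion, so 2.675 is treated as 2.675 and not 2.67499999...
    amount = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(amount)


print(original_format(4615))            # the "4615." seen in the XML
print(format_money(4615))
print(format_money(Decimal("12.345")))
```

Using quantize makes the rounding rule explicit, whereas '%0.2f' rounds whatever binary float it is handed.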
After I generated what looked like valid XML, it was time to get a machine to run a more thorough scan of it to double-check it was all good. For that, I found the Python library xmlschema.
While there are a ton of XML validating web services out there, they all worked with only one XSD file — not a folder full of them. Or they would use a URL to an XML schema, but I didn’t feel like uploading the directory of XML schemas I had been provided to somewhere random to get them working like that.
Plus, using a Python library allows me to run the validation automatically before writing out the XML file, rather than making validation a separate, manually invoked step.
Installing and using the library was super easy: pip install xmlschema. Then after my code had generated the XML, I ran
from xmlschema import XMLSchema
schema = XMLSchema('libs/cra-xsds/layout-topologie.xsd')
schema.validate(generated_xml)
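Since validate() raises on the first problem it meets, it can be handy to collect every problem in one pass instead. The sketch below relies on xmlschema’s iter_errors() method and on its error objects exposing reason and path. The describe_errors helper is a hypothetical name, and the usage in the trailing comment assumes the library is installed and the schema path matches your project.

```python
def describe_errors(schema, xml_text):
    """Return one readable line per validation problem (empty list if valid)."""
    lines = []
    # iter_errors() yields each validation error instead of raising the first.
    for error in schema.iter_errors(xml_text):
        path = getattr(error, "path", None) or "?"
        lines.append("%s: %s" % (path, error.reason))
    return lines


# Intended use with the real library:
#
#   from xmlschema import XMLSchema
#   schema = XMLSchema('libs/cra-xsds/layout-topologie.xsd')
#   problems = describe_errors(schema, generated_xml)
#   if problems:
#       raise ValueError("\n".join(problems))
```

Raising only after the list is built means one run shows everything that needs fixing, which helps when each fix otherwise reveals the next error.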
If there was a validation error, I’d get an exception with a ton of helpful details, like the following:
Here it’s telling me the XML <StudentName>Colin 'O Hare Nelson</StudentName> is invalid because “character data between child elements not allowed”, and it even shows the relevant section of the XSD file that describes what the XML should look like. It seems I had incorrectly thought StudentName was a simple XML element, but it’s actually a complex XML element with sub-elements, including GiveName (optional).
While I was resolving some validation issues regarding how I generated the XML, I discovered a couple issues that seem to have been more generateDS’s fault.
One of the validation errors I got was this:
XMLSchemaValidationError at /admin/cra_report/
failed validating <Element 'StudentName' at 0x8f5749ac> with XsdGroup(model=u'sequence', occurs=[1, 1]):
Reason: character data between child elements not allowed!
Somehow the generated Python class for T2202SlipType thought its child element StudentName should be a simple type, ie a simple string. But the XSD files specifically said it needed to be a complex type (an element with sub-elements). So I needed to manually fix that Python class’s exportChildren method, where it said
if self.StudentName is not None:
    namespaceprefix_ = self.StudentName_nsprefix_ + ':' if (UseCapturedNS_ and self.StudentName_nsprefix_) else ''
    showIndent(outfile, level, pretty_print)
    outfile.write('<%sStudentName>%s</%sStudentName>%s' % (namespaceprefix_ , self.gds_encode(self.gds_format_string(quote_xml(self.StudentName), input_name='StudentName')), namespaceprefix_ , eol_))
I had to replace it with the following:
if self.StudentName is not None:
    namespaceprefix_ = self.StudentName_nsprefix_ + ':' if (UseCapturedNS_ and self.StudentName_nsprefix_) else ''
    self.StudentName.export(outfile, level, namespaceprefix_, namespacedef_='', name_='StudentName', pretty_print=pretty_print)
I also found generateDS’s “pretty print” option was sometimes randomly adding a ton of whitespace inside elements, which made them invalid if they had a character limit.
For example, look at this XML it generated:
<ContactInformation>
  <ContactName>George Arbuckle </ContactName>
  <ContactAreaCode>250</ContactAreaCode>
  <ContactPhoneNumber>652-1234</ContactPhoneNumber>
</ContactInformation>
See how the tag ContactName has a TON of whitespace after its content “George Arbuckle”? That was being added by GenerateDS (it’s not part of my data).
Turning pretty printing off fixed it. Eg
submission.export(output, 0, name_='Submission', namespacedef_=XMLNS, pretty_print=False)
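Padding like this is easy to miss by eye, so a quick automated check on the generated XML can help. A minimal sketch (Python 3, standard library) follows; padded_elements is a hypothetical helper, and it only inspects leaf elements, where whitespace is actual data rather than indentation.

```python
import xml.etree.ElementTree as ET


def padded_elements(xml_text):
    """Return tags of leaf elements whose text has leading or trailing whitespace."""
    padded = []
    for element in ET.fromstring(xml_text).iter():
        is_leaf = len(element) == 0
        text = element.text or ""
        if is_leaf and text and text != text.strip():
            padded.append(element.tag)
    return padded


sample = (
    "<ContactInformation>"
    "<ContactName>George Arbuckle      </ContactName>"
    "<ContactAreaCode>250</ContactAreaCode>"
    "</ContactInformation>"
)
print(padded_elements(sample))
```

Running it on the final output is a cheap guard in case pretty printing gets switched back on later.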
I also got an error like
XMLSchemaKeyError at /admin/cra_report/
u"missing an XsdSimpleType or XsdComplexType component for 'George Arbuckle'! As the name has no namespace maybe a missing default namespace declaration."
I was quite confused by this. If I had requested to make an element of type “George Arbuckle” I think I would have remembered that. In the XML I saw this:
<ContactInformation>
  <ContactName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="George Arbuckle">George Arbuckle </ContactName>
  <ContactAreaCode>250</ContactAreaCode>
  <ContactPhoneNumber>652-1234</ContactPhoneNumber>
</ContactInformation>
I had no idea where xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="George Arbuckle" was coming from. The Python code in-use was:
contact_name = ContactName(
    valueOf_=TRANSMITTER_NAME_LINE1[:22],
)
contact_info = ContactType3(
    ContactName=contact_name,
    ContactAreaCode=TRANSMITTER_AREA_CODE,
    ContactPhoneNumber=TRANSMITTER_PHONE
)
contact_info.original_tagname_ = 'ContactInformation'
contact_info.extensiontype_ = 'ContactType3'
I noticed in the XML schema though, that this tag ContactInformation was optional. So I just gave up trying to add it. (If you know the actual reason for the error please comment!)
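This does not explain where the attribute came from, but stray xsi:type attributes can at least be spotted before validation. A small sketch (Python 3, standard library) follows; xsi_type_overrides is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}localname".
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def xsi_type_overrides(xml_text):
    """Return (tag, xsi:type value) for every element carrying an xsi:type attribute."""
    return [
        (element.tag, element.get(XSI_TYPE))
        for element in ET.fromstring(xml_text).iter()
        if element.get(XSI_TYPE) is not None
    ]


sample_with_override = (
    "<ContactInformation>"
    '<ContactName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:type="George Arbuckle">George Arbuckle</ContactName>'
    "<ContactAreaCode>250</ContactAreaCode>"
    "</ContactInformation>"
)
print(xsi_type_overrides(sample_with_override))
```

Any value in that list that is not a type name from the schema points at an element worth inspecting.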
And after all that, I was finally generating an XML file that passed validation according to the CRA’s XSD files. 🎉🎉🎉
But there were still some issues about the content that weren’t quite clear. I asked the school’s accountant, who asked the CRA, from whom we’re still awaiting a reply. (I’ll update the post once they respond.) The questions are:
- Do we send the T2202 forms to ALL students, or just to those who are enrolled in a course during the year?
- If a student is enrolled in a course during 2019, but that enrollment’s payment was received in 2018, on what year do we show them as enrolled? If it’s 2019, what do we set the “Eligible Tuition Fee Amount” to? (Considering we didn’t actually receive the funds in 2019.)
- If the student has not provided us with their Social Insurance Number (SIN, which the CRA now requires we collect), they said we just need to keep a record that we’ve attempted to collect it from them (otherwise there is a fine). So in some circumstances, it’s expected students will not have provided us with their SIN. But the XSD files still think it’s a requirement, and indicate the submitted XML file is invalid if the element is blank. So should we not report on students without SINs? Should we use a placeholder like “000000000”?
Well that was a wild ride, that’s not entirely completed. If you’re a lucky soul who’s also undergoing this task, please comment or contact me to connect. We may be able to make more sense of all this.