Speech to Tenth Annual Roundtable Conference
Assessment, Reporting and Technology - System-wide assessment and reporting in the 21st century
24 October 2005
I approach the subject of the application of technology to system-wide assessment and reporting from the point of view of someone with an interest in the management of large-scale education systems, and in the regulation and delivery of national tests and examinations.
From that perspective, the key strategic issue is one of public sector procurement - the purchasing of a set of technologically based solutions to meet a specific purpose.
Public sector procurement is a term that conjures up images of politically fraught issues such as the purchase of the Eurofighter aircraft or the Collins Class submarine, rather than the replacement of paper and pencil tests.
In the world of education, technology for assessment and reporting is the third of three potentially transformative but still incomplete major reforms in the procurement of goods and services by large education systems in the last twenty years.
One is the private funding of public infrastructure, or PFI. Following the model adopted in roads, hospitals and prisons, education systems have variously turned to the private financial sector to build, service and even operate schools and colleges, with the return on investment being generated by multiple use of the facilities on a commercial basis.
Outsourcing of the provision of goods and services is the second: school meals, the management of national strategies in literacy and numeracy, counselling and career advice, school maintenance, school security, courier services and so on.
Both have real potential for good; both involve risk; both need high-quality strategic management; both require, as an essential condition for success, the building of a political constituency for reform; and both have brought benefits and grief.
The application of technology in education has been rather more enthusiastically welcomed by teachers than have been PFI and outsourcing. But that welcome should not be uncritical. In my view, the procurement of technologically based systemic assessment and reporting systems is potentially as fraught as any other area of public sector procurement. In these remarks, I therefore want to focus on the opportunities technology offers and the risks it presents; on how the first might be maximised and the second ameliorated; and on the strategies that governments and systems might find most effective in achieving the benefits of reform.
I begin with the market-place of goods and services to deliver system-wide assessment and reporting in the 21st century. The suppliers have set out their stalls. This is what you can buy.
First, there is on-screen testing.
- The most basic is straight paper translation - the replication of a paper test on a computer screen. This is a 'quick win' with some benefits. It requires no change in the design of the test or the setting of questions, yet it eliminates the printing and physical distribution of test papers, and it allows flexibility in the location and time of testing.
- Or you can buy tests which, like paper tests, use closed tasks or questions for which there is a known and pre-specified right answer, but which also use video or audio clips, drag and drop actions, oral language, or prompts and clues to assist struggling students - any sorts of modifications which are not replicable on paper.
- Transformational on-screen tests are radically different. They exploit the capabilities of interactivity between IT, data and media, using scenario-based environments and a 'virtual world' data set, and include open tasks that have no predetermined answer or solution. They require candidates to demonstrate their ability through process, as well as producing a range of outcomes.
At the stall just down the road you can buy three different ways of managing the test questions or test items.
- If you want the tests to be available beyond a one-off bounded time slot, you need an item bank from which tests and tasks can be selected, to avoid the potential for duplication.
- You need randomised selection or version control, so that schools and candidates cannot predict the content of the test - something new and different always comes out of the bank.
- And if you want to move away from simple paper translation, you need to develop new types of question items and tasks. New skills are required for the development of items with modifications not replicable on paper, and especially for the development of items for transformational on-screen testing. And that is because the question-setting role of the examiner, and the software development role of the supplier of technical solutions, have merged to become one activity.
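As a minimal sketch of what randomised selection from an item bank might look like, the Python below draws a fixed number of distinct items for each test form. The names `draw_test`, `item_bank` and the seed values are hypothetical, introduced only for illustration; a real item bank would also balance difficulty and syllabus coverage, which this sketch does not attempt.

```python
import random


def draw_test(item_bank, n_items, seed):
    """Draw n_items distinct items from the bank.

    The seed stands in for version control: the same seed always
    reproduces the same form, so a sitting can be audited later.
    """
    rng = random.Random(seed)
    return rng.sample(item_bank, n_items)


# Hypothetical bank of 50 item identifiers.
item_bank = [f"item-{i:03d}" for i in range(1, 51)]

form_a = draw_test(item_bank, 10, seed=1)  # one sitting
form_b = draw_test(item_bank, 10, seed=2)  # a later sitting
```

Because each form is a sample without replacement, no item repeats within a form, and recording the seed is one simple way to recover exactly which items a candidate saw.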
Nearby is a third stall that specialises not in the management of tests and examinations, but in the processing and management of coursework submitted as part of the overall assessment in a subject.
- At its simplest, this consists of the submission and marking of coursework submitted in computer or multi-media format, and the issuing of results. It can include digital pictures of artwork, sound, video presentations of performances, documents and spreadsheets.
- At a more sophisticated level, you can buy ongoing portfolio management, enabling access to the portfolio by assessors and verifiers, and maintenance of a personal learning record.
Next door is a very large stall, which sells one product only: on-screen marking. Their particular skill is to scan written answers electronically, to present the answers on computer to external markers, and to gather and process the results. Their product is strong and convincing:
- better quality marking, through early detection and remediation of aberrant marking;
- random distribution of scripts and items to markers;
- specialisation of markers in a limited number of items;
- reduction of clerical errors, because the computer sums the marks;
- elimination of paper distribution; and
- greater security.
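One way the early detection of aberrant marking could work is to seed each marker's queue with responses whose marks have already been agreed by senior examiners, and to compare the marker's awards against them. The sketch below assumes that approach; the function name, the data layout and the tolerance of one mark are all hypothetical, not a description of any supplier's product.

```python
def flag_aberrant_markers(agreed, awarded, tolerance=1.0):
    """Return markers whose mean absolute deviation from the agreed
    marks on pre-marked seed responses exceeds the tolerance."""
    flagged = []
    for marker, marks in awarded.items():
        deviations = [abs(marks[r] - agreed[r]) for r in marks]
        if sum(deviations) / len(deviations) > tolerance:
            flagged.append(marker)
    return sorted(flagged)


# Agreed marks for three seed responses (hypothetical data).
agreed = {"r1": 4, "r2": 2, "r3": 5}

# Marks awarded by two markers to the same seed responses.
awarded = {
    "marker_a": {"r1": 4, "r2": 3, "r3": 5},  # close to the agreed marks
    "marker_b": {"r1": 1, "r2": 5, "r3": 2},  # three marks out on every response
}

flagged = flag_aberrant_markers(agreed, awarded)
```

Here only `marker_b` is flagged, and because the comparison can run as soon as a marker has seen a few seed responses, remediation can begin before many live scripts are affected.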
The second last of the market stalls is largely owned by the on-screen markers, and depends upon electronic capture of candidates' responses. It sells automated marking in three versions. Automated marking reduces the need for markers and allows very rapid generation of results.
- The simplest and most widely used is automated, multiple choice marking done by machine.
- The second is automated short-answer marking, generally using character recognition software.
- The third is process-based marking. It requires increasing intelligence in the marking engine and redesign of the tests to make the tasks and responses more amenable to automated marking. There is not much of it actually on the shelves at the moment, but we are promised that more is soon to come.
And finally, we have the sixth stall, which sells data capture, processing and reporting at a systemic level.
- You can buy systems which hold data on all units, subjects and qualifications in your national qualifications framework, on rules of combination, on prerequisites, syllabuses, assessment methods and so on.
- There are systems which allow individual learners to assemble e-transcripts of their results and qualifications, across the various agencies responsible for the assessment, awarding or accreditation of its components.
- And there are systems of data capture and reporting which generate data in forms useable by governments and school authorities to monitor and report performance, to allocate resources, to plan interventions, and to set targets for the future.
Now, there is delight and promise in this market, but there is also the top-level risk of buying something which creates more problems than it solves. A technology solution is for ever, not just for Christmas.
And despite the huge catalogue of models and options, and the contractual arrangements with suppliers to design, build and operate a bespoke solution to meet the specific needs of your system, it is still all a bit like getting home from Ikea. You have the instructions, the bits and pieces and the Allen key; if you find that the bookcase with seven shelves would have been better than the one with eight, then the only available solution involves more time, trouble and expense.
So, the headline strategic consideration for governments, assessment authorities, examination boards and regulators is to be clear about what they want the technology to do, why they want it done, how they want it done, and how the outcome is to be expressed.
I now turn to three areas of risk that need to be considered by ministers, by government and non-government school systems, and by curriculum and assessment authorities, when considering the procurement of some of this technology.
Regulation and quality assurance
Pencil and paper has been the only assessment technology in use in public examinations since education became compulsory in most Western societies in the second half of the nineteenth century. It has worked well for over 150 years, and in school education the students and their parents have confidence in it.
Reform and the supply side
The push for greater use of technology in assessment is not coming from parents who want their children to get into Law at Melbourne or Medicine at Edinburgh. Nor is it coming from work-place based learners who want on-the-job assessment and verification of performance. Despite the support we get from many when we explain the virtues of the new assessment and reporting technology, we need to recognise that this is not a demand driven reform, but essentially a supply side phenomenon. The push for reform is coming from the curriculum and assessment authorities, from the test development agencies, from the software developers and suppliers, from the hardware suppliers, from the school systems, from the education profession, from the organisations represented here today.
A fundamental strategic imperative of those of us in the vanguard of this reform must therefore be to bring the demand side with us. And the demand side includes not only students and their parents, business and industry, employers, and indeed the media, but also governments, which in certain respects are the most conservative of the demand-side interests.
Regulation and the demand side
The purpose of the regulation of assessment is essentially to protect the interests of the learner - the interests of the demand side - in matters of the maintenance of standards, the fairness of the examinations, the quality of the marking, the quality of the grading, and the national and international standing of the credential which is the outcome.
In England, the separation of the demand side from the supply side is clear. The syllabuses, assessment methods, marking and grading processes and issuing of results for the general qualifications are the responsibility of the five major UK awarding bodies, monitored by the QCA and the regulatory authorities in Wales and Northern Ireland, according to a very detailed Code of Practice which has the fundamental objective of providing a fair deal for every learner. The awarding bodies certainly have the interests of learners at heart, but they are squarely part of the supply side; the demand side champions are the QCA and the other regulatory authorities.
In Australia, there is no market to be regulated in the same sense. That has certain advantages and disadvantages. The curriculum and assessment authorities carry out the supply side functions of setting the syllabuses, setting the examinations, marking and grading the papers, and publishing the results. The extent of their further identification as the independent champion of the demand side depends upon their reputation and their governance arrangements. All of them have an in-house equivalent of the UK Code of Practice to regulate their own procedures and assure quality. They function in a significantly different environment from that in the UK, where increased levels of student performance are routinely seen by some sections of the media and the community as evidence of 'dumbing-down'.
For governments in the process of procuring or evaluating the case for procurement of technology-based assessment, there are two issues to consider in relation to regulation and quality assurance.
Security of the supply chain
The first is the security of the supply chain. Large-scale paper and pencil testing is a huge logistical operation, but it depends upon a quite linear and straightforward supply chain which runs from setting the paper to publishing the results. Outside suppliers might be contracted to provide some specific and discrete logistical steps in that process, e.g. data capture and processing. The awarding bodies in the UK are accountable for the actions of those suppliers, as are the assessment authorities in the Australian states and territories.
The effect of technology is not to make the supply chain run more smoothly, but to change it. Much effort in the current system is focused on the post-test stages such as marking, moderation, re-marking, grading and the processing of results; there is less upfront in test development. While not downplaying the critical importance of test development and pre-testing, most of the effort in terms of hours and labour is squeezed into the last two months.
Technology changes this. Both automated marking and on-screen marking reduce the major 'back-end' logistical effort in managing answer sheets and scripts; automated marking also removes the dependence on high volumes of markers, and removes the processes of marker recruitment, standardisation and grading.
At the same time, there is much more work to be done at the 'front-end': pre-testing, analysis of pre-test responses, production of the test items, and establishing the quality, standards and comparability of individual items. It is necessary to know how each item will perform before the test goes 'live'; the items are grade related, and there will be no opportunity for their performance to be adjusted later through an awarding or grading process.
Further, there will be increased and alternative routes through the system, depending on which e-assessment components a qualification uses. For example, one assessment may comprise four different elements: multiple choice, short answer, voice capture and essay-type responses, each of which may follow different automated and non-automated marking routes. Although each individual route might be simpler than the paper-based system, there is more complexity overall because of the combination of e-assessment components and the increased process options.
The present simplicity of a contracted supplier providing a specific good or service to an awarding body or assessment authority is thus much harder to maintain in the new world of technology. The e-assessment system partners will include e-test design suppliers, e-test service providers, e-portfolio service providers, technical support providers, scanning bureaus and testing centres - all jointly needing to be involved in many aspects of test design, usability, marking design and issues of comparability. Actions of all these players have the potential to impact on the final assessment for candidates. As the number of organisations and the complexity of their actions and responsibilities increases, how is government to ensure that it all works in the interests of the demand side - that is, that it ensures a fair deal for the learner? Clearly, there needs to be an e-Code of Practice.
Pressure to produce such a code comes not only from the need to protect the interests of the demand side, but because of the potential proliferation of incompatible products. An increasing number of awarding bodies and suppliers are developing e-tests and delivery solutions. Some delivery solutions are subject to exclusivity agreements with a single awarding body; others are not compatible for content, in that they are test items written for one supplier’s software only; others are not interoperable, in that the e-answers are not compatible between e-tests. No country wants to end up with a Betamax of an assessment system, if an equivalent to VHS has become the international norm.
This is widely recognised in the UK. There is work being done on an emerging, voluntary Question and Test Interoperability Standard (QTI); there is BS7988, which sets out some minimum requirements for organisations that use computers to make assessments; and there are standards regarding data security. But it is all pretty preliminary: there is still uncertainty about which commercial standard or product to back; there is no national 'preferred' delivery system; some awarding bodies understandably are deferring investment, in the expectation that a regulatory environment - or preferred national system - might be retrospectively imposed.
Quality assuring the ether
The notion of a national delivery system raises the second issue relating to regulation and quality assurance that needs to be considered in the procurement exercise. Can there be such a thing as a national delivery system in an environment in which e-learning and e-assessments are available electronically anywhere on the planet? How can one quality assure the ether?
Obviously, there cannot be national delivery systems, in the sense of monopolistic provision. The internet makes that impossible. But there can and must be regulation in the interests of quality assurance. There is potential in examining whether the appropriate strategy for that might be through cooperation by groups of nations such as OECD countries, to use their own national qualifications frameworks to that end.
Much excellent work is being done at the moment to build a European Qualifications Framework, which is a meta-framework through which the national qualifications frameworks of the various European countries can articulate. While envisaging each country maintaining its own framework for the qualifications delivered within its own jurisdiction, it will allow them to be mapped across to other frameworks and hence promote transferability and portability of qualifications and skills across Europe.
An appropriate strategy for e-regulation might be for each country to include in its qualifications framework those qualifications which have content and assessments originating from outside its jurisdiction, and which have been quality assured by the country itself, or by a partner country which is recognised as having similarly robust quality assurance procedures. Thus, qualifications which are currently distinctively English or Australian or German might be taught electronically and assessed electronically in a range of countries, along with a vastly increased range of vendor qualifications.
A levels at Wagga High School? The VCE at City of Westminster College? Are we up for that yet? Governments I suspect would see in it both risk and opportunity, depending on where they believe their present qualifications are positioned in the international food chain. It would be green light regulation - as distinct from red light regulation - to a degree never contemplated before. But as the status quo is no longer an option, what might be another alternative?
Logistics of scaling up
Scaling up from pilot to full operation is hugely demanding in both human and technical capability, and immensely harder than scaling up a paper-based system. Issues such as the protection of databanks from hackers and viruses; meeting accessibility and special needs requirements; providing business continuity and recovery arrangements in the event of ICT failure; and ensuring availability of the necessary hardware, software and technical skills for the delivery of e-assessment in schools, all need to be addressed in the scaling up process. And while the dimensions of these issues are understood, we have yet to see them resolved.
Standards and comparability
Governments and school systems are extraordinarily sensitive to even minor shifts in school performance figures.
The assessment authorities are expected by governments to hold examination and test standards absolutely constant from year to year. The height of the hurdle must not be raised nor lowered. The inevitable margin of error inherent in all assessment is held to an absolute minimum.
At the same time, the school system authorities are properly expected by governments to steadily increase levels of achievement, that is, the performance standards. Their job is to ensure that the number of young people leaping the hurdle increases year on year. Governments set targets for improvement in performance. Even small annual increases in performance are claimed as evidence of improved systemic performance; any decrease in performance is generally attributed to the assessment authority having raised the examination standard. In my own experience, any fall in performance from one year to the next has been well within the expected margin of error, as have been most annual increases in performance.
Transformational on-screen testing
It is of course extraordinarily difficult to compare the results of transformational tests with existing tests. They measure different things in different ways. Their development therefore will be directed to the assessment of new qualifications, or to assessing a current syllabus in a new and different way. Time-series comparisons of performance cannot continue once a test of this type has been introduced.
Further, it is not yet clear how a new time-series will be established using the results of transformational on-screen testing as a basis for comparison year on year, or for comparison of the results for the same subject in the same year between one assessment authority and another. That is of particular significance for statutory and high stakes assessments. It seems likely that for most large systems, which give quite legitimate priority to valid year on year comparisons in high stakes summative assessments, the difficulty of ensuring comparability will restrict the speed of implementation of forms of on-screen testing other than paper translation.
While this consideration might set the pace of adoption, it will not prevent it. As will be shown in the presentations and discussions which are part of this conference, there are abundant examples now of on-screen testing with modifications not replicable on paper. The new technology makes it possible to evaluate skills and knowledge not possible with existing paper and pencil tests.
What then do we say about the politically charged issue of comparability of standards and grades over time? Is a grade B awarded in 2005 the same as a grade B in 2010, or better? It is imperative for ministers that school authorities and assessment authorities provide clear advice on that issue right from the start, and that we spend time now translating the mysteries of the measurement industry into explanations which are understandable to the public.
On-screen marking
On-screen marking is in one sense the simplest and most benign of technological interventions. Instead of retrieving a rain-soaked copy of The Age or The Guardian from the letterbox or the doorstep, we can now have much of it delivered on-line. So, during the marking season - which in England is shortly before the grouse season - the kitchen tables of heads of mathematics in Hampshire and deputy principals on the Mornington Peninsula need no longer be littered with piles of scripts, but equipped with no more than a personal computer. No longer need bundles of scripts be delivered by post; the electronic distribution of written responses to questions is random; and markers deal with responses to particular questions, rather than mark each script as a whole. The latter means that the mark scheme for each item is followed more closely, because there is no capacity for a marker to be influenced by the global 'feel' of the script as a whole.
Therein lies an issue. On-screen marking against specified mark schemes, with early detection and correction of aberrant marking, greatly increased second marking, and the automatic addition of marks assigned to individual components of assessment, is likely to produce a result which is more valid and reliable than manual marking, except in situations where second examining and close supervision by senior markers are just as intense.
Being aware of the error inherent in manual marking, governments and education authorities seek to ensure that students are not disadvantaged. In some systems, this means that a second look is taken selectively at the quality of work of those who are just a mark or two below a grade, to ensure that they have not been deprived of the score they deserve. The second examiners commonly and understandably give many youngsters the benefit of the doubt. Such selective and targeted second examining will ensure that no student just below a grade is unfairly denied the grade, but it is not a process of overall quality control. It causes the otherwise smooth mark distribution curve to become serrated into a series of minor troughs and peaks at the grade boundaries.
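The serration effect can be illustrated with a deliberately exaggerated sketch. It assumes a flat distribution of ten scripts at every mark from 40 to 70, grade boundaries at 50, 60 and 70, and a rule that every script within two marks below a boundary is lifted to it. Real second examiners lift only some scripts, so the actual troughs and peaks are milder; all figures and names here are hypothetical.

```python
from collections import Counter


def selective_second_look(marks, boundaries, window=2):
    """Lift any mark lying within `window` marks below a grade
    boundary up to that boundary; leave all other marks unchanged."""
    adjusted = []
    for mark in marks:
        for boundary in boundaries:
            if boundary - window <= mark < boundary:
                mark = boundary
                break
        adjusted.append(mark)
    return adjusted


# Flat distribution: ten scripts at each mark from 40 to 70 inclusive.
marks = [m for m in range(40, 71) for _ in range(10)]
boundaries = [50, 60, 70]

after = Counter(selective_second_look(marks, boundaries))
```

After the adjustment no scripts remain at 48 or 49, while 30 sit at 50: a trough just below the boundary and a peak on it, repeated at each grade boundary in an otherwise smooth curve.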
The greater accuracy of on-screen marking means that such a check is no longer necessary. It also means that the results obtained by on-screen marking will be different to some small degree from results obtained from manual marking. That in turn means that there will be a disjunction in the time-series year-on-year with the change to on-screen marking. No matter how minor that 'blip' is, it will need to be explained to the public. The quite unfair allegation might be made that systems and governments have overstated previous performance levels, when in fact the results have been obtained by the best available processes at that time. Clearly, the adoption of on-screen marking needs to be accompanied by a communications and public relations strategy before and during implementation.
Public credibility
The third critical area to manage in the introduction of technologically based assessment is the issue of public acceptability.
Offshore marking
One of the properties of the new technology is that it can use the 24 hours in the day. Up till now, marking has been after work hours, sometimes during work hours, at weekends and in holidays. When scripts and marks are transmitted electronically, they can be marked around the clock in Surrey, Singapore and San Francisco.
Are we ready for that? Is England ready to have A Level Shakespeare marked in Toronto or Brisbane? For that matter, is any Australian state prepared to have another Australian state mark its history essays in answer to the question 'Who was the Father of Federation'?
There are clear financial imperatives for providers of electronic assessments to operate internationally. The investment required to support technology-based solutions is huge. Much of the innovation and assessment we are looking for will not be possible unless such investment is made. But it is also clear that no country can permit its marking to go offshore, unless it can account for the change in terms of it improving the quality of marking and thus being in the interests of the learner.
Our various jurisdictions therefore need strategies to explain to the public why it is of benefit to the learner that papers previously marked at home might now be partly marked overseas. In some countries a clear incentive is insufficient availability of local markers for the periods when they are required. It is patently in the interests of learners that marking is as accurate and reliable as possible; it is axiomatic that quality depends upon early detection and correction of poor marking; and it is clear that working electronically around the clock maximises the time which can be given to the marking process within the short national time frames that are generally the rule for completion of marking. The emphasis must be placed on the guarantee of quality rather than early return of results.
Critical to public credibility will be continued confidence in the process for reviewing and determining appeals against results, which is quite high with paper-based tests. In the e-environment - whether marked onshore or offshore - there will need to be a capacity to reproduce the electronic test responses of a candidate, to provide a proper audit trail of the marking and grading process, and to have the various test items and whole script reviewed. There will also need to be a capacity to return electronically to the school the scanned item responses that make up the script for each student, along with the marks assigned to each item. The maintenance of appeal procedures in their current transparency is a necessary condition for public acceptance.
Recruitment and training of markers
A second issue is the public perception of the marker: stereotypically a retired professor in a cottage in Kent or a mud-brick in Eltham, reading bundles of scripts and annotating each one in flourishes of red ink. It was always a convenient fiction. The providers of e-marking divide the item responses into those that do not require to be marked by an expert, and those that do. The first might be marked by computer or clerically. The latter will be marked by experts, but they might not be teachers, and they might not be graduates: they will be people trained in the use of the mark scheme who can demonstrate their capacity reliably to judge and discriminate in terms of it, and they will be monitored hourly or daily to ensure they are doing so.
I think there is immense opportunity to be far more positive about the quality and training of markers, and about the quality of the mark schemes now being used. From my experience of the awarding bodies in England that is certainly the case, and I’m sure the same is still true here. The ultimate public test is not whether the conventional stereotype is satisfied, but whether the quality of marking demonstrably continues to improve, as reflected in a decline in the number of appeals and in a decline in the proportion of those appeals actually upheld.
Change in the nature of the tests
It is commonly said that the increased opportunities for on-demand or when-ready testing will create a tendency for the delivery of smaller and smaller units of learning and assessment, and that it will therefore be imperative to test synoptically across whole sets of units to evaluate overall knowledge, competence and understanding. Similarly, the point is frequently made that the technology must serve the purpose of the test, and that there is a danger that in striving to make the tests computer based, some important objectives will be lost.
To us, these are obvious points and hardly worth making. We know how we will deal with them. But to the man on the Clapham omnibus, who is the arbiter of these things in Britain, and who has equivalents in every Australian state, we have not yet communicated our understanding of these quite valid fears nor given confidence that we can deal with them.
Is it worth the candle?
Now, the risk analysis I have presented is pretty heavy-duty. It is hard to shake off the legacy of eleven years as a NSW public servant. The question is, do the risks significantly outweigh the opportunities? Is it all worth the effort?
This conference is under the banner AR+t : Assessment, Reporting and Technology. Our strategic objectives are better served, I believe, by thinking instead of AR to the power of technology – that is, using technology to transform assessment and reporting, rather than applying it simply to the improvement of our existing assessment and reporting models.
Across the ground I have covered, the areas of greatest risk lie in pursuing those strategies which are simply AR+t, that is, the addition of technology to existing applications and processes: paper translation, item banks, electronic coursework marking, on-screen item marking, automated marking.
All of these procurement items involve incremental changes in existing processes rather than the embracing of a new assessment and reporting paradigm. Some of the technologies will prove to be single generation technologies, as did the fax machine. Together, they raise red-risk questions about the security of the current assessment regime, once it becomes electronic; about the need for an electronic rather than paper-based Code of Practice; about how governments might explain the inevitable blip in the time series of national performance in a subject; about offshore marking; about the recruitment and training of markers; and about whether the tests are being developed to fit the technology rather than the technology to fit the tests. They also raise the question of whether there is any real long-term benefit in automating imperfect processes and imperfect assessments.
Most risk lies in the apparently easiest solutions.
There is much less risk, and immensely greater gain, in pursuing strategies based on the concept of raising AR to the power of t: transformational on-screen testing; transformational question items and tasks; total learning portfolio management; process-based marking; and life-long learner access to systemic and personal data. There is no political downside in evaluating skills and knowledge not possible with existing pencil and paper tests, nor in establishing a new time series of performance targets against which to report them. Nor, provided such innovations are successfully piloted over time periods determined by the technical requirements and the provision of the necessary resources, is there any fundamental additional requirement for implementation other than high-level management skill.
In England, we are on track to have an on-screen transformational Key Stage 3 ICT test installed in all 4000 secondary schools, to assess the performance of all year 9 students on a statutory basis from 2008. It has already been trialled in 10 per cent of the schools. It represents a major change in the way tests are administered and how IT networks are configured and managed. It is set in a virtual world, and all marking is completed automatically. I know a great deal is also happening in the other countries from which there are people attending this conference.
I conclude by saying that the foundation of sound strategy in this area is to understand that it is simply not possible to move from AR+t to AR to the power of t. They represent quite separate and discrete products in the marketplace: one cannot migrate incrementally from one to the other. All of us have some immediate needs for the improvement of existing assessment and reporting, which can be satisfied by dipping into the AR+t basket, and managing the risk. But the real prize is in the other basket: the transformation of assessment and reporting, in the service of profoundly better education.
Is it worth the candle?
Absolutely.
Ken Boston
