Background: In assessing communication outcome in people following stroke, the emphasis has moved from impairment to its consequent effects on functional activity and participation in society. Alongside this has come an increasing focus on conversation. Conversation is a socially vital tool, but its evaluation by speech and language therapists (SLTs) is not yet routine; detailed conversation analysis is time-consuming and not easily quantified. In addition to the practical problems of assessing conversation, there are questions about its reliability as a basis for measurement.

Aims: Our context was the need for a measure of functional communication within a large-scale randomised controlled trial of therapy for people with communication difficulty after stroke. The aim was to test the reliability of a clinically feasible procedure for collecting and rating a conversation sample.

Methods and Procedures: Participants were 102 people who had had a stroke causing communication problems (aphasia and/or dysarthria) within the previous 4-12 months. Their mean age was 68 years; all were previously fluent English speakers with no pre-existing progressive dementia or learning disability. Participants were videoed in conversation with an unfamiliar partner following a framework script, which was used as necessary to obtain a 10-minute sample. Each participant was videoed twice within a 2-week period. Videos were rated by unfamiliar specialist SLTs using the aphasia/dysarthria activity scale of the Therapy Outcome Measure (TOM). Measures of intra-rater, inter-rater, and conversation sample reliability were obtained.

Outcomes and Results: Intra-rater agreement was high: 93% of ratings were within a half-point of each other on the 6-point TOM scale, and the intra-class correlation (ICC) was 0.92. Inter-rater agreement was slightly lower, with 77% of ratings within a half-point and an ICC of 0.83. Conversation sample reliability was equally good, with 78% of ratings within a half-point and an ICC of 0.82. With training in the use of the TOM rating scale, even higher levels of agreement would be expected.

Conclusions: Our findings support the use of the TOM activity scale by an unfamiliar observer to rate a short conversation as part of outcome measurement. The use of independent expert SLTs to provide TOM activity level ratings on structured conversation samples with an unfamiliar partner reduced the variability known to affect judgements of conversation, and showed promise as a clinically feasible, socially relevant and reliable measure.
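
The two agreement statistics reported under Outcomes and Results (the percentage of paired ratings within a half-point, and the ICC) can be illustrated with a minimal sketch. This is not the study's analysis: the abstract does not state which ICC model was used, so a one-way random-effects ICC(1,1) is assumed here, and the function names and the six pairs of ratings are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean


def proportion_within(first, second, tolerance=0.5):
    """Share of paired ratings that differ by no more than `tolerance` points."""
    pairs = list(zip(first, second))
    close = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return close / len(pairs)


def icc_one_way(first, second):
    """One-way random-effects ICC(1,1) for two ratings per video.

    Assumption: the ICC model is not named in the abstract, so this
    form is used for illustration only.
    """
    pairs = list(zip(first, second))
    n, k = len(pairs), 2  # n videos, k ratings per video
    grand_mean = mean(list(first) + list(second))
    video_means = [mean(pair) for pair in pairs]

    # Between-video and within-video mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in video_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for pair, m in zip(pairs, video_means) for x in pair
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical TOM activity ratings (0 to 5 in half-point steps) for six videos,
# each rated on two occasions. These are NOT data from the study.
first_rating = [3.0, 2.5, 4.0, 1.5, 3.5, 2.0]
second_rating = [3.0, 3.0, 4.0, 2.5, 3.5, 2.0]

print(f"Within a half-point: {proportion_within(first_rating, second_rating):.0%}")
print(f"ICC(1,1): {icc_one_way(first_rating, second_rating):.2f}")
```

The same two functions apply to each of the three comparisons in the study design: the same rater on two occasions (intra-rater), two raters on the same video (inter-rater), and the same rater on the two conversation samples from one participant.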