Conceptual and Practical Challenges in InfoViz Evaluations

From BELIV 2010


Presentation

Notes

- two groups of challenges: (1) the design concept behind the system we evaluate, (2) details that matter a lot. Both are crucial to advancing our field.

- we are getting much better at doing evaluations, but concerns remain (e.g. Munzner 2008, paper on pitfalls in visualization)

- these concerns are shared by research in usability evaluation (Gray and Salzman, 1998) and by discussions of methodological concerns in CHI (e.g. Greenberg and Buxton, 2008).

- conceptual challenges

- win-lose evaluations: the outcome is mostly given before the evaluation, so it is obvious beforehand; throw away the well-selected tasks and the system will lose

- point designs: there is a huge sea of design possibilities; we pick one point and dive in, but do no mapping or laying out of results into the science space

- lack of clarity in key constructs: e.g. a paper that claims to evaluate a fisheye but has no real fisheye

- lack of theory-driven experiments: there are some in perceptual studies, but in general no real deep theories

Eg the concept of overview

- overview: definitions (read from slide), surprised there are relatively few definitions of overview

- bottom-up approach to get a definition: looked at 60 studies and found 3 uses:

-- technical (overview as interface component, e.g. overview+detail, O+D)

-- user centered (user forming an overview)

-- other (given an overview of...)

- does an overview give an overview? We have no clue. How to tackle this problem? Need to do more of this style of evaluation

Eg Fisheye interfaces

- results are quite mixed: some optimistic, some pessimistic (yes, Heidi and Tamara...)

- use fisheye in programming environments

- is a priori interest useful? Maybe, but it is hard for users to understand: a useful a priori algorithm is complex and hard to understand relative to distance. This interface concept has been around for 24-ish years but is mostly untested; there are a few hints, but very little progress has been made at the concept level

Strong inference

- Platt 1964: the notion of strong inference, with the steps:

-- devise alternative hypotheses, with equally probable ways of going down any branch; use more theory-driven experiments that look at equally probable hypotheses. Very few results report anything interesting: our great idea didn't work and didn't get published. Therefore we need good theoretical motivation so that either outcome is interesting.

-- devise a crucial experiment which will rule out some hypotheses

Radical solutions

- Newman 1994 reviewed CHI publications and found 25% radical solutions

- ran into a publication style for radical solutions: weird problem, nice interface, works in some way, then find a new weird problem and move on. Newman suggests carrying the work on, or verifying it

Practical challenges

- simple outcome measures

- studies mostly use binary task completion or relatively simple error measures; only very few used expert grading of products. More interested in comprehension/learning as an outcome

- simple process measures

- time is a summary measure of the process; how should we interpret task completion time?

- eDoc study: 20% higher task completion time with O+D; tried to look at which place in the text people had visible and made progression maps

- further exploration: people found answers and then kept doing things, so participants seemed to spend more time with O+D, but that extra time went into further exploration. O+D using more time is therefore not necessarily bad: it promotes checking etc. This made a case for much more complex interfaces
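
The point about interpreting completion time can be illustrated with a small sketch. It assumes, hypothetically, that each trial is logged as time-stamped events labelled `start`, `answer_found` and `submit`; these labels, the function name and the numbers are invented for illustration and are not taken from the eDoc study, which used progression maps of the visible text position instead.

```python
def split_completion_time(events):
    """Split one trial's completion time into search and further exploration.

    events: list of (timestamp_seconds, label) tuples in time order.
    The labels 'start', 'answer_found' and 'submit' are hypothetical.
    """
    first_seen = {}
    for timestamp, label in events:
        # Keep only the first occurrence of each label.
        first_seen.setdefault(label, timestamp)
    for required in ("start", "answer_found", "submit"):
        if required not in first_seen:
            raise ValueError(f"missing event: {required}")
    search = first_seen["answer_found"] - first_seen["start"]
    exploration = first_seen["submit"] - first_seen["answer_found"]
    return {
        "search": search,
        "further_exploration": exploration,
        "total": search + exploration,
    }


# Made-up logs: the O+D trial takes 20% longer in total,
# but the extra time falls after the answer was found.
baseline_log = [(0, "start"), (50, "answer_found"), (60, "submit")]
overview_detail_log = [(0, "start"), (48, "answer_found"), (72, "submit")]

baseline = split_completion_time(baseline_log)
overview_detail = split_completion_time(overview_detail_log)
```

With a split like this, a longer total time can be read as more checking after the answer rather than as slower search, which a single completion-time number cannot show.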

- standing on the shoulders of giants

- it is a challenge to select data, tasks, measures, interfaces. People in this community mostly evaluate their own interfaces. Independent researchers have found different effects than interested parties. Fixing and generating shared resources for at least some of these is crucial for repeating experiments.

- people keep inventing new facets of satisfaction, but we have no clue how these relate (items such as "I feel ...", "the system is ...", on a 1-5 scale)

- questionnaire use and questionnaire reliability: some studies use standardized questionnaires, some homegrown ones; the standardized ones have higher reliability
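
The notes do not say which reliability statistic was meant. One common choice for questionnaire reliability is Cronbach's alpha, so the sketch below assumes that measure; the function name and input layout are illustrative only.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a multi-item questionnaire.

    responses: one list per participant, holding that participant's
    score on each item (e.g. 1-5 ratings). All lists have equal length.
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Variance of each item across participants.
    item_variances = [
        pvariance([participant[i] for participant in responses])
        for i in range(n_items)
    ]
    # Variance of each participant's summed score.
    total_variance = pvariance([sum(participant) for participant in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
```

For example, `cronbach_alpha([[1, 2], [2, 1], [3, 3]])` takes three participants who each answered two items. Values closer to 1 indicate that the items vary together, which is what a higher reliability means under this assumption.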

Eg selecting tasks

- tasks are crucial: task effects are strong, other effects are weaker; task-level testing (Catherine)

- tasks are chosen ad hoc to match the evaluation, or habitually

- e.g. which tasks to use when evaluating overviews? Most studies used navigation tasks and then started to draw conclusions about psychological aspects. Monitoring tasks are rarely used

What to do?

- more strong, theoretically motivated comparisons (eg. TILCS?)

- more complex measures of outcome, process, coupled with richer data

Q and A

- more vigorous experimentation takes a long time, and technology in our field develops fast, so the interface is no longer modern by then. Solutions? Do experiments on an interface that is not the most modern but embodies the essence of the basic ideas (e.g. fisheye in a programming environment). Not a factorial psychology experiment. Real-world work with studies of adoption and integration is also important. Go for the more conceptual type of study.

- reproducibility of study results: evaluating your own work (your baby) is a bit tainted. Is it practical to build up two things? Dunno. Competition, don't evaluate your own system, pooled system (?). Would you, as a reviewer, accept a paper that replicates experiments? The large-scale repetition of graph perception (Heer) made a contribution, but repetition alone will mostly not work; have to be more clever. Hate experiments with only two interfaces: one will be your darling. Experiments that use more interfaces are more interesting. In any controlled experiment you can bias things any way you want, but more so in uncontrolled ones

- reviewers will reject repetition papers; need a separate venue. Novelty is cool and attracts attention. Journal?

- papers do not give enough info to repeat; make the data sets and tasks we use public to allow others to repeat experiments

- SEMVAST: all the datasets from VAST. As with the CHI97 tree browse-off, use the same dataset; SpaceTree code, data, and tasks are available

- higher-level question: how would you extend the proposal to corporate and industrial researchers? We would have to test our own products; difficult to share data? No clue. Conceptual replication is probably not relevant to industry? Industry needs to figure out if it works in real life.

- the replication issue is about the whole maturity of the field: if we could agree on fundamental issues or problems, we could push toward some sort of replication framework for central problems and work on harder problems. The field has to push toward replicating results. There are high-level problems that very few address. We know so little about interaction and visualization: how do you trade off one against the other? Before replications, try to attack higher-level problems.

- in other domains like architecture, replication is not important; the point is creating something new, and moving targets are hard to evaluate. Architecture builds on engineering principles; what are our principles?

- run experiments on a representative vis (e.g. treemap); how to tease apart interaction and visualization? Have some clue of things to do, but don't know enough about it to do them. Programming environment, think-aloud: what people want to do vs what they did. Not sure how to do it.

- situation awareness: well documented; people were interrupted during the task and asked to say stuff; nice procedure
