Guidelines for Evaluators

Thanks for your interest in evaluating research for the Unjournal!
Your evaluation will be made public and given a DOI, but you will have the option to remain anonymous or 'sign your review' and take credit. You will be compensated a minimum of $250 for your evaluation work, and will be eligible for financial 'most informative evaluation' prizes. See the guidelines below. You can submit your response in this form (Google Doc), and share it back with us.
Click HERE to directly make a new copy of this form.
If you wish, you can download the current page as a pdf HERE, with all folded boxes open. (updated 20 Mar 2023)
Payment amounts discussed, submitting claims
We aim to increase these payments going forward and are applying for funds to do this. "Pilot" reviewers will be given an extra payment for helping us evaluate the system. We may occasionally offer additional payments for specifically requested evaluation tasks.
How and why did we decide on these guidelines?

What we would like you to do

  1. Write a review: a ‘standard high-quality referee report’, with some specific considerations.
  2. Give quantitative metrics and predictions as requested in the two tables below, as appropriate.
  3. Answer a short questionnaire.

Writing the report

In writing your report, please consider the following:
Specific requests for focus/feedback
Please pay attention to anything our managers/editors specifically asked you to focus on. We may ask you to focus on specific areas of expertise; you do not need to address all aspects of the work. We may also forward specific feedback requests from authors.
The Unjournal's criteria
For the most part, this is like a 'standard journal review', but we have some particular priorities. See "Category explanations: what you are rating" for guidance. For example, we would like to prioritize impact and robustness over cleverness.
Remember: this review (and ratings) will be made public
Unless you were advised otherwise, it will be given a DOI, and hopefully it will enter the public research conversation. Note that the authors will be given two weeks to respond to reviews, before the evaluations, ratings and the responses are made public. You will be given a choice of whether you want to be publicly listed as an author of the review.
If you have questions or clarifications about the authors’ work, you can ask them these questions anonymously; we will facilitate it.
Publishing and signing reviews: considerations/exceptions
We are considering the best policy on signed reviews vs. single-blind reports going forward; for now, we give evaluators the option to choose. We may change this policy in the future. We may also make some exceptions to the public evaluations policy in the future; reviewers will be informed in advance.
  • We may give early-career researchers the right to veto the publication of very negative reviews. We will inform you in advance if this will be the case for your evaluation.
  • You can reserve some ‘sensitive’ content in your report to be shared with only the Unjournal management or only the authors, but we hope to keep this limited.
Suggestion to evaluators: The category metrics and the overall (holistic) assessment metrics outline our evaluation priorities. You may want to look at these metrics before writing your review, and then return to them afterwards.

A ‘standard high-quality referee report’

We are generally asking for a ‘standard high-quality referee report’ here; the sort of report an academic would write for a traditional high-prestige journal. We are asking for this, subject to some differences in priorities, which we discuss below, and subject to any particular requests the managing editor may communicate to you.
Length and time spent: This is, of course, up to you. We welcome detail, elaboration, and technical discussion.
Length and time: possible benchmarks
The Econometric Society recommends a 2-3 page referee report; Berk et al. suggest this is relatively short but that brevity is desirable. In a recent survey (Charness et al., 2022), economists report spending (median and mean) about one day per report, with substantial shares reporting ‘half a day’ and ‘two days’. We expect that reviewers tend to spend more time on papers for high-status journals, and when reviewing work closely tied to their own agenda.

Metrics: overall assessment, categories

Below: a 'completed example'. We will give evaluators a concise survey form with everything they need to fill out.
All metrics are explained below under "Category explanations: what you are rating".
*Note: "suggested weights"
**Note: Relevance to Global Priorities
For each question above, if it seems relevant, and you feel qualified to judge, please ...
  1. Give a rating from 0-100, considering the ‘what we are asking you to rate’ discussion provided. Try to follow the scale in "0-100 Metric explained", but specifically for this category.
  2. Optional: ‘Quantify how certain you are’ about this, either giving a 90% confidence/credible interval or using our scale as described below. (Please give a 90% CI or a confidence score, but not both.)
Suggested: 'Calibration training' for judging your own uncertainty
You may find it useful and interesting to try the app discussed HERE.
"The aim of the web app is to help you become “well-calibrated.” This means that when you say you’re 50% confident, you’re right about 50% of the time; when you say you're 90% confident, you're right about 90% of the time; and so on." This may help you improve your judgment in stating 90% confidence and credible intervals over your own beliefs.
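As a rough illustration of what "well-calibrated" means for 90% intervals, the sketch below computes how often past intervals contained the value that later turned out to be true. The function name and the example data are hypothetical, and this is not the method used by the app mentioned above; a well-calibrated evaluator would see a hit rate near 0.9 over many such intervals.

```python
def interval_hit_rate(intervals, outcomes):
    """Return the share of outcomes that fall inside their stated (lower, upper) interval."""
    if len(intervals) != len(outcomes) or not intervals:
        raise ValueError("need one outcome per interval, and at least one interval")
    hits = sum(
        lower <= outcome <= upper
        for (lower, upper), outcome in zip(intervals, outcomes)
    )
    return hits / len(intervals)


# Hypothetical past estimates: stated 90% intervals and the values later observed.
past_intervals = [(10, 30), (50, 70), (0, 5), (40, 90)]
past_outcomes = [25, 72, 3, 60]
print(interval_hit_rate(past_intervals, past_outcomes))  # prints 0.75
```

With only four intervals the hit rate is very noisy; the comparison with 90% only becomes informative over many estimates.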

Overall assessment

We see the 'overall assessment' as the most important measure. Please prioritize this.
Judge the work’s quality heuristically. Consider all aspects of quality, importance to knowledge production, and importance to practice. As noted above, we give ‘suggested weights’ (0-5) to suggest the importance of each category rating to your overall assessment, given the Unjournal's priorities. But you don't need, and may not want, to use these weightings precisely.
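If you want a rough cross-check of your heuristic overall rating, one option is a weighted average of your category ratings. The sketch below uses the suggested weightings stated in the category explanations (5, 5, 4, 3, 2, 0); the short category keys and the function name are hypothetical, and a weighted mean is only one possible way to combine ratings. As noted above, you need not use the weights precisely.

```python
from typing import Mapping

# Suggested weightings from the category explanations; the short keys are hypothetical labels.
SUGGESTED_WEIGHTS = {
    "knowledge": 5,          # 1. Advancing our knowledge and practice
    "methods": 5,            # 2. Methods
    "logic": 4,              # 3. Logic and communication
    "open_science": 3,       # 4. Open, collaborative, replicable science
    "real_world": 2,         # 5. Engaging with real-world, impact quantification
    "global_priorities": 0,  # 6. Relevance to global priorities
}


def weighted_benchmark(ratings: Mapping[str, float]) -> float:
    """Weighted mean of the category ratings supplied; unrated categories are skipped."""
    total_weight = 0
    weighted_sum = 0.0
    for category, rating in ratings.items():
        weight = SUGGESTED_WEIGHTS[category]
        total_weight += weight
        weighted_sum += weight * rating
    if total_weight == 0:
        raise ValueError("no rated category carries a positive weight")
    return weighted_sum / total_weight


# Hypothetical ratings for three of the six categories.
print(weighted_benchmark({"knowledge": 80, "methods": 60, "logic": 70}))  # prints 70.0
```

A large gap between this number and your heuristic overall rating may be worth a sentence of explanation in your report, but the heuristic judgment remains the measure we ask you to prioritize.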

0-100 Metric explained

The description folded below focuses on the "Overall Assessment". Please try to use a similar scale when evaluating the category metrics.
Top ratings (90-100)
95-100: Among the highest quality and most important work you have ever read.
90-100: This work represents a major achievement, making substantial contributions to the field and practice. Such work would/should be weighed very heavily by tenure and promotion committees, and grantmakers.
For example
  • Most work in this area in the next ten years will be influenced by this paper
  • This paper is substantially more rigorous or more insightful than existing work in this area in a way that matters for research and practice
  • The work makes a major, perhaps decisive contribution to a case for (or against) a policy or philanthropic intervention
Near-top (75-89) (*)
This work represents a strong and substantial achievement. It is highly rigorous, relevant, and well-communicated, up to the standards of the strongest work in this area (say, the standards of the top 5% of committed researchers in this field). Such work would/should not be decisive in a tenure/promotion/grant decision alone, but it should make a very solid contribution to such a case.
Middle ratings (40-59, 60-74) (*)
60-74.9: A very strong, solid, and relevant piece of work. It may have minor flaws or limitations, but overall it is very high-quality, meeting the standards of well-respected research professionals in this field.
40-59.9: A useful contribution, with major strengths, but also some important flaws or limitations.
Low ratings (5-19, 20-39) (*)
20-39.9: Some interesting and useful points and some reasonable approaches, but only marginally so. Important flaws and limitations. Would need substantial refocus or changes of direction and/or methods in order to be a useful part of the research and policy discussion.
5-19.9: Among the lowest quality papers; not making any substantial contribution and containing fatal flaws. The paper may be fundamentally addressing an issue that is not defined or obviously not relevant, or the content may be substantially outside of the authors’ field of expertise.
0-4: Illegible, fraudulent, or plagiarized. Please flag fraud, and notify us and the relevant authorities.
(*) 20 Mar 2023: We adjusted these ratings to avoid overlap
The previous categories were 0-5, 5-20, 20-40, 40-60, 60-75, 75-90, and 90-100. Some evaluators found the overlap in this definition confusing.
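The non-overlapping bands described above can be sketched as a simple lookup. The function name is hypothetical, and the handling of fractional scores is an assumption: each band is treated as starting at its lower bound and running up to the next band's lower bound, so a score such as 89.5 falls in the 75-89 band.

```python
# (lower bound, label) pairs, from the highest band to the lowest.
RATING_BANDS = [
    (90, "Top ratings (90-100)"),
    (75, "Near-top (75-89)"),
    (60, "Middle ratings (60-74)"),
    (40, "Middle ratings (40-59)"),
    (20, "Low ratings (20-39)"),
    (5, "Low ratings (5-19)"),
    (0, "0-4: illegible, fraudulent, or plagiarized"),
]


def rating_band(score: float) -> str:
    """Return the label of the band that contains a 0-100 rating."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, label in RATING_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every valid score is at least 0")


print(rating_band(78))    # prints Near-top (75-89)
print(rating_band(74.9))  # prints Middle ratings (60-74)
```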

The confidence rating

What are we looking for and why?

In considering how to weigh any measure or evaluation, it is important to quantify the uncertainty. That's why we ask you to provide a measure of this. You may feel comfortable giving your "90% confidence interval", or you may prefer to give a 'descriptive rating' of your confidence (from 'extremely confident' to 'not confident').
"1-5 dots": Explanation and relation to CIs
5 = Extremely confident, i.e., 90% confidence interval spans +/- 4 points or less
4 = Very confident: 90% confidence interval +/- 8 points or less
3 = Somewhat confident: 90% confidence interval +/- 15 points or less
2 = Not very confident: 90% confidence interval, +/- 25 points or less
1 = Not confident: (90% confidence interval +/- more than 25 points)
Remember, we would like you to give a 90% CI or a confidence rating (1-5 dots) but not both.
Example of Confidence dots vs CI
The example in the diagram above illustrates the proposed correspondence.
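The correspondence between dots and 90% interval widths listed above can also be sketched in code. The function names are hypothetical, and treating an asymmetric interval by its average half-width is an assumption; the guidelines only state the thresholds for symmetric "+/-" widths.

```python
def dots_from_half_width(half_width: float) -> int:
    """Map the half-width of a 90% interval (in rating points) to 1-5 confidence dots."""
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    # (largest half-width allowed, dots), checked from most to least confident.
    thresholds = [(4, 5), (8, 4), (15, 3), (25, 2)]
    for max_half_width, dots in thresholds:
        if half_width <= max_half_width:
            return dots
    return 1  # wider than +/- 25 points: "not confident"


def dots_from_interval(lower: float, upper: float) -> int:
    """Convenience wrapper: use half the total width of a (lower, upper) interval."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return dots_from_half_width((upper - lower) / 2)


print(dots_from_interval(60, 80))  # prints 3
```

For example, a rating of 70 with a 90% interval of 60 to 80 has a half-width of 10 points, which corresponds to 3 dots ("somewhat confident").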

Category explanations: what you are rating

Note, all of these criteria are scales rather than binaries.

1. Advancing our knowledge and practice

Suggested weighting: 5
("To what extent"...) does the project make a contribution to the field or to practice, particularly in ways that will be relevant to our other criteria?
Less weight to ‘originality and cleverness’...
‘Originality and cleverness’ should be weighted less than the typical journal, because the Unjournal focuses on impact. Papers that apply existing techniques and frameworks more rigorously than previous work and/or apply them to new areas in ways that provide practical insights for GP (global priorities) and interventions should be highly valued. More weight should be placed on contribution to GP than to the academic field.
Do the insights generated inform our (‘posterior’) beliefs about important parameters and about the effectiveness of interventions? Note that we do not require a substantial shift in our expectations.
Does the project leverage and incorporate recent relevant and credible work in useful ways?

2. Methods: Justification, reasonableness, validity, robustness

Suggested weighting: 5
Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are all of the given results justified in the 'methods discussion'?
Are the results/methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this (e.g., with at least a reasonable range of robustness checks, and at best ‘map the space’ of possible reasonable specifications)?
Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, pre-registration, multiple hypothesis testing corrections, and reporting flexible specifications.
Notes on 'methods'
We use the term “methods” here broadly; this may include choice/collection of data, experiment or survey design, statistical analysis, and simulation, among others.

3. Logic and communication

Coherent and clear argumentation, communication, reasoning transparency
Suggested weighting: 4
Are the goals/questions of the paper clearly expressed? Are concepts clearly defined/referenced?
Is the reasoning ‘transparent’? (See Open Philanthropy's guide on reasoning transparency.) Are all of the assumptions and logical steps made clear? Does the logic of the arguments make sense? Is the argument written well enough to make it easy to follow?
Is the data and/or analysis presented relevant to the arguments made? Are the stated conclusions/results consistent with the evidence (or theoretical results/proofs) presented? Are the tables/graphs/diagrams easy enough to understand in the context of the narrative (e.g., no errors in labeling)?

4. Open, collaborative, replicable science and methods

Suggested weighting: 3
A. Replicability, reproducibility, data integrity
Would another researcher be able to perform the same analysis and get the same results? Is the method and its details explained sufficiently, in a way that would enable easy and credible replication? For example, a full description of analysis, code and software provided, and statistical tests fully explained. Is the source of the data clear?
Is the necessary data made as widely available as possible, as applicable? Ideally, the cleaned data should also be clearly labeled and explained/legible.
Optional: Are we likely to be able to construct the output from the shared code (and data)? Note that evaluators are not required to run/evaluate the code; this is at your discretion. However, having a quick look at some of the elements could be helpful. Ideally, the author should give code that allows easy, full replication, for example, a single R script that runs and creates everything, starting from the original data source, and including data cleaning files. This would make it fairly easy for an evaluator to check. For example, see this taxonomy of ‘levels of computational reproducibility’.
B. Consistency:
Do the numbers in the paper (and code output, if checked) make sense? Are they internally consistent throughout the paper?
C. Useful building blocks:
Do the authors provide tools, resources, data, and outputs that are likely to enable and enhance future work and meta-analysis?

5. Engaging with real-world, impact quantification; practice, realism, and relevance

Suggested weighting: 2
Does the paper consider the real-world relevance of the arguments and results presented, perhaps engaging policy and implementation questions?
Is the setup particularly well-informed by real-world norms and practices? “Is this realistic; does it make sense in the real world?”
Optional, desirable, invited:
Authors might be encouraged and should be rewarded for the following.
  • Do the authors communicate their work in ways policymakers and decision-makers are likely to understand (perhaps in a supplemental ‘non-technical abstract’), without being misleading and oversimplifying?
  • Do the authors present practical ‘impact quantifications’ such as cost-effectiveness analyses, or provide results enabling these?
In the future, we may be able to pay them to do the above, if grant funding permits.

6. Relevance to global priorities**

Suggested weighting: 0. Why? The management team has already considered this work and evaluated it as relevant to global priorities, before passing it to evaluators. Nonetheless, we would like your informed assessment (and discussion).
Is this topic, approach, and discussion potentially useful to global priorities research and interventions? Does it help us evaluate what to prioritize for interventions and policy, improve interventions and policy, or improve our research and knowledge capacity for these?

Journal/Prediction metrics

We would like to benchmark our evaluations against 'how research is currently judged.' We want to provide a bridge between the current 'accept or reject' system and an evaluation-based system. We want our evaluations to be taken seriously by universities and policymakers. Thus, we are asking you for two predictions in the table below.
Journal/Prediction metrics | Predict: journal quality* (0.0-5.0) | 90% CI | Confidence (alt.)
What ‘quality journal’ do you expect this work will be published in? | | lower, upper |
Overall assessment on ‘scale of journals’; i.e., quality level of journal it should be published in. | | lower, upper |
*Note: 0= lowest/none, 5= highest/best. See below for some benchmarks and guidelines.
*To better understand what we are asking here, please consult the subsections below: "Journal metrics," "What quality journal...," and "Overall assessment on ‘scale of journals'"

Journal metrics

For the ‘prediction’ questions above, we are asking for a ‘journal quality rating prediction’ from 0.0 to 5.0. You can specify up to 2 digits (e.g., “4.4” or “2.0”). We are using this 0-5 metric here (rather than 0-100) as we suspect it is more familiar to academics.
The metrics are:
0/5: Marginally respectable/Little to no value. Not publishable in any journal with scrutiny or credible WP series, not likely to be cited by credible researchers
1/5: OK/Somewhat valuable journal
2/5: Marginal B-journal/Decent field journal
3/5: Top B-journal/Strong field journal
4/5: Marginal A-journal/Top field journal
5/5: A-journal/Top journal
We give some example journals HERE that may correspond to the above, based on SJR and ABS ratings.

What ‘quality level journal’ do you expect this work to ultimately be published in?

What if this work has already been 'peer-review published'?
The question above presumes that this work has not already been published in a peer-reviewed journal. However, we are planning to commission at least some post-publication review going forward. If the work has already been ‘peer-review-published’ you can either:
  • Skip this question but please still answer the next prediction question, the Overall assessment on ‘scale of journals' or
  • Answer a related question (not a prediction): “Suppose this paper were submitted to journals, in succession, from the top tier downwards. Imagine there is some randomness in this process. Consider all possible ‘random draws of the world’. In the ‘median draw,’ what ‘quality level journal’ would this paper be published in?”
"As if you were advising an author"
In presenting your prediction and confidence interval for this, you might want to consider if you were offering advice to an author:
“What journal would be likely to publish this work?”
“What is the most prestigious journal that would consider publishing this?”
“What is the least prestigious journal that the authors should consider submitting this to?” I.e., “I wouldn't go lower, even if I were risk-averse.”
Reprising the confidence intervals for this new metric
From 'five dots' to 'one dot'...
5 = Extremely confident, i.e., 90% confidence interval spans +/- 4 points or less*
4 = Very confident: 90% confidence interval +/- 8 points or less
3 = Somewhat confident: 90% confidence interval +/- 15 points or less
2 = Not very confident: 90% confidence interval, +/- 25 points or less
1 = Not confident: (90% confidence interval +/- more than 25 points)

Overall assessment on ‘scale of journals'

Consider the scale of journals described above. Suppose that:
1. the journal process was fair, unbiased, and free of noise, and that status, social connections, and ‘lobbying to get the paper published’ didn’t matter, and
2. journals assessed research according to the category metrics we discussed above.
In such a case, what ‘quality level journal’ would and should this research be published in, in its current form or with minor revisions?

Survey questions

For the questions below, we will publish your responses and review unless you ask us to keep them anonymous.
  1. How long have you been in this field?
  2. How many proposals and papers have you evaluated? (For journals, grants, and other peer-review.)
Your answers to the questions below will not be made public:
  1. How would you rate this template and process?
  2. Do you have any suggestions or questions about this process or the Unjournal? (We will try to respond, and incorporate your suggestions.) [Open response]
  3. Would you be willing to consider evaluating a revised version of this project?

How to write a good review (general conventional guidelines)

Some general key points to consider
  • Cite evidence and reference specific parts of the research when giving feedback.
  • Try to justify your critiques and claims in a reasoning-transparent way, rather than merely ‘passing judgment’.
  • Provide specific, actionable feedback to the author where possible.
  • When considering the authors’ arguments, consider the most-reasonable interpretation of what they have written (and state what that is, to help the author make their point more clearly). See ‘steelmanning’.
  • Be collegial and encouraging, but also rigorous. Criticize and question specific parts of the research without suggesting criticism of the researchers themselves.
We are happy for you to use whichever process and structure you feel comfortable with when writing a peer review.
One possible structure:
  • Assign an overall score based on quantitative metrics (possible: brief discussion of these metrics).
  • Summarize the work and issues, and the research in context to convey your understanding and help others understand it.
  • Highlight positive aspects of the paper, strengths and contributions.
    • Assess the contribution of the work in context of existing research.
  • Note major limitations and potential ways the work could be improved; where possible, reference methodological literature and discussion, and work that ‘does what you are suggesting’.
  • Discuss minor flaws and their potential revisions.
    • You are not obliged/paid to spend a great deal of time copy-editing the work. If you like, you can give a few specific suggestions and then suggest that the author look to make other changes along these lines.
  • Offer suggestions for research agendas, increasing the impact of the work, incorporating the work into global priorities research and impact evaluations, and enhancing future work.
Remember: The Unjournal doesn’t “publish” and doesn’t “accept or reject”. So don’t give an “Accept, Revise and Resubmit, Reject, etc.” recommendation. We just want quantitative metrics, some written feedback, and some relevant discussion.
'This paper is great, I would accept it without changes, what should I write/do?'
We still want your evaluation and ratings. Some things to consider as an evaluator in this situation:
  1. We still want your quantitative ratings and predictions.
  2. A paper/project is not only a good to be judged on a single scale. How useful is it, and to whom or what? We'd like you to discuss its value in relation to previous work, its implications, what it suggests for research and practice, etc.
  3. Even if the paper is great...
    • Would you accept it in the “top journal in economics”? If not, why not?
    • Would you hire someone based on this paper?
    • Would you fund a major intervention (as a government policymaker, major philanthropist, etc.) based on this paper alone? If not, why not?
  4. What are the most important and informative results of the paper?
  5. Can you quantify your confidence in these 'crucial' results, and their replicability and generalizability to other settings? Can you state your probabilistic bounds (confidence or credible intervals) on the quantitative results (e.g., 80% bounds on QALYs/DALYs/or WELLBYs per $1000)?
  6. Would any other robustness checks or further work have the potential to increase your confidence (narrow your belief bounds) in this result? Which?
  7. Do the authors make it easy to reproduce the statistical (or other) results of the paper from shared data? Could they do more in this respect?
  8. Communication: Did you understand all of the paper? Was it easy to read? Are there any parts that could have been better explained?
    • Is it communicated in a way that would be useful to policymakers? To other researchers in this field, or in the general discipline?

Writing referee reports: resources and benchmarks

Open Science
PLOS (Conventional but open access, simple and brief)
Peer Community In... Questionnaire (Open-science-aligned, perhaps less detail-oriented than we are aiming for)
Open Reviewers Reviewer Guide (Journal-independent “Pre-review”; detailed, targets ECRs)