Grouping

Here we finally start using donorschoose.org data.

As you know already, donorschoose.org is a crowdfunding website public school teachers can create projects with a monetary goal that help teach their students. Donors make donations to those projects, some of which meet the goal, and some don't.

From now on, you will pretend that donorschoose.org is your business, and that you want to improve the business in any sense. How do you want to improve donorschoose.org? What information will you need from the data to achieve the improvement? That's the first question you have to answer to complete Assignment 2 (in group).

GROUP BY

Let's assume that I want to have projects for high-poverty areas more funded. To do so, I probably need to know the characteristics of the projects for high-poverty areas. How do you know this? One good way of doing this is comparing projects for high-poverty areas with other projects. To do so, we should group the data by the poverty_level of the areas for which projects are created. Do we have poverty_level column in our projects table? Yes:

mysql> SHOW COLUMNS FROM projects;

How do you group rows into poverty_level? You should do this:

GROUP BY poverty_level

Now, say that we are interested in averages of the size of donations, the number of projects, the number of donors. You are doing this:

mysql> SELECT poverty_level, count(*) as num_projects, AVG(total_price_excluding_optional_support),  AVG(num_donors)
        FROM projects
        GROUP BY poverty_level;

Let's say we are also interested in the completion rate:

mysql> SELECT sum(funding_status='completed') / count(*)
        FROM projects
        GROUP BY poverty_level;

Conditions with GROUP BY

Above might be wrong because there are projects that are still alive:

mysql> SELECT sum(funding_status='completed') / count(*)
        FROM projects
        WHERE funding_status <> "live"
        GROUP BY poverty_level;

Say you are not interested in the groups whose cases are less than 500:

mysql> SELECT poverty_level, sum(funding_status='completed') / count(*)  as completion_rate, count(*) as case_num
        FROM projects
        WHERE funding_status <> "live"
        GROUP BY poverty_level
        HAVING case_num > 500;

Last updated

Was this helpful?