Aggregating patent data at regional level

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

misslord
Posts: 4
Joined: Wed Feb 22, 2023 6:19 pm

Aggregating patent data at regional level

Post by misslord » Tue Mar 07, 2023 11:00 am

Hi,
I would like to have some clarifications about the following query with which download patent data aggregated at regional level for European countries in selected IPC classes:

Code: Select all

SELECT a.appln_filing_year, p.nuts, nuts_label, COUNT(a.appln_id) AS NumberOfPatentApplications
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
JOIN tls209_appln_ipc i ON a.appln_id = i.appln_id
JOIN tls904_nuts n on p.nuts = n.nuts and N.nuts_level <=3
WHERE i.ipc_class_symbol in ('B60L   7/00', 'B60L   7/02', 'B60L   7/04', 'B60L   7/06', 'B60L   7/08', 'B60L   7/10', 'B60L   7/12', 'B60L   7/14', 'B60L   7/16', 'B60L   7/18', 'B60L   7/20', 'B60L   7/22', 'B60L   7/24', 'B60L   7/26', 'B60L   7/28', 'B60L  50/00', 'B60L  50/10', 'B60L  50/11', 'B60L  50/12', 'B60L  50/13', 'B60L  50/14', 'B60L  50/15', 'B60L  50/16', 'B60L  50/20', 'B60L  50/30', 'B60L  50/40', 'B60L  50/50', 'B60L  50/51', 'B60L  50/52', 'B60L  50/53', 'B60L  50/60', 'B60L  50/61', 'B60L  50/62', 'B60L  50/64', 'B60L  50/70', 'B60L  50/71', 'B60L  50/72', 'B60L  50/75', 'B60L  50/90', 'B60L  53/00', 'B60L  53/10', 'B60L  53/12', 'B60L  53/122', 'B60L  53/124', 'B60L  53/126', 'B60L  53/14', 'B60L  53/16', 'B60L  53/18', 'B60L  53/20', 'B60L  53/22', 'B60L  53/24', 'B60L  53/30', 'B60L  53/302', 'B60L  53/31', 'B60L  53/34', 'B60L  53/35', 'B60L  53/36', 'B60L  53/37', 'B60L  53/38', 'B60L  53/39', 'B60L  53/50', 'B60L  53/51', 'B60L  53/52', 'B60L  53/53', 'B60L  53/54', 'B60L  53/55', 'B60L  53/56', 'B60L  53/57', 'B60L  53/60', 'B60L  53/62', 'B60L  53/63', 'B60L  53/64', 'B60L  53/65', 'B60L  53/66', 'B60L  53/67', 'B60L  53/68', 'B60L  53/80', 'B60L  55/00', 'B60L  58/00', 'B60L  58/10', 'B60L  58/12', 'B60L  58/13', 'B60L  58/14', 'B60L  58/15', 'B60L  58/16', 'B60L  58/18', 'B60L  58/19', 'B60L  58/20', 'B60L  58/21', 'B60L  58/22', 'B60L  58/24', 'B60L  58/25', 'B60L  58/26', 'B60L  58/27', 'B60L  58/30', 'B60L  58/31', 'B60L  58/32', 'B60L  58/33', 'B60L  58/34', 'B60L  58/40', 'B63B   1/00', 'B63B   1/02', 'B63B   1/04', 'B63B   1/06', 'B63B   1/08', 'B63B   1/10', 'B63B   1/12', 'B63B   1/14', 'B63B   1/16', 'B63B   1/18', 'B63B   1/20', 'B63B   1/22', 'B63B   1/24', 'B63B   1/26', 'B63B   1/28', 'B63B   1/30', 'B63B   1/32', 'B63B   1/34', 'B63B   1/36', 'B63B   1/38', 'B63B   1/40', 'C10L   5/40', 'C10L   5/42', 'C10L   5/44', 'C10L   5/46', 'C10L   5/48', 'C12P   7/00', 'C12P   7/02', 'C12P   7/04', 'C12P   7/06', 'C12P   7/08', 'C12P   7/10', 'C12P   7/12', 'C12P   7/14', 'C12P   7/16', 'C12P   7/18')
AND left (n.nuts,2)  in ('NL', 'EE', 'BE', 'FR', 'ES', 'PT', 'UK', 'IE', 'IS', 'NO', 'SE', 'FI', 'LV', 'LT', 'PL', 'DE', 'LU', 'CH', 'AT', 'IT', 'CZ', 'SK', 'HU', 'SI', 'HR', 'RO', 'RS', 'BG', 'ME', 'MK', 'AL', 'EL', 'TR', 'CY', 'LI', 'MT')
AND invt_seq_nr > 0  -- only INVENTORS
AND a.appln_filing_year BETWEEN 2010 AND 2022
GROUP BY a.appln_filing_year, p.nuts, nuts_label
ORDER BY a.appln_filing_year, p.nuts, nuts_label
Firstly, I am not able to access data at the nuts2 level because if I change the nuts2 level in 2, it substitutes only country data (for ex., nuts= AT).
But more important, given that I get also country aggregate data, I have noticed that the number of applications for a country is not equal to the sum of the number of applications of its provinces.
So, I would like to understand if there is a missing assignment of patents to EU regions and it is more conveniente for me download data at micro level.

Thank you in advance.


EPO / PATSTAT Support
Posts: 426
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Aggregating patent data at regional level

Post by EPO / PATSTAT Support » Tue Mar 07, 2023 6:19 pm

Hello misslord,
there are a couple of issues that lead to your results. From a syntax point of view, your query is correct (no errors when you run it), but it does not give you the expected results.
One has to keep in mind that when you join tables in SQL, the database simply makes a bigger and bigger table taking into account all the data elements of all the tables that you have joined.
As an example, lets just look at 1 single application which has 65 Inventors. If you run the query below, you will see that it generates 65 rows; one row for each of the 65 inventors.

Code: Select all

SELECT *
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245
Analysing the results, you will see that for 4 inventors, we do not have a NUTS code because they don't have an address in Europe. When I now join the tls904_nuts table to get the names of the NUTS areas, you will see that those inventors are gone. (we can add them again by using a "LEFT JOIN tls904_nuts n on p.nuts = n.nuts", which would then create so called NULL values for those data items that do not have a match in the joined table.

Code: Select all

SELECT *
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
JOIN tls904_nuts n on p.nuts = n.nuts 
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245 
You could in principle now do a manual count on the result table and see how many inventors we have for each NUTS code. And if we are counting applications, we would count the (1 !) application for each of the nuts codes.

Code: Select all

SELECT p.nuts, nuts_label, count(a.appln_id) applications
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
JOIN tls904_nuts n on p.nuts = n.nuts 
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245
group by p.nuts, nuts_label
ORDER BY applications desc
Looking at the result, as a researcher you should ask yourself if this is what you want. (we are not yet looking at the different NUTS levels.) You can see that the sum of all the counts = 61,one for each inventor for whom we have a NUTS code. (If you use a LEFT JOIN you will see that there are 4 inventors that we can not assign to a NUTS code. .-->NULL values.)
There are 2 ways to do it differently. You can give a single 1 count to each NUTS area where there is at least 1 inventor. That is simply done by changing the count(a.appln_id) into count(distinct (a.appln_id)) Each NUTS area gets a 1 for the simple fact that at least 1 inventor is from that area.
Another method is to weigh the results, so that the sum for this single application is 1, and each of the
NUTS codes will gets its share depending on the number of inventors we have from that area. Keeping in mind that for some inventors I don't have a NUTS code, the easiest is now to use a LEFT JOIN, and to give the "NULL area" its share by using the attribute nb_inventors from the tls201_appln table.

Code: Select all

SELECT p.nuts, nuts_label,  sum(1.00 /nb_inventors) applications_share
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
LEFT JOIN tls904_nuts n on p.nuts = n.nuts 
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245
group by p.nuts, nuts_label
ORDER BY applications_share desc
In your query, you also had a join with the IPC table in order to only retain applications having a certain ipc_code.

Code: Select all

SELECT p.nuts, nuts_label,  sum(1.00 /nb_inventors) applications_share
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
LEFT JOIN tls904_nuts n on p.nuts = n.nuts 
join tls209_appln_ipc on a.appln_id = tls209_appln_ipc.appln_id
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245
group by p.nuts, nuts_label
ORDER BY applications_share desc
You will see that the results of the weighing have changed. This application has 4 IPC codes, so each of the 65 original rows is replicated 4 times, and the weighing as well. The total sum is now 4, and that is not what we want. Depending on your research, the best practice is to specify the IPC conditions via the WHERE clause if you don't have the intention to look into aggregated data at NUTS/IPC level. (something like: what area is strong in what kind of technology based ON IPC would require a different approach.)

Code: Select all

SELECT p.nuts, nuts_label,  n.nuts_level, sum(1.00 /nb_inventors) applications_share
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id
JOIN tls206_person p ON p.person_id = p_a.person_id
LEFT JOIN tls904_nuts n on p.nuts = n.nuts 
WHERE invt_seq_nr > 0  -- only INVENTORS
and a.appln_id = 527685245
and a.appln_id in (select appln_id FROM tls209_appln_ipc where ipc_class_symbol in ('H04L  29/08'))
group by p.nuts, nuts_label, n.nuts_level
ORDER BY applications_share desc
If you now want the data aggregated at NUTS 2 level, we run a bit into an issue as we do not have the NUTS data for all inventors. We can aggregate NUTS3 (5 digits )to NUTS 2 (4 digits) , but we can not move level 0 (or 1) to level 2. As a result, the sum of the weighted values will not be 1 anymore if you remove Level 0 and Level 1 codes. In the query below, I kept the level 0 and Null values, but you can easily remove them manually, or by adding something like "having len(left(p.nuts,4)) =4" between the GROUP BY and ORDER BY clause. (not very elegant but it does the job without increasing complexity.)

Code: Select all

SELECT left(p.nuts,4) , temp.nuts_label,   sum(1.00 /nb_inventors) applications_share
FROM tls201_appln a
JOIN tls207_pers_appln p_a on a.appln_id = p_a.appln_id and nb_inventors > 0
JOIN tls206_person p ON p.person_id = p_a.person_id
LEFT JOIN (select nuts, nuts_level, nuts_label from tls904_nuts where nuts_level <3) temp   on left(p.nuts,4) = temp.nuts
WHERE invt_seq_nr > 0  and nb_inventors > 0 -- only INVENTORS
and a.appln_id = 527685245
and a.appln_id in (select appln_id FROM tls209_appln_ipc where ipc_class_symbol in ('H04L  29/08'))
group by left(p.nuts,4), nuts_label
--having len(left(p.nuts,4)) =4
ORDER BY applications_share desc;
You can see that Budapest and Pest have been grouped into Közép-Magyarország and Östergötlands län together with Uppsala län have been grouped into Östra Mellansverige. The sum is still 1 (including the lower levels 0 and NULL). You can now use the above query with your complete set of patents based on the IPC codes.
And further:
A couple of label names seem to be missing (will look into that).
For some addresses we do not have the NUTS codes assigned yet (the NUTS codes are assigned based on the previous PATSTAT release.) There is no easy fix to that.
And there are also applications for which we don't have the inventor data at all. (According to the EPC, inventors can have their names removed from the patent publication.) To avoid a "divide by zero fault" on weighing of applications without inventors, we have to exclude applications that have no inventor data from the sample. (This is done in the above query via the tls201_appln join.)
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Post Reply