Problem: CPC-classes, applicants & DOCDB families

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

lennartschott
Posts: 6
Joined: Mon Jun 01, 2020 3:45 pm

Problem: CPC-classes, applicants & DOCDB families

Post by lennartschott » Mon Jun 01, 2020 3:59 pm

Hey everyone,

My concern is the following: I'm looking for patents that involve a certain CPC class. I would now like to obtain via PATSTAT all patents (at the level of the DOCDB families) on which this technology class is present. I would also like to receive all participating patent applicants and their country code. In addition, I would like to retrieve all CPC classes that can be found on the patent with the CPC class I am looking for. I have used the following code, but do not get the desired result:

Code: Select all

SELECT tls225_docdb_fam_cpc.docdb_family_id, tls201_appln.docdb_family_size, tls201_appln.appln_id, tls201_appln.appln_filing_year, tls201_appln.appln_auth, tls201_appln.nb_applicants, tls201_appln.nb_inventors, tls206_person.person_id, tls206_person.person_name, tls206_person.person_ctry_code, tls225_docdb_fam_cpc.cpc_class_symbol  
FROM tls201_appln
JOIN tls207_pers_appln ON tls201_appln.appln_id=tls207_pers_appln.appln_id
JOIN tls206_person ON tls207_pers_appln.person_id= tls206_person.person_id
JOIN tls225_docdb_fam_cpc ON tls201_appln.docdb_family_id=tls225_docdb_fam_cpc.docdb_family_id
WHERE tls201_appln.appln_filing_year>=2010
AND tls225_docdb_fam_cpc.docdb_family_id in (select distinct docdb_family_id from  tls225_docdb_fam_cpc where cpc_class_symbol =  'H04W   4/046')
AND (applt_seq_nr>0)
order by tls225_docdb_fam_cpc.docdb_family_id, tls201_appln.appln_id, cpc_class_symbol, person_name
Thanks in advance!

All the best,
Lennart


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Problem: CPC-classes, applicants & DOCDB families

Post by EPO / PATSTAT Support » Tue Jun 02, 2020 12:28 pm

Hello Lennart,
there is a small typo in your query; CPC codes need to have 8 characters before the / .
This means that one has to add "spaces" to make it 8 characters.
See query below, I also used the STRING_AGG function and a GROUP BY to reduce the number of rows, and de_duplicate the CPC codes.
But you can of course change this back if you need a separate line for each CPC code.

Code: Select all

SELECT  tls201_appln.docdb_family_id, tls201_appln.docdb_family_size, tls201_appln.appln_id, tls201_appln.appln_filing_year,
tls201_appln.appln_auth, tls201_appln.appln_nr, tls201_appln.nb_applicants, tls201_appln.nb_inventors, 
tls206_person.person_id, tls206_person.person_name, tls206_person.person_ctry_code,
STRING_AGG(CAST( CPC.cpc_class_symbol AS VARCHAR(MAX)), ',') CPC
FROM tls201_appln
JOIN tls207_pers_appln ON tls201_appln.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ON tls207_pers_appln.person_id= tls206_person.person_id
JOIN (select docdb_family_id , cpc_class_symbol from tls225_docdb_fam_cpc group by docdb_family_id , cpc_class_symbol  ) CPC ON tls201_appln.docdb_family_id = CPC.docdb_family_id
WHERE tls201_appln.appln_filing_year >= 2010
AND cpc.docdb_family_id in (select distinct docdb_family_id from tls225_docdb_fam_cpc where cpc_class_symbol = 'H04W   4/46')
AND (applt_seq_nr > 0)
group by tls201_appln.docdb_family_id, tls201_appln.docdb_family_size, tls201_appln.appln_id, tls201_appln.appln_filing_year,
tls201_appln.appln_auth, tls201_appln.appln_nr, tls201_appln.nb_applicants, tls201_appln.nb_inventors, 
tls206_person.person_id, tls206_person.person_name, tls206_person.person_ctry_code
order by tls201_appln.docdb_family_id, tls201_appln.appln_id, person_name
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


lennartschott
Posts: 6
Joined: Mon Jun 01, 2020 3:45 pm

Re: Problem: CPC-classes, applicants & DOCDB families

Post by lennartschott » Wed Jun 03, 2020 10:12 am

Thank you for the quick help!

This is the code I am using now:

Code: Select all

SELECT  DISTINCT tls201_appln.docdb_family_id, tls201_appln.docdb_family_size, tls201_appln.earliest_filing_year, 
tls201_appln.nb_applicants, tls206_person.person_name, tls206_person.person_ctry_code, CPC.CPCgrouped
FROM tls201_appln
JOIN tls207_pers_appln ON tls201_appln.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ON tls207_pers_appln.person_id= tls206_person.person_id
JOIN (select docdb_family_id , STRING_AGG(CAST(cpc_class_symbol AS VARCHAR(MAX)), ',')CPCgrouped 
		from tls225_docdb_fam_cpc group by docdb_family_id ) CPC ON tls201_appln.docdb_family_id = CPC.docdb_family_id
WHERE tls201_appln.earliest_filing_year >= 2010
AND tls201_appln.earliest_filing_year < 9999
AND cpc.docdb_family_id in (select distinct docdb_family_id from tls225_docdb_fam_cpc where cpc_class_symbol = 'B60C  23/0408' 
OR cpc_class_symbol LIKE 'B60L%' 
OR cpc_class_symbol = 'Y04S  30/126' 
OR cpc_class_symbol = 'Y04S  30/14')
AND (applt_seq_nr > 0)
group by tls201_appln.docdb_family_id, tls201_appln.docdb_family_size, tls201_appln.nb_applicants, 
tls206_person.person_name, tls206_person.person_ctry_code, tls201_appln.earliest_filing_year, CPC.CPCgrouped
order by tls201_appln.docdb_family_id, person_name

Within the same DOCDB family ID, the number of applicants varies across the different applications. So if I now also query the number of applicants (which I definitely want to do), this will result in the same applicants appearing multiple times per family ID (sometimes even with different spellings). I would now like to simply query the cases that can be assigned to the application with the most applicants within the family ID. I then assume that all persons who were involved in the patent through the different applications appear once (and no longer several times). Would that solve the problem or can you think of something else?

Thanks in advance!
Best,
Lennart


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Problem: CPC-classes, applicants & DOCDB families

Post by EPO / PATSTAT Support » Wed Jun 03, 2020 3:22 pm

Hello Lennart,
I took the freedom to correct the cpc codes in your query, the blanks should be added like this

Code: Select all

'Y04S  30/14' and not 'Y04S30  /14'
Ideally you would have access to a full PATSTAT data base instead of PATSTAT Online. That would allow you to create intermediate tables that contain data agregated at family level instead of application level. So I did a couple of changes to try to squeeze in as much a possible. The name variations are partly solved by using the PSN_NAME. I also removed the inventors - assuming you don't need them. And if you want to further remove, you could also let the person country code go if you don't need that for your analysis. That is probably the maximum you can push it on PATSTAT Online. If you are not happy with that, then I would download a data base extraction based on the appln_id and do any further analysis of-line in MS_ACCESS.

Code: Select all

SELECT temp1.*, tls201_appln.docdb_family_size, 
temp.max_applicants, tls206_person.psn_name, tls206_person.person_ctry_code, CPC.CPCgrouped
FROM tls201_appln
JOIN tls207_pers_appln ON tls201_appln.appln_id = tls207_pers_appln.appln_id
JOIN tls206_person ON tls207_pers_appln.person_id= tls206_person.person_id
JOIN (select docdb_family_id , STRING_AGG(CAST(cpc_class_symbol AS VARCHAR(MAX)), ',')CPCgrouped 
		from tls225_docdb_fam_cpc group by docdb_family_id ) CPC ON tls201_appln.docdb_family_id = CPC.docdb_family_id
join (select docdb_family_id, max(nb_applicants) max_applicants from tls201_appln group by docdb_family_id) temp on tls201_appln.docdb_family_id= temp.docdb_family_id
join (select docdb_family_id, min(earliest_filing_date) min_date  from tls201_appln group by docdb_family_id) temp1 on tls201_appln.docdb_family_id = temp1.docdb_family_id
WHERE tls201_appln.earliest_filing_year >= 2010
AND tls201_appln.earliest_filing_year < 9999
AND cpc.docdb_family_id in (select distinct docdb_family_id from tls225_docdb_fam_cpc where cpc_class_symbol = 'B60C  23/0408' 
OR cpc_class_symbol LIKE 'B60L%' 
OR cpc_class_symbol = 'Y04S  30/126' 
OR cpc_class_symbol = 'Y04S  30/14')
AND (applt_seq_nr > 0) and (invt_seq_nr = 0)
group by tls201_appln.docdb_family_size, temp1.min_date, temp1.docdb_family_id,
temp.max_applicants, tls206_person.psn_name, tls206_person.person_ctry_code, CPC.CPCgrouped,temp1.docdb_family_id
order by temp1.docdb_family_id, psn_name
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


lennartschott
Posts: 6
Joined: Mon Jun 01, 2020 3:45 pm

Re: Problem: CPC-classes, applicants & DOCDB families

Post by lennartschott » Wed Jun 03, 2020 4:02 pm

Thank you for your support!
Unfortunately, when I run the SQL query on PATSTAT Online, I get the following error message:

[Error Code: 207, SQL State: S0001] Invalid column name 'min_date'

What needs to be changed?


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Problem: CPC-classes, applicants & DOCDB families

Post by EPO / PATSTAT Support » Wed Jun 03, 2020 6:53 pm

I adapted the query; PATSTAT Online is a bit particular about using renaming columns in tables.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


lennartschott
Posts: 6
Joined: Mon Jun 01, 2020 3:45 pm

Re: Problem: CPC-classes, applicants & DOCDB families

Post by lennartschott » Wed Jun 03, 2020 8:08 pm

Thanks for your great support - it worked!

However, I have now come across the following two issues and I am not quite sure what to make of them:

1) During further data inspection I came across the fact that with the new query the number of displayed psn_names of a family_id no longer matches the number in the new column max_applicants. For example, if the column max_applicants returns the value 3, e.g. only 2 rows (and thus only 2 psn_names) are displayed for the same family_id. I would like to have as many rows per family_id as there are applicants on it. I seem to be losing information about some applicants now.

2) Furthermore, I noticed that in the newly created column for the CPC classes some classes appear twice/multiple times. How could that be fixed?

Thank you so much!
Best wishes,
L.


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Problem: CPC-classes, applicants & DOCDB families

Post by EPO / PATSTAT Support » Wed Jun 03, 2020 10:00 pm

1) If you remove the "and (invt_seq_nr = 0)", you will get all the applicants back. It basically excludes many of the "natural persons" from the applicant list, namely the ones that are applicants.
2) CPC codes can be assigned by multiple offices and that can create duplicates.
De-duplicating is very hard without making intermediate tables (and STRING_AGG does not accept a DISTINCT). I would recommend to extract a data set and create intermediate tables which will make it much easier.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


lennartschott
Posts: 6
Joined: Mon Jun 01, 2020 3:45 pm

Re: Problem: CPC-classes, applicants & DOCDB families

Post by lennartschott » Thu Jun 04, 2020 8:32 pm

Thanks! I removed the "and (invt_seq_nr = 0)", but during the further data preparation it turned out that there are still a few problems for which I currently have no solution.
When I use the query suggested by you, in many cases, there are still more rows per family_id than the maximum number of applicants involved on the respective family_id suggests.
I think this might always be the case when there are several applications within a family with the same (max.) number of applicants (e.g. two applications with 6 applicants, which is the max. within the family id). If the identical applicants then appear on the two applications with different spellings, they are no longer recognized as identical and are output several times in this query.
And this then leads to the described effect that there are too many rows per family_id. Do you have a smart solution to how I could surround this problem?

Or would my question have already been answered with your response from yesterday?
That is probably the maximum you can push it on PATSTAT Online. If you are not happy with that, then I would download a data base extraction based on the appln_id and do any further analysis of-line in MS_ACCESS.
Thanks & all the best,
L.


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Problem: CPC-classes, applicants & DOCDB families

Post by EPO / PATSTAT Support » Wed Jun 10, 2020 4:10 pm

Hello Lennart,
your analysis is correct; family members can have different spellings in the applicant name, the number of applicants can be different (that is why we took MAX); it could in principle even be different companies, and also the presence of a country code will lead to "different persons".
A typical example is the common given of inventors being flagged as applicants for US applications. That is the reason why we often force invt_seq_nr = 0 to exclude those names from the list. But it will also exclude the patents where there is no company involved, and all applicants are inventors..--> out goes the application.
It is a typical problem when researchers do "family based analysis" on a data base that is "application based". The ideal solution would be to create a tailor made table that groups all the "family based data attributes" based on the criteria you need for your research. You can do that by extracting a PATSTAT subset, and then aggregate data to family level using your own criteria.
I have seen research groups doing this, but it would be interesting if this forum could come up with a consensus on how to create such table(s) to allow quick "family based indicators" without the data grouping problems.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Post Reply