Patent count & forward citations for entire CPC category

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

Posts: 3
Joined: Thu Jul 02, 2015 12:36 pm

Patent count & forward citations for entire CPC category

Post by AlfonsoMA » Fri Jul 03, 2015 5:42 am


I would like to have some very rough idea of the state of innovation in specific technological fields. I have two goals:

1. Observe the patenting trends for an entire CPC category
2. Calculate average forward citations per patent family for an entire CPC category

I'm new to SQL and Patstats so unsurprisingly I've run into some difficulties.

I've tried queries such as this:

Code: Select all

SELECT DISTINCT t.appln_id, nb_citing_docdb_fam, prior_earliest_date, a.docdb_family_id
FROM tls224_appln_cpc t
JOIN tls201_appln ap ON ap.appln_id = t.appln_id
JOIN tls218_docdb_fam a ON t.appln_id = a.appln_id
JOIN tls228_docdb_fam_citn c ON c.docdb_family_id = a.docdb_family_id
WHERE t.cpc_class_symbol LIKE 'Y02C%'
AND ap.nb_citing_docdb_fam > 0 
which, just to confirm, I gives me a table with:
  • all applications that have been cited at least once (also implying they're granted, right?)
    the patent families they belong to,
    the number of forward citations per application,
    the first date the first application was filed.
However, beyond here I get stuck:

a) How do I merge the citation data so that I get one row per family? I'm not interested in keeping the appln_id in the final output but I understand it's necessary to use it... yet it also seems to be causing problems. I tried using GROUP BY, which gives me the following "Error Code: 8120, SQL State: S0001] Column 'tls224_appln_cpc.appln_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause"

b) Can I use the aggregate function AVG for my purposes or am I better off exporting the data to Excel?

In addition, I have an odd problem:

c) if I replace the category by a more precise one, such as 'Y02E 10/72%', results are 0.I've checked WIPO Standards and Patstat's instructions for this to no avail. Does anyone have any examples of how this works?

Any comments on any aspect of this would be most welcome!

Thank you very much.

Best regards


Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm

Re: Patent count & forward citations for entire CPC category

Post by nico.rasters » Fri Jul 03, 2015 3:59 pm

Sorry this is not a full answer, and not in a logical order either. When I have more time I will try to answer in full.

IPC and CPC codes have a certain format, which is why Y02E 10/72% gives no results.
There should be two spaces between Y02E and 10. Or, put differently, there is room for 4 digits following Y02E and if you use less digits than it is left padded with spaces.

Are you using PATSTAT Online or a local installation? I am a big fan of using intermediate tables, but that is not possible with the Online version (unless you extract the data and import it in a local database).

When counting citations and working with patent families you should consider that family X is citing family Y, not just that patents X1-Xn are citing family Y. Also, what is your approach to self citations?

Unfortunately there is also a "hidden" double count. The culprit is the `citn_origin` field. Every citation has an origin, and there are eight different origin types. That would have been no problem, except for the fact that the same citation can come from different origins, e.g. from both the applicant and the examiner.

SEA - citations introduced during search
APP - citations introduced by the applicant
EXA - citations introduced during examination
OPP - citations introduced during opposition
115 - citations introduced according to Art 115 EPC
ISR - citations from the International Search Report
SUP - citations from the Supplementary Search Report
CH2 - citations introduced during the Chapter 2 phase of the PCT

You can remove the double count with a COUNT(DISTINCT(`CITED_PAT_PUBLN_ID`)).

Something else to take into consideration... "A very important feature of European search reports is the allocation of search codes to each reference signifying its relevancy to the patent application in question in terms of the three criteria of patentability: novelty, inventive activity and industrial applicability (see the table below). These characteristics allow researchers to use the classification for weighting or filtering purposes, there is evidence that the composition of patent citations may matter considerably."

Code Meaning
A Documents defining the general state of the art (but not belonging to X or Y)
D Documents cited in the application i.e. already mentioned in the description of the patent application
E Potentially conflicting documents – Any patent document bearing a filing or priority date earlier than the filing date of the application searched but published later than that date, and the content of which would constitute prior art
L Documents cited for other reasons (e.g. a document that may throw doubt on a priority claim)
O Documents which refer to non-written disclosure
P Intermediate documents - documents published between the date of filing of the application being examined and the date of priority claimed
T Documents relating to the theory or principle underlying the invention (documents which were published after the filing date and are not in conflict with the application, but were cited for a better understanding of the invention)
X Particularly relevant documents when taken alone (a claimed invention cannot be considered novel or cannot be considered to involve an inventive step)
Y Particularly relevant documents if combined with one or more other documents of the same category,- such a combination being obvious to a person skilled in the art

Btw, "all applications that have been cited at least once (also implying they're granted, right?)" is an assumption that at first glance I would agree with, but if it is important to you that the patents are granted you should verify it. I am guessing that a company can cite its own patents even if these patents have not been granted.

You can always contact me at though for the benefit of the community it's best if you post your questions and answers here as well. (It's just that this board does not notify me of new posts!)
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship

Posts: 3
Joined: Thu Jul 02, 2015 12:36 pm

Re: Patent count & forward citations for entire CPC category

Post by AlfonsoMA » Sat Jul 04, 2015 11:47 am

Thank you for your reply Nico! very useful already!

Unfortunately, I am using the online version.

Let me see if I can bounce off the ideas you put forward. In any case, taking your advice, I will search explicitly for granted patents when calculating forward citations as an added measure of quality and to avoid problems.

Regarding the types of citations, I was using the nb_citing_docdb_fam figure, which according to the data catalog only includes the "search citations" from tls212_citation at the family level. How much of an approximation can this be to the actual level of family-to-family citations? I'm trying to use these citation counts as an indication of the level of cohesiveness among the inventions in a field (akin to asking: are inventions building on one another?). I'd be content with some indication that could be relative to other fields. For instance, if we found that families cited each other an average of n times in fields that have seen breakthroughs in recent years, we could at least surmise that another field is less likely to have breakthrough if there are substantially less than n inter-family citations on average (all other things being equal). Still thinking this through though.

If it is possible to do this calculation at a family to family level using tls212_citation, I'd be happy to count any type of forward citations, regardless of whether these were added by the applicant or the reviewers in any office, including self citations (meaning applicants citing their own patents?), assuming that all these citations have been accepted as relevant by reviewers (as I assume is the case for granted patents?). In this case, however, double counting as you describe may be much more of a problem. Mere repetitions of the same citation would spuriously inflate that level of cohesiveness that I'm trying to ascertain.

With all the above in mind, I'm really at a loss on how to implement your COUNT suggestion to avoid double counting while focusing on family-level. Even the fairly simple queries I've tried to get started return meaningless results. For example, the following code gives an end result of 1.

Code: Select all

FROM tls224_appln_cpc t
JOIN tls201_appln ap ON ap.appln_id = t.appln_id
JOIN tls212_citation cit ON ap.appln_id = cit.cited_appln_id
WHERE t.cpc_class_symbol LIKE 'Y02C  10%'
Other more complex queries including families return syntax problems, which sound like they could only be solved by intermediate tables...

Specifically on the coding issue, it seems extremely useful... but I don't know where I am supposed to search. Is this information available on PatStat Online?

Beyond this, I've been thinking of using a simple count per year of families filed to give an idea of the level invention efforts. Would this be a reasonable (and straightfoward) way to do it?

Thanks again to everyone for any comments!

Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm

Re: Patent count & forward citations for entire CPC category

Post by nico.rasters » Sat Jul 04, 2015 1:09 pm

Chapter 6 of ... manual.htm deals with citations.

If I understand correctly you are identifying radical inventions through a forward citation patent count?
Are you familiar with ... of_science? Do you think it might influence citation behaviour?
Is it relevant from which technological fields citations are coming? A count by itself does not reveal this.

I think you are safe with the nb_citing_docdb_fam count with regards to the double count and family-family citations. Even if it's only an approximation (which it probably isn't), it does not matter because it's a standard variable so its characteristics are known and other researchers can reproduce your results. That in my opinion is more important than having perfect but obscure data.

Note that patent offices have their own unique practices and this influences citation counts. In the US, applicants are rather overzealous in adding citations, while in Europe there is no requirement for applicants to cite. So overall USPTO patents will have more citations than EP patents. Even within the same patent office the citation practice over time may change, leading to an increase or decrease in citations. Also, not all publications are issued at the same moment, so they do not have the same opportunity to cite or be cited. A (silly) example: if I were to file the very first patent in the whole world, then there would be no other patents to cite. The more patents there are, the more citations there can be. And besides patent office practices and the effect of time there is also the effect of the technological field the patent is part of.

The nb_citing_docdb_fam count is based on the DOCDB family. I prefer DOCDB over INPADOC, but -in my opinion- it has one huge problem (well, two problems). The priority patent itself is not included in the family. Now one can argue that this is due to the definition, but that could be resolved by introducing a self-citation for priorities. Anyway, the result is that you have patent P1 (the priority patent) and patents P2-Pn which make up the DOCDB family. P1 and P2-Pn are the same invention. P1 will also have publications and citations, but because P1 is not part of the family its citations are not included in nb_citing_docdb_fam. P1 is invisible. And as DOCDB is based on expert intervention I have no idea how to find the actual priorities.
The Equivalents family would be better. See

Your idea to start with a simple count per year of families filed is a good one. Which year(s) will you be using? A family has 1-to-many applications, each application has 1-to-many publication, each publication has 0-to-many citations. Applications and publications have their own dates. Then there's also the grant date and the priority date. When does an inventor cite another patent? Can it only be done once a patent has been published or in other words when the knowledge becomes available to the outside world? In that case I could argue that it would be most fair to use the publication date of each publication. But that's not possible with your nb_citing_docdb_fam variable. In any case it is a total count so you can not do a citation count per year.

I always start by manually looking at one example. Query an application, export to Excel. Query the publications, export to Excel, etc. Otherwise it is very easy to get lost in the query building.
Btw, most likely you can do everything within a single query. EPO's Martin Kracker can manage that anyway. One of the reasons why I use intermediate tables is because I lack Martin's skills.

Start simple. Write down all the caveats. Then try to solve those step by step.
Data does not have to be perfect, as long as you perfectly understand (and document) the flaws.
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship

Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm

Re: Patent count & forward citations for entire CPC category

Post by nico.rasters » Sat Jul 04, 2015 1:14 pm

I forgot to answer your question about the coding. I have an older version of PATSTAT so your mileage may vary.

Table `TLS215_CITN_CATEG` contains the categories of citations, but only if the CITN_ORIGIN of the citation is "SEA" (citations introduced during search). An example using the publication with PAT_PUBLN_ID=7:

Code: Select all

7 		1	A
7 		2 	A
7 		3 	A
7 		4 	A
7 		5 	D
7 		5 	P
7 		5 	X
7 		6 	P
7 		6 	Y
7 		7 	X
7 		8 	A

Code: Select all

# 8 total, Query took 0.0150 sec
7 		1	 	60325621 		0 		1 		0 		0	     	0 	 
7 		2	 	60325553 		0 		2 		0 		0     		0 	 
7 		3	 	60325552 		0 		3 		0 		0     		0 	 
7 		4	 	60276872 		0 		4 		0 		0     		0 	 
7 		5	 	28401739 		0 		5 		0 		0     		0 	 
7 		6	 	28450486 		0 		6 		0 		0     		0 	 
7 		7	 	0 			1 		0 		1 		0     		0 	 
7 		8	 	0 			2	 	0 		2 		0     		0 	 
In both tables we find CITN_IDs ranging from 1 to 8. The (PAT_PUBLN_ID,CITN_ID) pairs are linked, so CITED_PAT_PUBLN_ID 60325621 has category A. Note that CITN_ID=5 has more than one category, namely D,P,X. The same goes for CITN_ID=6 with categories P,Y. And apparently NPL citations can have categories too.
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship

Post Reply