Sorry this is not a full answer, and not in a logical order either. When I have more time I will try to answer in full.
IPC and CPC codes have a certain format, which is why Y02E 10/72% gives no results.
There should be two spaces between Y02E and 10. Put differently, there is room for 4 digits following Y02E, and if you use fewer digits than that, the number is left-padded with spaces.
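If you build the search strings in code, a small helper can take care of the padding. This is only a sketch in Python: the function name `pad_cpc_symbol` is made up for illustration, and it assumes the input is written the usual way with a single space (e.g. `Y02E 10/72`).

```python
def pad_cpc_symbol(symbol: str) -> str:
    """Right-align the main group in a 4-character field after the subclass."""
    # "Y02E 10/72%" -> subclass "Y02E", rest "10/72%"
    subclass, rest = symbol.split(maxsplit=1)
    # Split the rest into main group ("10") and subgroup ("72%")
    main_group, slash, subgroup = rest.partition("/")
    # Left-pad the main group with spaces up to 4 characters
    return f"{subclass}{main_group:>4}{slash}{subgroup}"


print(repr(pad_cpc_symbol("Y02E 10/72%")))  # 'Y02E  10/72%'
```

The padded string is what you would then put in the LIKE condition.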
Are you using PATSTAT Online or a local installation? I am a big fan of using intermediate tables, but that is not possible with the Online version (unless you extract the data and import it into a local database).
When counting citations and working with patent families, you should count that family X cites family Y once, rather than counting each of patents X1-Xn citing family Y separately. Also, what is your approach to self-citations?
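To illustrate the family-level idea, here is a rough Python sketch on toy data. The function name and the data are invented for the example, and dropping citations within the same family is my own assumption (it is not the same thing as applicant self-citations, which you would have to handle separately).

```python
def family_citation_pairs(citations, family_of):
    """Collapse publication-level citations into distinct
    (citing family, cited family) pairs."""
    pairs = set()
    for citing_pub, cited_pub in citations:
        citing_fam = family_of[citing_pub]
        cited_fam = family_of[cited_pub]
        # Assumption: a family citing itself is not counted
        if citing_fam != cited_fam:
            pairs.add((citing_fam, cited_fam))
    return pairs


# Toy data: X1 and X2 belong to family X; Y1 and Y2 to family Y
family_of = {"X1": "X", "X2": "X", "Y1": "Y", "Y2": "Y"}
citations = [("X1", "Y1"), ("X2", "Y1"), ("X2", "Y2"), ("X1", "X2")]

pairs = family_citation_pairs(citations, family_of)
print(pairs)  # {('X', 'Y')}
```

Four publication-level citations collapse into a single family-to-family citation.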
Unfortunately there is also a "hidden" double count. The culprit is the `citn_origin` field. Every citation has an origin, and there are eight different origin types. That would not have been a problem, except that the same citation can come from different origins, e.g. from both the applicant and the examiner. The origin types are:
SEA - citations introduced during search
APP - citations introduced by the applicant
EXA - citations introduced during examination
OPP - citations introduced during opposition
115 - citations introduced according to Art 115 EPC
ISR - citations from the International Search Report
SUP - citations from the Supplementary Search Report
CH2 - citations introduced during the Chapter 2 phase of the PCT
You can remove the double count with a COUNT(DISTINCT(`CITED_PAT_PUBLN_ID`)).
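Here is a tiny demonstration of the effect, using Python's built-in sqlite3 module. The `citation` table is a toy stand-in I made up with only three columns, not the real PATSTAT table, so treat it as an illustration of the principle only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy table: one row per (citing publication, cited publication, origin)
conn.execute(
    "CREATE TABLE citation ("
    "pat_publn_id INTEGER, cited_pat_publn_id INTEGER, citn_origin TEXT)"
)
conn.executemany(
    "INSERT INTO citation VALUES (?, ?, ?)",
    [
        (1, 100, "APP"),  # publication 100 cited by the applicant ...
        (1, 100, "SEA"),  # ... and again during search
        (1, 200, "SEA"),
    ],
)

raw, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT cited_pat_publn_id) "
    "FROM citation WHERE pat_publn_id = 1"
).fetchone()
print(raw, distinct)  # 3 2
```

The plain count sees three citations, while the distinct count sees the two publications that are actually cited.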
Something else to take into consideration... "A very important feature of European search reports is the allocation of search codes to each reference signifying its relevancy to the patent application in question in terms of the three criteria of patentability: novelty, inventive activity and industrial applicability (see the table below). These characteristics allow researchers to use the classification for weighting or filtering purposes, there is evidence that the composition of patent citations may matter considerably.
A Documents defining the general state of the art (but not belonging to X or Y)
D Documents cited in the application i.e. already mentioned in the description of the patent application
E Potentially conflicting documents – Any patent document bearing a filing or priority date earlier than the filing date of the application searched but published later than that date, and the content of which would constitute prior art
L Documents cited for other reasons (e.g. a document that may throw doubt on a priority claim)
O Documents which refer to non-written disclosure
P Intermediate documents - documents published between the date of filing of the application being examined and the date of priority claimed
T Documents relating to the theory or principle underlying the invention (documents which were published after the filing date and are not in conflict with the application, but were cited for a better understanding of the invention)
X Particularly relevant documents when taken alone (a claimed invention cannot be considered novel or cannot be considered to involve an inventive step)
Y Particularly relevant documents if combined with one or more other documents of the same category, such a combination being obvious to a person skilled in the art"
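As a rough idea of what filtering on these codes could look like, here is a Python sketch on toy data. The record layout (a `categories` string of code letters per citation) and the function name are assumptions of mine for the example, not how PATSTAT stores the codes.

```python
def keep_categories(citations, wanted=frozenset("XY")):
    """Keep citations that carry at least one of the wanted category codes."""
    return [c for c in citations if wanted & set(c["categories"])]


# Toy data: each citation lists its category letters as a string
citations = [
    {"cited": 100, "categories": "X"},
    {"cited": 200, "categories": "A"},
    {"cited": 300, "categories": "DY"},
    {"cited": 400, "categories": "AD"},
]

relevant = keep_categories(citations)
print([c["cited"] for c in relevant])  # [100, 300]
```

Weighting instead of filtering would work the same way, but the weights themselves are a research choice that I would not want to suggest here.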
Btw, "all applications that have been cited at least once (also implying they're granted, right?)" is an assumption that at first glance I would agree with, but if it is important to you that the patents are granted you should verify it. I am guessing that a company can cite its own patents even if these patents have not been granted.
You can always contact me at firstname.lastname@example.org, though for the benefit of the community it's best if you post your questions and answers here as well. (It's just that this board does not notify me of new posts!)