User solutions

Here users can share their specific solutions, e.g. how they further process the data retrieved from Espacenet.
Any suggestions for additional developments and changes to be requested from the EPO should rather be added to the other categories.


Espacenet CSV files - field counter for statistical purposes

Post by datenblatt » Tue Jan 26, 2021 4:30 pm

Dear friends, colleagues and fellow users of Espacenet

As part of my bachelor thesis I spent quite some time analysing the CSV files that Espacenet produces.

This was for several reasons. Mostly I wanted to count the number of patents issued per year to get a feel for the maturity of a field, and also to create my own graphs with spreadsheet programs. The graphs produced by the Espacenet website are useful, but when there is a large number of entries they do not always reflect everything. Plus, being able to create the graphs in different styles always looks good.

In short, I have created a tool that parses the CSV files. It's written in Bash, which means it works on any flavor of Linux with a modern Bash shell (version 4.0 or later). It might work on Macs too, but it hasn't been tested there, and macOS ships Bash 3.2 by default, so you would probably need to install a newer Bash first. (Tell me if you try!)

It is also my first piece of open-source software! :)

What it does

It parses every Espacenet CSV file present in the same directory as the script and creates at least one new CSV file with the results. The results are, essentially, a hit count of the requested fields. So, for instance, if you are checking titles, you get a count of how many times each specific title appears.

Additionally, it can create UNIX-formatted copies of the DOS-formatted CSV files that Espacenet produces.

The CSV files that it produces are UNIX and DOS (i.e. Windows) compatible.
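
To give an idea of what a "hit count" of a field means, here is a minimal sketch. It is not the code of EPOParser.sh. It assumes that fields are separated by ';', that no field contains an embedded ';', and that the first line is a header row; the function name count_field and the sample data are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only, not the code of EPOParser.sh.
# Assumptions: fields are separated by ';', no field contains an embedded ';',
# and the first line of the file is a header row.

# count_field COLUMN FILE: print "value;count" for every distinct value in COLUMN.
count_field() {
    local column=$1 file=$2
    tr -d '\r' < "$file" |        # drop the DOS carriage returns
        tail -n +2 |              # skip the header row
        cut -d ';' -f "$column" | # keep only the requested column
        LC_ALL=C sort | uniq -c | # group identical values and count them
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 ";" n }'
}

# Demo with made-up data in a temporary file.
demo=$(mktemp)
printf 'Title;Publication date\r\nSolar panel;2019-03-01\r\nWind turbine;2020-07-15\r\nSolar panel;2020-01-10\r\n' > "$demo"
count_field 1 "$demo"
rm -f "$demo"
```

For the made-up data, the title "Solar panel" is reported with a count of 2 and "Wind turbine" with a count of 1. Real CSV files with quoted fields would need a more careful parser than cut.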

Disaggregation

For each field, my script can disaggregate. For titles it can count the occurrences of each full title, or of each word in the titles. For CPC codes, it can count the occurrences of each individual CPC code or of each group of CPC codes. You get the drift.

For dates (publication dates, priority dates, etc.) it can record the month and year, or only the year.
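
As an illustration of disaggregating titles into words, the following sketch counts each word with a Bash associative array, which is one of the features that need Bash 4.0 or later. It is not taken from the script: the function name count_words is hypothetical, and splitting on whitespace only (no handling of case or punctuation) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: count individual words across titles read from stdin,
# one title per line. Needs Bash 4.0+ for associative arrays.
# Assumption: words are separated by whitespace; case and punctuation
# are left untouched.

count_words() {
    local -A hits=()      # word -> number of occurrences
    local -a words
    local line word
    while IFS= read -r line || [[ -n $line ]]; do
        read -r -a words <<< "$line"          # split the title into words
        for word in "${words[@]}"; do
            hits[$word]=$(( ${hits[$word]:-0} + 1 ))
        done
    done
    # Print "word;count", sorted so the output order is stable.
    for word in "${!hits[@]}"; do
        printf '%s;%d\n' "$word" "${hits[$word]}"
    done | LC_ALL=C sort
}

# Demo with two made-up titles.
printf 'Solar panel mount\nSolar cell\n' | count_words
```

In the demo, "Solar" is counted twice and every other word once.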

What it cannot do

It cannot disaggregate the applicants, and it is unreliable when disaggregating the inventors. This is because they are often written in the field without any mark that separates the individual inventors/applicants. For instance, some inventors are listed as "Carey Mahoney [US] Tomoko Nogata [JP]", so the script can tell where one name ends and the next starts. Some entries, however, are written "Eric Lassard Carl Proctor"; there my script cannot tell, and hence doesn't disaggregate.
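
The difference between the two cases can be sketched as follows. This is only an illustration, not the logic of the script: it assumes that a separable name always ends in a two-letter country code in square brackets, and the function name split_inventors is hypothetical. A field without such markers is printed unchanged, because there is no safe place to cut it.

```shell
#!/usr/bin/env bash
# Sketch only: split an inventor field on "[XX]" country markers.
# Assumption: each separable name ends with a two-letter code in brackets.

split_inventors() {
    local field=$1
    local marker='\[[[:upper:]]{2}\]'
    local re="^ *([^][]+${marker})(.*)\$"
    # Peel off one "Name [XX]" chunk at a time from the front of the field.
    while [[ $field =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        field=${BASH_REMATCH[2]}
    done
    # Whatever is left has no marker; print it as a single, unsplit entry.
    field="${field#"${field%%[![:space:]]*}"}"   # trim leading whitespace
    if [[ -n $field ]]; then
        printf '%s\n' "$field"
    fi
}

split_inventors 'Carey Mahoney [US] Tomoko Nogata [JP]'   # two lines, one per inventor
split_inventors 'Eric Lassard Carl Proctor'               # one line, left as it is
```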

Also, some date fields contain several dates. I could extend the script to disaggregate those dates, but I have decided not to, because I haven't sat down and studied why there is more than one date. I do not touch data whose underlying logic I don't understand, because that opens the door to subtle data corruption and to apparent artifacts that aren't really in the data.

How does it work?

Usage: ./EPOParser.sh -aCdeiIptu (minimum disaggregation) or -axCxdyeyixIxpytxu (maximum disaggregation) and any option in between.

Text options:

-t Count the times a specific title appears.
-i Count the times each inventor or group of inventors appears.
-a Count the times an applicant or group of applicants appears.
-I Count the times a specific grouping of IPC numbers appears.
-C Count the times a specific grouping of CPC numbers appears.

The disaggregation modifier "x" after each of the above options tries to split the entries. -Ix, for instance, counts the number of times each IPC number appears, instead of the groupings of IPC numbers in each patent. -tx counts individual words in titles, not the appearance of each full title.

Exceptions:

The "x" modifier does not work with the -a option and is unreliable with the -i option. The reason is that there is no consistent pattern for disaggregating inventor and applicant names.

Date options:

-d Count how many patents were published each month of each year.
-p Count the priority dates.
-e Count the earliest publication dates.
The modifier "y" after a date option eliminates the month and counts patents per year.
Exceptions: Some fields of priority or earliest publication dates contain more than one date, which may muddy the results.
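
The idea behind the date options and the "y" modifier can be sketched like this. It is only an illustration under assumptions: dates are taken to be in the form YYYY-MM-DD, the function name count_dates is hypothetical, and skipping any field that does not hold exactly one date is a choice made for this sketch, not necessarily what the script does.

```shell
#!/usr/bin/env bash
# Sketch only: count dates per month or per year.
# Assumption: one date field per input line, in the form YYYY-MM-DD.
# Fields that do not hold exactly one date are skipped in this sketch.

# count_dates month|year  (reads the date fields from stdin)
count_dates() {
    local mode=${1:-month}
    local len=7                      # "YYYY-MM"
    [[ $mode == year ]] && len=4     # "YYYY"
    tr -d '\r' |
        grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' |   # keep single, well-formed dates
        cut -c "1-$len" |                          # truncate to month or year
        LC_ALL=C sort | uniq -c |
        awk '{ print $2 ";" $1 }'                  # "period;count"
}

# Demo with made-up dates; the last line holds two dates and is skipped.
dates=$'2019-03-01\n2019-03-20\n2020-07-15\n2019-01-01 2019-02-02'
printf '%s\n' "$dates" | count_dates month
printf '%s\n' "$dates" | count_dates year
```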

Special options:

-u Provides a copy of the original, DOS-formatted Espacenet files that is UNIX-formatted, for easier CSV parsing.
--help Displays this help text and exits.
--version Displays version information and exits.
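
What the -u option does can be approximated with a one-line conversion from DOS (CRLF) to UNIX (LF) line endings. This is a sketch of the general technique, not the script's own code; the function name dos_to_unix and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: write a UNIX-formatted (LF) copy of a DOS-formatted (CRLF) file.
# The function name and the file names are hypothetical.

dos_to_unix() {
    local src=$1 dest=$2
    # $'...' lets Bash put a literal carriage return into the sed expression.
    sed $'s/\r$//' "$src" > "$dest"
}

# Demo with a made-up two-line file.
src=$(mktemp)
dest=$(mktemp)
printf 'No;Title\r\n1;Solar panel\r\n' > "$src"
dos_to_unix "$src" "$dest"
od -c "$dest"    # the dump shows \n line endings and no \r
rm -f "$src" "$dest"
```

The original file is left untouched, which matches the idea of providing a copy.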

Warnings

It is written with many idiomatic bashisms, so it might not work in other shells, and it certainly does not work in DASH.
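
If you want to protect your own Bash scripts against being run in the wrong shell, a guard along these lines is one option. Whether EPOParser.sh contains such a check is not stated here; the function name require_bash4 is hypothetical. The guard itself uses only plain POSIX syntax, so that a shell like dash can evaluate it and print the message instead of failing with a syntax error.

```shell
#!/usr/bin/env bash
# Sketch only: refuse to continue unless the shell is Bash 4.0 or later.
# Uses POSIX-only syntax so that other shells can still evaluate the check.

require_bash4() {
    if [ -z "${BASH_VERSION:-}" ]; then
        echo "This script needs Bash; it cannot run in this shell." >&2
        return 1
    fi
    # In Bash, $BASH_VERSINFO expands to element 0, the major version number.
    if [ "${BASH_VERSINFO:-0}" -lt 4 ]; then
        echo "This script needs Bash 4.0 or later (found $BASH_VERSION)." >&2
        return 1
    fi
    return 0
}

require_bash4 || exit 1
echo "Running under Bash $BASH_VERSION"
```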

License

GNU GPL V3.0, so you can use it and modify it as you please. The code is exhaustively commented to help with that.

Remember attribution, and remember that you cannot use my code for any closed-source software!

Yeah fine, great, but where do I get it?

At my GitHub repository: Click here

How do I use it?

  1. Download EPOParser.sh to the directory where you will work.
  2. Open that directory in the shell. Google it if you don't know what I am talking about. This is unusual but don't be afraid, you will not break your computer doing this!
  3. Type 'chmod +x EPOParser.sh' in the shell (without the quotes of course) and press enter. This marks the script as an executable.
  4. Place the Espacenet CSV files in the same directory and enter the instructions, as the usage function shows. For instance: ' ./EPOParser.sh -cCtxdyu '. Or type ./EPOParser.sh --help to read the usage function.
  5. If you are not used to working with Linux, remember that you need to write './' before the name of the script, followed by the options.

I hope you find it useful! Comments and criticism are welcome.

