diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
new file mode 100644
index 0000000..1c45809
--- /dev/null
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,22 @@
+## Overview
+
+Brief description of what this PR does, and why it is needed.
+
+If this pr closes an issue, make note of it here 👇
+Closes #XXX
+
+### Demo
+
+Optional. Screenshots, `curl` examples, etc.
+
+### Notes
+
+Optional. Ancillary topics, caveats, alternative strategies that didn't work out, anything else.
+
+## Testing Instructions
+
+* How to test this PR
+* Prefer bulleted description
+* Start after checking out this branch
+* Include any setup required, such as bundling scripts, restarting services, etc.
+* Include test case, and expected output
diff --git a/README.md b/README.md
index de38dc0..fb38ea0 100644
--- a/README.md
+++ b/README.md
@@ -68,7 +68,7 @@ Having trouble building the code? [Open an issue](https://github.com/datamade/us

 ### Adding new training data

-If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. [Follow our guide in the training directory](https://github.com/datamade/usaddress/blob/master/training/README.md), and be sure to make a pull request so that we can incorporate your contribution into our next release!
+If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. [Follow our guide in the training directory](./training/README.md), and be sure to make a pull request so that we can incorporate your contribution into our next release!

 ## Important links

@@ -91,7 +91,7 @@ If usaddress is consistently failing on particular address patterns, you can adj

 Report issues in the [issue tracker](https://github.com/datamade/usaddress/issues)

-If an address was parsed incorrectly, please let us know! You can either [open an issue](https://github.com/datamade/usaddress/issues/new) or (if you're adventurous) [add new training data to improve the parser's model.](https://github.com/datamade/usaddress/blob/master/training/README.md) When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.
+If an address was parsed incorrectly, please let us know! You can either [open an issue](https://github.com/datamade/usaddress/issues/new) or (if you're adventurous) [add new training data to improve the parser's model.](./training/README.md) When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.

 If something in the library is not behaving intuitively, it is a bug, and should be reported.

@@ -103,4 +103,4 @@ If something in the library is not behaving intuitively, it is a bug, and should

 ## Copyright

-Copyright (c) 2025 Atlanta Journal Constitution. Released under the [MIT License](https://github.com/datamade/usaddress/blob/master/LICENSE).
+Copyright (c) 2025 Atlanta Journal Constitution. Released under the [MIT License](./LICENSE).
diff --git a/measure_performance/test_data/labeled.xml b/measure_performance/test_data/labeled.xml
index 70672bc..9077f7a 100644
--- a/measure_performance/test_data/labeled.xml
+++ b/measure_performance/test_data/labeled.xml
@@ -126,4 +126,34 @@
 150 Citizens Circle Little River, South Carolina 29566 United States
 4079 U.S. 17 Business Murrells Inlet, South Carolina 29576 United States
 43 South Broadway Pitman, New Jersey 08071 United States
+ HC 2333 Box 85
+ HC 284 Box 27
+ HC 7326 Box 66
+ HC 992 Box 88
+ HC R 32 Box # e3
+ HC ROUTE 72 BOX 1A
+ HIGHWAY CONTRACT rte # 46 BOX # 992
+ HIGHWAY CONtraCT ROUTE 56 BOX 45C
+ StaR ROUTE 75 BOX 5Z
+ HCR 4e box # 32
+ HCR 88 bOX 76E
+ HWY CONTRACT ROUTE 102 BOX 255A
+ 4510 COUNTY ROAD GV, APPLETON, WI 54913
+ 7575 COUNTY ROAD ZZZ, MILWAUKEE, WI 54567
+ 123A E COUNTY ROAD DV, WAUPACA, WI 54981
+ 1331 COUNTY ROAD AA NE, AMHERST JUNCTION, WI 54407
+ 133 W COUNTY ROAD LL, AMHERST, WI 54406
+ 123 COUNTY ROAD ABC, APT 12, IOLA, WI 54445
+ 200 EAST ELM, DENVER, COLORADO
+ 55 WINDSOR PLACE, CHAMPAIGN, ILLINOIS
+ 5 NORTH MAIN, VAN NUYS, CALIFORNIA
+ 2609 BAYVIEW, FORT LAUDERDALE, FL
+ 12855 6TH AVE, N. MIAMI, FL 33161
+ 783 HOPE ST, PROVIDENCE, RHODE ISLAND 02906
+ 200 EAST ELM, DENVER, COLORADO
+ 977 PLEASANT STREET, N. ORANGE, NJ 07052
+ 610 EAST MAIN MARION KANSAS
+ 10 EAST LAKE, DENVER, COLORADO
+ 2735 PAWTUCKET AVE EAST PROVIDENCE RHODE ISLAND 02914
+ 5548 ELMER AVENUE, N. HOLLYWOOD, CA 91601
diff --git a/pyproject.toml b/pyproject.toml
index 53715af..4097e2a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "usaddress"
-version = "0.5.13"
+version = "0.5.14"
 description = "Parse US addresses using conditional random fields"
 readme = "README.md"
 license = {text = "MIT License", url = "http://www.opensource.org/licenses/mit-license.php"}
diff --git a/training/README.md b/training/README.md
index aacd997..f98b803 100644
--- a/training/README.md
+++ b/training/README.md
@@ -280,6 +280,22 @@ Congratulations! The model has officially improved. You can safely move on to st

 If any of our tests failed, however, things become more complicated. The output will break down the tests that failed, showing you the parse that the model produced (labeled `pred`) and the parse that the test expected (labeled `true`). In this case, jump to step 5a to debug your errors.

+If you'd like to additionally spot check singular addresses in the python shell, install a virtual environment, activate it, install your WIP version of this package, and open a shell.
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]" -v
+python
+# shell starts up
+>>>
+```
+
+Then import usaddress and start parsing!
+```python
+>>> import usaddress
+>>> usaddress.parse("a funky address")
+```
+
 **5a. Repeat steps 1-4 until the tests pass.**

 If you've arrived at this step, it means that some of your tests failed. Uh oh!
diff --git a/training/labeled.xml b/training/labeled.xml
index 253d568..005ce65 100644
--- a/training/labeled.xml
+++ b/training/labeled.xml
@@ -1505,4 +1505,44 @@
 3110 West 12th Street Sioux Falls, South Dakota 57104 United States
 42 Water Street New Shoreham, Rhode Island 02807 United States
 291 Dairy Barn Lane Fort Mill, South Carolina 29715 United States
+ HC 0903 Box 62
+ HC 021 Box 52
+ HC ROUTE 68 BOX 23A
+ HC RTE 24 box 2A
+ HC rte 15B bOX # 1A
+ StaR Rte 12A BOX # 455B
+ star route 24 box # 45
+ STAR RTE 102 Box # 95
+ HWY Contract RTE 68 BOX 98A
+ hwy CONTRACT route # 15B BOX # 1A
+ HCR 99 boX 22B
+ HCR 45ac box 653
+ HIGHWAY CONTRACT ROUTE 12A BOX 285
+ HIGHWAY CONTRACT rte # 24 BOX # 2A
+ 0 COUNTY ROAD MM, AMHERST JUNCTION, WI 54407
+ 1000 COUNTY ROAD DB, MOSINEE, WI 54455
+ 133 COUNTY ROAD KK, AMHERST, WI 54406
+ 5859 COUNTY ROAD DD, WAUPACA, WI 54981
+ 10006 COUNTY ROAD MM, AMHERST JUNCTION, WI 54407
+ 7529 E COUNTY ROAD MM, Janesville, WI 53546
+ 4013 W COUNTY ROAD MM, Lebanon, WI 53098
+ 6842 COUNTY ROAD D, Almond, WI 54909
+ 1737 COUNTY ROAD VV, Seymour, WI 54165
+ COUNTY ROAD UV, Logan Township, NE 68038
+ 1224 COUNTY ROAD DV, CASHTON, WI 54619
+ 610 EAST MAIN MARION KANSAS
+ 10 EAST LAKE, DENVER, COLORADO
+ 2104 WINDSOR PLACE, SAVOY, ILLINOIS
+ 19 HARGROVE GRADE, PALM COAST FL
+ 100 WEST SEVENTH, LOS ANGELES, CALIFORNIA
+ 225 RIDGEDALE AVE, N HANOVER, NJ 07936
+ 384 William St, E. Orange, NJ 07017
+ 1250 Supply St, N. Charleston, NJ 29405
+ 1055 S Broadway, East Providence, Rhode Island 02914
+ 55 WEST 10TH DENVER COLORADO
+ 1301 SE 2ND Fort Lauderdale, FL
+ 6426 Bellingham Avenue, N. Hollywood, CA 91606
+ 510 NE 93rd Miami Shores, FL
+ 1350 NW 55th Fort Lauderdale, FL
+ 1600 NE 4TH FORT LAUDERDALE, FL
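
Reviewer note: the spot check that the `training/README.md` hunk describes could also be scripted instead of typed into the shell one address at a time. The sketch below is only an illustration. `spot_check` and its `parse_fn` parameter are hypothetical names, not part of usaddress. It assumes the parser returns a list of `(token, label)` pairs, as `usaddress.parse` does. Taking the parser as an argument lets the helper run without the package installed.

```python
from typing import Callable, Iterable, List, Tuple

# A parse is assumed to be a list of (token, label) pairs.
Parse = List[Tuple[str, str]]


def spot_check(addresses: Iterable[str],
               parse_fn: Callable[[str], Parse]) -> List[str]:
    """Return one readable report per address: the raw string, then one
    indented 'token  label' row for each pair the parser produced."""
    reports = []
    for address in addresses:
        pairs = parse_fn(address)
        # Pad tokens to a common width so the labels line up in a column.
        width = max((len(token) for token, _ in pairs), default=0)
        lines = [address]
        lines += [f"  {token:<{width}}  {label}" for token, label in pairs]
        reports.append("\n".join(lines))
    return reports


# Usage, assuming the editable install from the README step:
#   import usaddress
#   samples = ["HC 2333 Box 85", "4510 COUNTY ROAD GV, APPLETON, WI 54913"]
#   print("\n\n".join(spot_check(samples, usaddress.parse)))
```

Printing the reports for a handful of the newly added highway contract and county road addresses gives a quick visual check of the labels, alongside the `pred`/`true` breakdown that the test run already provides.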