11 Replies. Latest reply: Sep 22, 2008 9:49 PM by 807589

    Parsing strings to dates with unknown formats

    807589
      I've perused Java's standard libraries and Joda-time's APIs for date parsing tools, but can't find anything good. I'm reading dates from files where the format is not standardized. So I don't know if it'll be in YYYY-MM-dd or fully spelled out. Also, the dates may be from a LONG time ago, before the 15th century, which is when I understand the Date class stops working properly.

      Does anybody have any suggestions? I guess I can use Joda-time's DateTimeFormat.forPattern("YYYY-etc...").parseDateTime(dateString), catch the formatting exception, and try alternative formats successively. Or do this all manually. But I'm hoping there's a better way and that someone has already thought about how to handle this. Others MUST'VE run into this issue before!
        • 1. Re: Parsing strings to dates with unknown formats
          807589
          Without some hint as to what the format is I think you're doomed - what date should this parse to: 03.08.14 ?
          • 2. Re: Parsing strings to dates with unknown formats
            807589
            For a date like 03.08.14, I think it's pretty clear in the en_US locale that it's March 8, 2014. For a European locale, that might be August 3, 2014. But in this particular case, I can understand a parser giving up, and I would have to manually check the date or record no date. However, for Aug 3, 2014 or 3 August 2014, I would fully expect a parser to figure that out. I can't believe nobody's dealt with this before and created a standard library. Even the simple DatePicker on our website has this capability in JavaScript, using the Ext JS library:
            [http://extjs.com/deploy/ext/docs/output/Ext.form.DateField.html]
            • 3. Re: Parsing strings to dates with unknown formats
              807589
              scottatstartel wrote:
              For a date like 03.08.14, I think it's pretty clear in the en_US locale that it's March 8, 2014. For a European locale, that might be August 3, 2014. But in this particular case, I can understand a parser giving up, and I would have to manually check the date or record no date. However, for Aug 3, 2014 or 3 August 2014, I would fully expect a parser to figure that out. I can't believe nobody's dealt with this before and created a standard library. Even the simple DatePicker on our website has this capability in JavaScript, using the Ext JS library:
              [http://extjs.com/deploy/ext/docs/output/Ext.form.DateField.html]
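              Using SimpleDateFormat with a format string of "MMM dd, yyyy" will handle Aug 3, 2014 (when parsing, the month letters accept both the abbreviated and the full month name). For 3 August 2014 you need a second format string with the day first, such as "dd MMM yyyy".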
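One way to check which of the example dates a given SimpleDateFormat format string accepts is simply to try each combination. The sketch below is only illustrative: the class and helper names are made up, English month names are assumed, and "dd MMM yyyy" is an extra day-first pattern chosen for the demonstration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical demo class; not part of any library.
class MonthNamePatterns {

    // Tries to parse text with one pattern, using English month names.
    // Returns null when the text does not match the pattern.
    static Date parse(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        try {
            return format.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // When parsing, month text matches both the short and the full name.
        System.out.println(parse("MMM dd, yyyy", "Aug 3, 2014"));
        System.out.println(parse("MMM dd, yyyy", "August 3, 2014"));
        // A day-first date needs its own pattern.
        System.out.println(parse("MMM dd, yyyy", "3 August 2014")); // prints "null"
        System.out.println(parse("dd MMM yyyy", "3 August 2014"));
    }
}
```

A month-first pattern cannot match text that starts with the day number, which is why more than one pattern is needed once the field order varies.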
              • 4. Re: Parsing strings to dates with unknown formats
                807589
                This Google hit looks interesting: [http://icecube.wisc.edu/~dglo/software/calparse/index.html]
                • 5. Re: Parsing strings to dates with unknown formats
                  807589
                  Thanks uncle_alice, that's the functionality I want. The Calendar class only goes back to the 1500s though as I understand... Maybe I'll just take that code and adapt it to use Joda-time.
                  • 6. Re: Parsing strings to dates with unknown formats
                    807589
                    scottatstartel wrote:
                    For a date like 03.08.14, I think it's pretty clear .....
                    It's not clear to me. It could be 3rd of August 1514 or 1614 or 1714 or........
                    • 7. Re: Parsing strings to dates with unknown formats
                      807589
                      I think it's safe to assume, if you only specify 2 numbers for the year, that it's between maybe 80 years ago and 20 years into the future. '14 is only 5 years and a few months from now. But again, a date like that I wouldn't be too disappointed in my date parser for not returning a date from.
                      • 8. Re: Parsing strings to dates with unknown formats
                        807589
                        scottatstartel wrote:
                        I think it's safe to assume, if you only specify 2 numbers for the year, that it's between maybe 80 years ago and 20 years into the future.
                        Why isn't it safe to assume it goes from 20 years ago to 80 years in the future? Or 60 years ago to 40 years in the future? And would code written 20 years ago make the same "safe to assume" assumptions as code written today or twenty years from today?
                        • 9. Re: Parsing strings to dates with unknown formats
                          807589
                          Using that assumption I bet a lot of people would be surprised to know World War I started in 2014.
                          • 10. Re: Parsing strings to dates with unknown formats
                            masijade
                            BigDaddyLoveHandles wrote:
                            scottatstartel wrote:
                            I think it's safe to assume, if you only specify 2 numbers for the year, that it's between maybe 80 years ago and 20 years into the future.
                            Why isn't it safe to assume it goes from 20 years ago to 80 years in the future? Or 60 years ago to 40 years in the future? And would code written 20 years ago make the same "safe to assume" assumptions as code written today or twenty years from today?
                            I agree with you, but 80/20 is also what SimpleDateFormat uses: by default it resolves a two-digit year into the 100-year window running from 80 years before to 20 years after the moment the formatter was created. Whether 03.08 (or something similar) means month/day or day/month is trickier, but I would say that if the separator is a period/dot then it is most likely not the US locale and so is most likely day/month, whereas if the separator is "/" then it is most likely the US locale and is probably month/day. Not that that is a hard and fast rule, though. Harder still would be to properly parse months when the month is written out as text and the language differs. Good luck finding the right language, quickly, unless that information is passed along with the date itself. (Edit: And that's not even considering spelling mistakes. ;-))

                            In any case, a moot point as (although I have not looked at the link yet) uncle_alice has posted what is (probably) an acceptable (if not perfect) solution.
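The two-digit-year window discussed above can be made explicit with SimpleDateFormat.set2DigitYearStart, which moves the start of the 100-year window. This is a minimal sketch, not anyone's actual code from this thread: the class and method names are invented, the "MM.dd.yy" pattern assumes the en_US reading of 03.08.14, and 1928 is used as roughly the window start a formatter created in 2008 would get by default.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class; not part of any library.
class TwoDigitYearWindow {

    // Resolves the "yy" in text against a 100-year window that starts on
    // January 1 of windowStartYear. Returns the resulting four-digit year,
    // or -1 when the text does not match the pattern.
    static int resolveYear(String text, int windowStartYear) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat format = new SimpleDateFormat("MM.dd.yy", Locale.US);
        format.setTimeZone(utc);

        Calendar start = new GregorianCalendar(utc);
        start.clear();
        start.set(windowStartYear, Calendar.JANUARY, 1);
        format.set2DigitYearStart(start.getTime());

        try {
            Calendar parsed = new GregorianCalendar(utc);
            parsed.setTime(format.parse(text));
            return parsed.get(Calendar.YEAR);
        } catch (ParseException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveYear("03.08.14", 1928)); // prints 2014
        System.out.println(resolveYear("03.08.14", 1900)); // prints 1914
    }
}
```

The same text yields a different century depending only on where the window starts, which is the ambiguity being argued about: the window is a convention of the parser, not information carried by the date string.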
                            • 11. Re: Parsing strings to dates with unknown formats
                              807589
                              My point about it being safe to assume 80/20 was that if you enter a 2-digit year of 5 years from now, it can be assumed to mean 2014 and not 1914. The 80/20 I qualified with "maybe." If you want to specify 1914 you should be expected to write the 4 digits.

                              In my particular case I am in fact getting the language and country along with dates, so I'd be happy to pass along a locale as well to a date parser. Regardless, if you enter a date that is confusing to a human you should expect the date to either not be accepted or come out wrong when entering it on a computer.

                              My solution looks like it's going to be something along these lines. This is using Joda-time's DateTimeFormat class as I need the ability to go back much further than the 1500s:
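                              // Requires: import org.joda.time.format.DateTimeFormat;
                              // "date" (the input string) and "log" come from the surrounding code.
                              String[] patterns = {"MMMM, yyyy", "MMM dd, yy" /* , etc... */};
                              String iso8601Date = null; // stays null if no pattern matches
                              for (String pattern : patterns) {
                                  try {
                                      // parseDateTime throws IllegalArgumentException if the text does not match
                                      iso8601Date = DateTimeFormat.forPattern(pattern).parseDateTime(date).toString();
                                      break; // stop at the first pattern that parses
                                  } catch (IllegalArgumentException e) {
                                      log.debug("Date of " + date + " was not able to be parsed with the " + pattern + " pattern");
                                  }
                              }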
                              It's a little ugly for my tastes though... if anyone has any better suggestions, I'd love to hear them.
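For comparison, the same try-each-pattern idea can be sketched with java.time, which ships with the JDK from Java 8 onward and is swapped in here for Joda-time only so the example needs no extra library. Everything in it is an assumption for illustration: the class name, the pattern list, returning null on failure, and defaulting the day of month to 1 when a pattern has no day field. LocalDate uses the proleptic ISO calendar, so early years are accepted, but no Julian/Gregorian conversion is attempted.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Locale;

// Hypothetical helper; not part of any library.
class MultiPatternDateParser {

    // Illustrative patterns only; a real list would be tuned to the input files.
    private static final String[] PATTERNS = {
        "uuuu-MM-dd", "MMM d, uuuu", "MMMM d, uuuu", "d MMMM uuuu", "MMMM, uuuu"
    };

    // Tries each pattern in order with the caller's locale.
    // Returns the first successful parse, or null if nothing matches.
    static LocalDate parse(String text, Locale locale) {
        for (String pattern : PATTERNS) {
            DateTimeFormatter formatter = new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern(pattern)
                    // Only applies when the pattern itself has no day field.
                    .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
                    .toFormatter(locale);
            try {
                return LocalDate.parse(text, formatter);
            } catch (DateTimeParseException e) {
                // Fall through and try the next pattern.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("1215-06-15", Locale.ENGLISH));    // prints 1215-06-15
        System.out.println(parse("3 August 2014", Locale.ENGLISH)); // prints 2014-08-03
        System.out.println(parse("03.08.14", Locale.ENGLISH));      // prints null
    }
}
```

Passing the locale in matches the case described above where language and country arrive with the date. An ambiguous input such as 03.08.14 matches no pattern and comes back as null, so it can be flagged for manual review.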

                              Edited by: scottatstartel on Sep 23, 2008 3:23 AM

                              Edited by: scottatstartel on Sep 23, 2008 3:24 AM