Commit Graph

  • afb5a2be64 Merge branch 'master' into diffbot Matt Wells 2013-11-06 10:18:04 -0800
  • 6c2604b0df re-do spider fix Matt Wells 2013-11-06 10:17:13 -0800
  • bf74eba667 Revert "fix spider "could launch" setting because" Matt Wells 2013-11-06 10:16:46 -0800
  • 22a153cbe3 fix spider "could launch" setting because if we have all spiders out and doing a tcp timed out then the round would end! Matt Wells 2013-11-06 10:15:35 -0800
  • 34ce22fe19 use timeout bash cmd to prevent ppthtml, etc. hangs Matt Wells 2013-11-06 11:12:00 -0700
  • 5cbcba55fe take out ppthtml call because it is too buggy until we get a ulimit replacement to work. Matt Wells 2013-11-06 10:51:47 -0700
  • 1d03cd740d debug comment Matt Wells 2013-11-05 14:36:23 -0800
  • 263bb8dfbc fix oops Matt Wells 2013-11-05 14:32:56 -0800
  • 2b904e9563 include firstip in the spider url lock, not just uh48, because using fake ips results in having the same url crawled twice since it is from a different "firstip" so we should include "firstip" in the lock as well to prevent a double round increment. see comment in Spider.cpp to this effect. Matt Wells 2013-11-05 14:31:05 -0800
  • f0adb26fdc remove expired locks more often. was causing stuff not to get spidered. Matt Wells 2013-11-05 13:09:56 -0800
  • 2c7035ac2b do not truncate diffbot reply Matt Wells 2013-11-05 11:17:54 -0800
  • fbc743ad5f fixed core dump when host does not have /etc/hostname file present. Matt Wells 2013-11-05 10:13:25 -0700
  • d5c86f720d Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-05 09:33:47 -0700
  • 5a5973a47f privacy.html update Matt Wells 2013-11-05 09:33:42 -0700
  • 0baf1e68c1 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-04 22:03:15 -0700
  • 8fe64c165c fix potential core dump Matt Wells 2013-11-04 22:03:03 -0700
  • 9335efbf00 fix bug of jenkins spidering same url at same time in different colls Matt Wells 2013-11-04 17:08:11 -0800
  • 74cd3fe0a1 fix spider status stuff. Matt Wells 2013-11-04 16:35:58 -0800
  • 34bffc2cc6 1-second crawl info sleep wrapper update Matt Wells 2013-11-04 16:02:03 -0800
  • ca94750d72 global crawl info realtime updates on local host Matt Wells 2013-11-04 15:07:53 -0800
  • 6f4ce06001 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-04 14:41:52 -0800
  • 2a503095d4 nothing Matt Wells 2013-11-04 14:41:36 -0800
  • 2b9308daef fix deduping. Matt Wells 2013-11-04 14:41:09 -0800
  • 9e955f73a3 fix onlyProcessIfNew parm. fixed gbcontenthash: stuff Matt Wells 2013-11-04 13:57:44 -0800
  • 8c9d5d824b support for gbcontenthash:xxxxx for doing exact match deduping. highest site rank page wins, on ties, lowest docid wins. Matt Wells 2013-11-04 13:47:13 -0800
  • d78413a6c0 quick json validation fix Matt Wells 2013-11-04 11:34:22 -0800
  • d22d2f560e fix json object dump. valid json. Matt Wells 2013-11-04 11:29:22 -0800
  • 9150e8ed50 just show json for specified "name" Matt Wells 2013-11-04 11:05:10 -0800
  • 7b319e5948 show more info in the urls csv file. record whether we processed the url or not in the SpiderReply. normalize /index.html etc. to / for the outlinks. in Links.cpp class. Matt Wells 2013-11-04 10:49:31 -0800
  • c13cce9d72 fix for proxy core mwells 2013-11-03 22:43:44 -0700
  • 21a6b070a7 added X-referring-url: X-anchor-text: and X-surrounding-text: to diffbot http request header. Matt Wells 2013-10-31 11:44:09 -0700
  • e4cce243de minor documentation updates Matt Wells 2013-10-30 19:43:35 -0700
  • 2bdbdb8982 nomenclature download->crawl Matt Wells 2013-10-30 16:14:30 -0700
  • 4892e9eee1 fix issue of losing data destined for a valid collection when one rec caused an error because it was for a deleted collection Matt Wells 2013-10-30 15:48:31 -0700
  • 59be3b85a3 fix crawl status printing Matt Wells 2013-10-30 13:55:16 -0700
  • 9d016b5c3c reset spiderstatus Matt Wells 2013-10-30 13:49:31 -0700
  • 9af1c5cb93 webHook to webhook Matt Wells 2013-10-30 13:39:10 -0700
  • 3acc1f9a51 deal with if callback is null when deleting/resetting collnum Matt Wells 2013-10-30 13:18:19 -0700
  • d0ddfb7d7d would block when deleting or resetting a collection when the rdb tree is saving to disk. keeps retrying every 100ms since it modifies the tree. Matt Wells 2013-10-30 13:12:46 -0700
  • b83dd59913 fix bug when we nuke a collnum from a tree right in the middle of when saving rdb trees in process.cpp. Matt Wells 2013-10-30 12:27:08 -0700
  • fe2144d13d fix compiler error Matt Wells 2013-10-30 10:06:54 -0700
  • adf4d258ae better crawl status reporting. allow for _ in coll names. Matt Wells 2013-10-30 10:00:46 -0700
  • a1ac5a5348 try to fix core from spiderdb scan coming back to no collection record b/c user deleted it. Matt Wells 2013-10-29 16:51:21 -0700
  • 0efec575e4 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-29 16:37:21 -0700
  • 2d413578f2 track down some nasty cores. fix for waiting tree out of sync. Matt Wells 2013-10-29 16:37:14 -0700
  • b22f8d5d19 minor msg update Matt Wells 2013-10-29 15:26:32 -0700
  • f6a697da7b fix core. mwells 2013-10-29 16:23:40 -0600
  • f06e1aaa73 added crawl is initializing crawl status msg Matt Wells 2013-10-29 13:24:54 -0700
  • 1b79c5696e update crawlstatus msgs. Matt Wells 2013-10-29 13:16:01 -0700
  • 1a6c221d36 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-29 09:32:28 -0700
  • 54c50c1f3a added "retrictDomain" parm which defaults to 1. will restrict spidered urls to same domain as seed urls. Matt Wells 2013-10-29 09:31:57 -0700
  • 4477472903 just selecting a url to crawl should count as a pagedownloadattempt -- the CrawlInfo counter. removed urlsExamined count because it was too confusing. Matt Wells 2013-10-28 22:38:15 -0700
  • 20052e34fe made webhook return the crawl name and status as X- fields in the mime. Matt Wells 2013-10-28 22:03:10 -0700
  • 7bc5c30b16 notification bug fixes. use new "crawlDelay" parm. output that too. Matt Wells 2013-10-28 21:20:44 -0700
  • 0ed1d2bc1d fix compilation error Matt Wells 2013-10-28 08:05:22 -0700
  • 64efda06d3 Merge 739b7b4c2a into 5b9a59e4f8 kmptanbu 2013-10-26 11:17:13 -0700
  • 5b9a59e4f8 fix core dump in msg8b handler Matt Wells 2013-10-25 14:55:38 -0700
  • 54d3375a00 fixes when crawling on distributed 2x2 Matt Wells 2013-10-25 14:54:24 -0700
  • 240da39873 Merge branch 'master' into diffbot Matt Wells 2013-10-25 12:32:02 -0700
  • 3ee4da7bb5 float/long fixes Matt Wells 2013-10-25 12:11:40 -0700
  • 5a74a3d3a7 use 7 spiders per ip by default. we have the ip delay to throttle things. Matt Wells 2013-10-25 12:03:15 -0700
  • b4840c6fe8 added new "wait" crawlbot api parm. Matt Wells 2013-10-25 11:14:56 -0700
  • bbe0d23536 fix autosaving for proxy. Matt Wells 2013-10-24 23:11:21 -0700
  • c2d79062ba try to support new "delay" parm Matt Wells 2013-10-24 19:05:57 -0700
  • 726fdb4873 fix that json RE-encoding bug Matt Wells 2013-10-24 18:09:35 -0700
  • 129937168d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-24 17:59:22 -0700
  • 896fbc2570 log seeds Matt Wells 2013-10-24 17:59:15 -0700
  • 242873b272 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-24 17:56:10 -0700
  • fa9f81bd7c trying to fix json decoding bug. make highlight class use safebuf. mwells 2013-10-24 17:55:01 -0700
  • fb7096dc5d num-mirrors: updates Matt Wells 2013-10-24 14:59:35 -0700
  • f65a2fd625 support num-mirrors: instead of index-splits: directive. Matt Wells 2013-10-24 14:32:56 -0700
  • 615e459986 fix double round increment bug. make msg4 send out adds every 500ms not 5000ms so spider is zippier. Matt Wells 2013-10-24 14:05:39 -0700
  • 990eade6aa show more info in url dump to make spider debugging easier for clients. allow filtered/banned outlinks into spiderdb to help. Matt Wells 2013-10-24 11:32:41 -0700
  • e5aa795b76 reset seed dedup table when collrec reset Matt Wells 2013-10-23 18:12:50 -0700
  • 572fa85033 crawl status updates Matt Wells 2013-10-23 17:21:13 -0700
  • 1b738466c1 more respidering fixes Matt Wells 2013-10-23 17:05:56 -0700
  • 70d7f715df make errorcount url filter only for tmperrors like ETCPTIMEDOUT Matt Wells 2013-10-23 16:33:49 -0700
  • ed7f8ff44a append ? to diffbot api url if missing Matt Wells 2013-10-23 16:09:41 -0700
  • c39b45ff88 fix crawl round end detection etc. inc round counter even if not repeating crawl Matt Wells 2013-10-23 15:53:59 -0700
  • 469be5f216 moved email logic from xmldoc into spider.cpp. add maxCrawlRounds parm. added crawlStatus msg in json output to indicate why crawl stopped. mwells 2013-10-23 12:49:32 -0700
  • 9595f65542 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-23 12:09:39 -0700
  • a5a7ab2434 added spider status msg to json output to indicate if spider has hit a limit. no longer disable spiders in xmldoc.cpp when a crawl/process limit is hit. just check for limit when spidering urls in spider.cpp and if it is hit set CollectionRec::m_spiderStatus[Msg] and send email from there. Added maxCrawlRounds parm. Matt Wells 2013-10-23 11:40:30 -0700
  • 739b7b4c2a Update about.html kmptanbu 2013-10-23 13:16:02 +0530
  • a2d54b0d08 nothing. merge test. Matt Wells 2013-10-22 21:53:07 -0700
  • d2b4c3c2c8 test commit Matt Wells 2013-10-22 21:27:21 -0700
  • b9bc403dd5 api help updates Matt Wells 2013-10-22 18:55:19 -0700
  • 22f9e9355d /v2/bulk api fixes Matt Wells 2013-10-22 18:51:09 -0700
  • 7c47823ec4 store crawl-delays of -1 so we at least know we tried to get it from robots.txt and can spider with the default 3 concurrent spiders. Matt Wells 2013-10-22 17:59:27 -0700
  • d16e5d37f1 tested robots crawl-delay directive by forcing a 10.1 second delay for diffbot.com in XmlDoc.cpp. seemed to work after a few fixes. however, it is ultimately only an IP-based crawl delay, although the delay applies to all subdomains on the same domain, it's just that each IP has its own timer for that delay. Matt Wells 2013-10-22 17:41:52 -0700
  • 209e6db25f do not match "isindexed" for getting the diffbot api in XmlDoc::getUrlFilterNum(). do not supply SpiderReply to that function b/c the spider reply is just being generated. Matt Wells 2013-10-22 16:25:26 -0700
  • 8f5bb4a787 a few core dump fixes. get crawl-delay working a little. about half way done. Matt Wells 2013-10-22 15:44:10 -0700
  • 8c3a61f070 /v2/crawl api Matt Wells 2013-10-22 12:25:37 -0700
  • 033a6ec578 update robots.txt Matt Wells 2013-10-21 22:11:11 -0700
  • c7d7e24f9b show spider rounds and round starttime in json output. fixed url filters display bug. reset seeds safebuf parm when coll is reset. Matt Wells 2013-10-21 19:20:03 -0700
  • 92f37343c3 fix xml search results output Matt Wells 2013-10-21 19:06:13 -0700
  • e7cefcbf6a nothing Matt Wells 2013-10-21 18:32:57 -0700
  • 72113d9ae6 more crawlbot bug fixes Matt Wells 2013-10-21 18:05:45 -0700
  • 23c58ff08a normalize web hook. otherwise httpserver::getdoc barfs. Matt Wells 2013-10-21 17:51:23 -0700
  • 0e4d96b3f8 added "seeds" to json reply. store seed urls (and dedup them) in collrec. fixed some respidering issues. any time we re-enter url filters then rebuild the waiting tree. Matt Wells 2013-10-21 17:35:14 -0700
  • f29d8747e4 spider rounds seem to work now Matt Wells 2013-10-21 16:03:28 -0700