Commit Graph

  • 21a6b070a7 added X-referring-url:, X-anchor-text: and X-surrounding-text: to diffbot http request header. Matt Wells 2013-10-31 11:44:09 -07:00
  • e4cce243de minor documentation updates Matt Wells 2013-10-30 19:43:35 -07:00
  • 2bdbdb8982 nomenclature download->crawl Matt Wells 2013-10-30 16:14:30 -07:00
  • 4892e9eee1 fix issue of losing data destined for a valid collection when one rec caused an error because it was for a deleted collection Matt Wells 2013-10-30 15:48:31 -07:00
  • 59be3b85a3 fix crawl status printing Matt Wells 2013-10-30 13:55:16 -07:00
  • 9d016b5c3c reset spiderstatus Matt Wells 2013-10-30 13:49:31 -07:00
  • 9af1c5cb93 webHook to webhook Matt Wells 2013-10-30 13:39:10 -07:00
  • 3acc1f9a51 deal with if callback is null when deleting/resetting collnum Matt Wells 2013-10-30 13:18:19 -07:00
  • d0ddfb7d7d would block when deleting or resetting a collection when the rdb tree is saving to disk. keeps retrying every 100ms since it modifies the tree. Matt Wells 2013-10-30 13:12:46 -07:00
  • b83dd59913 fix bug when we nuke a collnum from a tree right in the middle of when saving rdb trees in process.cpp. Matt Wells 2013-10-30 12:27:08 -07:00
  • fe2144d13d fix compiler error Matt Wells 2013-10-30 10:06:54 -07:00
  • adf4d258ae better crawl status reporting. allow for _ in coll names. Matt Wells 2013-10-30 10:00:46 -07:00
  • a1ac5a5348 try to fix core from spiderdb scan coming back to no collection record b/c user deleted it. Matt Wells 2013-10-29 16:51:21 -07:00
  • 0efec575e4 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-29 16:37:21 -07:00
  • 2d413578f2 track down some nasty cores. fix for waiting tree out of sync. Matt Wells 2013-10-29 16:37:14 -07:00
  • b22f8d5d19 minor msg update Matt Wells 2013-10-29 15:26:32 -07:00
  • f6a697da7b fix core. mwells 2013-10-29 16:23:40 -06:00
  • f06e1aaa73 added crawl is initializing crawl status msg Matt Wells 2013-10-29 13:24:54 -07:00
  • 1b79c5696e update crawlstatus msgs. Matt Wells 2013-10-29 13:16:01 -07:00
  • 1a6c221d36 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-29 09:32:28 -07:00
  • 54c50c1f3a added "retrictDomain" parm which defaults to 1. will restrict spidered urls to same domain as seed urls. Matt Wells 2013-10-29 09:31:57 -07:00
  • 4477472903 just selecting a url to crawl should count as a pagedownloadattempt -- the CrawlInfo counter. removed urlsExamined count because it was too confusing. Matt Wells 2013-10-28 22:38:15 -07:00
  • 20052e34fe made webhook return the crawl name and status as X- fields in the mime. Matt Wells 2013-10-28 22:03:10 -07:00
  • 7bc5c30b16 notification bug fixes. use new "crawlDelay" parm. output that too. Matt Wells 2013-10-28 21:20:44 -07:00
  • 0ed1d2bc1d fix compilation error Matt Wells 2013-10-28 08:05:22 -07:00
  • 5b9a59e4f8 fix core dump in msg8b handler Matt Wells 2013-10-25 14:55:38 -07:00
  • 54d3375a00 fixes when crawling on distributed 2x2 Matt Wells 2013-10-25 14:54:24 -07:00
  • 240da39873 Merge branch 'master' into diffbot Matt Wells 2013-10-25 12:32:02 -07:00
  • 3ee4da7bb5 float/long fixes Matt Wells 2013-10-25 12:11:40 -07:00
  • 5a74a3d3a7 use 7 spiders per ip by default. we have the ip delay to throttle things. Matt Wells 2013-10-25 12:03:15 -07:00
  • b4840c6fe8 added new "wait" crawlbot api parm. Matt Wells 2013-10-25 11:14:56 -07:00
  • bbe0d23536 fix autosaving for proxy. Matt Wells 2013-10-24 23:11:21 -07:00
  • c2d79062ba try to support new "delay" parm Matt Wells 2013-10-24 19:05:57 -07:00
  • 726fdb4873 fix that json RE-encoding bug Matt Wells 2013-10-24 18:09:35 -07:00
  • 129937168d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-24 17:59:22 -07:00
  • 896fbc2570 log seeds Matt Wells 2013-10-24 17:59:15 -07:00
  • 242873b272 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-24 17:56:10 -07:00
  • fa9f81bd7c trying to fix json decoding bug. make highlight class use safebuf. mwells 2013-10-24 17:55:01 -07:00
  • fb7096dc5d num-mirrors: updates Matt Wells 2013-10-24 14:59:35 -07:00
  • f65a2fd625 support num-mirrors: instead of index-splits: directive. Matt Wells 2013-10-24 14:32:56 -07:00
  • 615e459986 fix double round increment bug. make msg4 send out adds every 500ms not 5000ms so spider is zippier. Matt Wells 2013-10-24 14:05:39 -07:00
  • 990eade6aa show more info in url dump to make spider debugging easier for clients. allow filtered/banned outlinks into spiderdb to help. Matt Wells 2013-10-24 11:32:41 -07:00
  • e5aa795b76 reset seed dedup table when collrec reset Matt Wells 2013-10-23 18:12:50 -07:00
  • 572fa85033 crawl status updates Matt Wells 2013-10-23 17:21:13 -07:00
  • 1b738466c1 more respidering fixes Matt Wells 2013-10-23 17:05:56 -07:00
  • 70d7f715df make errorcount url filter only for tmperrors like ETCPTIMEDOUT Matt Wells 2013-10-23 16:33:49 -07:00
  • ed7f8ff44a append ? to diffbot api url if missing Matt Wells 2013-10-23 16:09:41 -07:00
  • c39b45ff88 fix crawl round end detection etc. inc round counter even if not repeating crawl Matt Wells 2013-10-23 15:53:59 -07:00
  • 469be5f216 moved email logic from xmldoc into spider.cpp. add maxCrawlRounds parm. added crawlStatus msg in json output to indicate why crawl stopped. mwells 2013-10-23 12:49:32 -07:00
  • 9595f65542 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-23 12:09:39 -07:00
  • a5a7ab2434 added spider status msg to json output to indicate if spider has hit a limit. no longer disable spiders in xmldoc.cpp when a crawl/process limit is hit. just check for limit when spidering urls in spider.cpp and if it is hit set CollectionRec::m_spiderStatus[Msg] and send email from there. Added maxCrawlRounds parm. Matt Wells 2013-10-23 11:40:30 -07:00
  • a2d54b0d08 nothing. merge test. Matt Wells 2013-10-22 21:53:07 -07:00
  • d2b4c3c2c8 test commit Matt Wells 2013-10-22 21:27:21 -07:00
  • b9bc403dd5 api help updates Matt Wells 2013-10-22 18:55:19 -07:00
  • 22f9e9355d /v2/bulk api fixes Matt Wells 2013-10-22 18:51:09 -07:00
  • 7c47823ec4 store crawl-delays of -1 so we at least know we tried to get it from robots.txt and can spider with the default 3 concurrent spiders. Matt Wells 2013-10-22 17:59:27 -07:00
  • d16e5d37f1 tested robots crawl-delay directive by forcing a 10.1 second delay for diffbot.com in XmlDoc.cpp. seemed to work after a few fixes. however, it is ultimately only an IP-based crawl delay: the delay applies to all subdomains on the same domain, but each IP has its own timer for that delay. Matt Wells 2013-10-22 17:41:52 -07:00
  • 209e6db25f do not match "isindexed" for getting the diffbot api in XmlDoc::getUrlFilterNum(). do not supply SpiderReply to that function b/c the spider reply is just being generated. Matt Wells 2013-10-22 16:25:26 -07:00
  • 8f5bb4a787 a few core dump fixes. get crawl-delay working a little. about half way done. Matt Wells 2013-10-22 15:44:10 -07:00
  • 8c3a61f070 /v2/crawl api Matt Wells 2013-10-22 12:25:37 -07:00
  • 033a6ec578 update robots.txt Matt Wells 2013-10-21 22:11:11 -07:00
  • c7d7e24f9b show spider rounds and round starttime in json output. fixed url filters display bug. reset seeds safebuf parm when coll is reset. Matt Wells 2013-10-21 19:20:03 -07:00
  • 92f37343c3 fix xml search results output Matt Wells 2013-10-21 19:06:13 -07:00
  • e7cefcbf6a nothing Matt Wells 2013-10-21 18:32:57 -07:00
  • 72113d9ae6 more crawlbot bug fixes Matt Wells 2013-10-21 18:05:45 -07:00
  • 23c58ff08a normalize web hook. otherwise httpserver::getdoc barfs. Matt Wells 2013-10-21 17:51:23 -07:00
  • 0e4d96b3f8 added "seeds" to json reply. store seed urls (and dedup them) in collrec. fixed some respidering issues. any time we re-enter url filters then rebuild the waiting tree. Matt Wells 2013-10-21 17:35:14 -07:00
  • f29d8747e4 spider rounds seem to work now Matt Wells 2013-10-21 16:03:28 -07:00
  • 245264c2c9 fix respider frequency bug. Matt Wells 2013-10-21 15:06:23 -07:00
  • 64a1c7c2f2 more bug fixes. if spiders disabled for row in url filters, don't spider the url. Matt Wells 2013-10-21 14:45:12 -07:00
  • 978910ca7a fix more bugs. Matt Wells 2013-10-21 14:17:32 -07:00
  • 1fb85db307 url filters fixes. Matt Wells 2013-10-21 13:44:30 -07:00
  • dc4afad67e do not respider if collectiverespiderfreq is <= 0.0. added a url filter for that. added a couple url filters for retrying errors (tcp timed out, etc) Matt Wells 2013-10-21 12:04:08 -07:00
  • 605289e130 fix a couple collection related bugs causing cores in crawlbot. Matt Wells 2013-10-21 11:38:33 -07:00
  • d2d4379d5c remove debug point. Matt Wells 2013-10-20 10:25:26 -07:00
  • 54915dc384 fix data corruption in RdbMem buffer when running with threads disabled. Matt Wells 2013-10-19 19:37:29 -07:00
  • 85bca4f3d1 can now delete collection while spiders are out Matt Wells 2013-10-18 18:11:14 -07:00
  • 889583ec4b now we can reset collection mid stream Matt Wells 2013-10-18 17:49:36 -07:00
  • ecab57ff0f change collnum of reset collection so any adds in progress will fail. Matt Wells 2013-10-18 15:46:00 -07:00
  • b589b17e63 fix collection resetting. Matt Wells 2013-10-18 15:21:00 -07:00
  • 50313a815f use seeds and spots now Matt Wells 2013-10-18 11:53:14 -07:00
  • a288217e9f a few bug fixes Matt Wells 2013-10-17 18:59:00 -07:00
  • 84a3aded94 spider round updates correction Matt Wells 2013-10-17 17:18:05 -07:00
  • df7fd21253 spider rounds update. Matt Wells 2013-10-17 17:17:19 -07:00
  • fe8ebd23a3 added simplified redirect urls to spiderdb as a new spiderrequest. made XmlDoc::getLinks() call m_links.set(redirUrl.getUrl()) so that it is treated like an outlink on the page and gets added from addOutlinkSpiderRecsToMetaList(). Matt Wells 2013-10-17 12:06:12 -07:00
  • 92413001fb dirty word detector revisions. we need a word-based function, isDirtyWord() which IDs single words, bigrams and trigrams. it'll be much faster than the current approach and won't slow down when the list of dirty words gets big. we then need isDirtyUrl() to use the logic in Speller.cpp to split a url up into composing words to run through isDirtyWord(). mwells 2013-10-16 20:19:49 -07:00
  • b9f94d7d45 show cached json objects as application/json without term highlighting and the disclaimer Matt Wells 2013-10-16 17:54:17 -07:00
  • d9b132fd5a make : into . for indexing json names. Matt Wells 2013-10-16 17:43:46 -07:00
  • 74c2742ced fix mem leak of LinkInfo. fixed json output from injecting url. Matt Wells 2013-10-16 17:17:28 -07:00
  • 70c4ef682d printing updates Matt Wells 2013-10-16 16:27:24 -07:00
  • ee06428059 fix json indexing and searching Matt Wells 2013-10-16 16:15:28 -07:00
  • 9d6c3626d8 json indexing/hashing updates. Matt Wells 2013-10-16 15:41:12 -07:00
  • bb09b4f742 do not store diffbot api url in diffbot reply yet. later may want to store in each diffbot object doc maybe as part of the json content? Matt Wells 2013-10-16 15:24:22 -07:00
  • 11897f09da turn off log debug msg. mwells 2013-10-16 16:24:08 -06:00
  • f8256c3ef9 fix core from diffbot object doc not having valid dmoz info Matt Wells 2013-10-16 15:14:39 -07:00
  • acad6f48d3 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-10-16 14:54:02 -07:00
  • dae005e4ae ensure dmoz info valid when making titlerec Matt Wells 2013-10-16 14:53:48 -07:00
  • 57ee9739e5 fix addColl() logic for collectionless rdbs Matt Wells 2013-10-16 14:38:09 -07:00
  • fc17521697 Merge branch 'master' into diffbot Matt Wells 2013-10-16 14:28:42 -07:00
  • 22ef91a6f1 show all colls in json after deleteCrawl operation Matt Wells 2013-10-16 14:13:28 -07:00