Commit Graph

  • 85a5954256 only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday Daniel Steinberg 2014-03-11 18:52:14 -0700
  • c81bbf6934 more informative error message Daniel Steinberg 2014-03-11 18:10:21 -0700
  • b5be2dcf74 Merge branch 'diffbot-dan' of https://github.com/gigablast/open-source-search-engine into diffbot-dan Daniel Steinberg 2014-03-11 18:09:28 -0700
  • 14c1b2efa3 more informative error message Daniel Steinberg 2014-03-11 18:06:42 -0700
  • 312438a32b Merge branch 'diffbot-dan' into diffbot-testing Matt Wells 2014-03-11 17:02:59 -0700
  • 84784d8d76 minor fixups Matt Wells 2014-03-11 17:02:24 -0700
  • 2331b4673d Defect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around) Daniel Steinberg 2014-03-11 16:21:58 -0700
  • 8445e53c61 fix query reindex some more Matt Wells 2014-03-11 14:46:49 -0700
  • c4b38a5c72 fix a few cores from previous code updates Matt Wells 2014-03-11 09:36:33 -0700
  • 5c2e78e5fa Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-10 20:26:30 -0700
  • 483f3c5bae fix core Matt Wells 2014-03-10 18:17:28 -0700
  • f9fdc96563 no use in newline separating the list of urls if they're going to be read back in and need to be space separated Daniel Steinberg 2014-03-10 15:22:43 -0700
  • e293d465a3 snprintf instead of sprintf Daniel Steinberg 2014-03-10 14:03:28 -0700
  • 41e3988fbc not a conf file Daniel Steinberg 2014-03-10 13:57:13 -0700
  • 4a7bf5d4d0 Story #2040: store raw URL submissions for customer bulk jobs Daniel Steinberg 2014-03-10 13:50:30 -0700
  • bfcb7082f4 fix bug from nuking doledb on a new collection. Matt Wells 2014-03-10 13:48:00 -0700
  • bd4484db3c Merge branch 'testing' into diffbot-testing Matt Wells 2014-03-10 12:08:23 -0700
  • 9debee20dc Merge branch 'diffbot' into testing Matt Wells 2014-03-09 20:44:09 -0700
  • 662b6d4b32 doc updates Matt Wells 2014-03-09 20:43:49 -0700
  • 90ff2c2a25 update example site lists Matt Wells 2014-03-09 20:35:45 -0700
  • 82db7240a3 simple print update Matt Wells 2014-03-09 19:43:32 -0700
  • f7b7274ff1 replace "exact:" directive with "seed:" really the same thing. Matt Wells 2014-03-09 19:35:20 -0700
  • f8e561e6f4 more new site list api fixes Matt Wells 2014-03-09 18:15:57 -0700
  • 11e8c16878 new site list updates Matt Wells 2014-03-09 17:53:24 -0700
  • ed626b162a more site list based spider fixes to be more like gsa Matt Wells 2014-03-08 20:52:31 -0700
  • aab165ed20 fix bad return value from function Matt Wells 2014-03-08 19:32:56 -0800
  • 4cb66c31bf get this new api spidering Matt Wells 2014-03-08 12:02:20 -0700
  • 624c1d4e68 nuke doledb fixes Matt Wells 2014-03-08 10:51:15 -0700
  • 29694f4efe startup fixes Matt Wells 2014-03-08 10:25:56 -0700
  • 8aa0662a27 Merge branch 'diffbot' into testing Matt Wells 2014-03-08 09:38:44 -0700
  • 14817df7a9 new site patterns api stuff Matt Wells 2014-03-08 09:23:32 -0700
  • 7cdd411ef1 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 09:26:47 -0800
  • 72fab5b61e Do not end a crawl while urls are still being spidered because they might add more links to spiderdb when they finally complete. Matt Wells 2014-03-07 09:30:12 -0800
  • dcd42e455e Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 09:02:29 -0800
  • c143ee1fba fix core when creating a new collection because we incremented m_numRecs but did not grow the ptr buffer. also added support for localgb.conf so we can use that instead of gb.conf to avoid git push/pull conflicts. Matt Wells 2014-03-07 09:05:14 -0800
  • f777e6cccd Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 08:23:21 -0800
  • d6177019ec minor fix Matt Wells 2014-03-07 08:07:09 -0800
  • 434dd182d4 fix mem leak. always harvest links for custom crawls. Matt Wells 2014-03-06 21:24:39 -0800
  • e351d2a6f1 get searching on token working Matt Wells 2014-03-06 17:01:41 -0800
  • 27e8e810d2 use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum. Matt Wells 2014-03-06 15:48:11 -0800
  • d74f748e93 search all collections under a token if "&token" is given but not "&c=..." Matt Wells 2014-03-06 11:00:43 -0800
  • 97e46dbf4e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-06 10:45:45 -0800
  • ca2d307229 revert gb.conf Matt Wells 2014-03-06 10:47:03 -0800
  • efa92b16fd Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-06 10:45:35 -0800
  • 25cf0efdbf first compiled stab at multi collection searching. Matt Wells 2014-03-06 10:45:13 -0800
  • 451a092378 fix core from changing parms while evaluating a url. Matt Wells 2014-03-06 07:47:43 -0800
  • 0962e243a4 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-05 07:43:25 -0800
  • 58a1feeea5 specify &header=1 explicitly to get json serp header lest we break our clients' parsers Matt Wells 2014-03-05 07:41:59 -0800
  • 13e33bc261 fix jezebel crawl from hanging. Matt Wells 2014-03-04 19:45:26 -0800
  • 1b62f1582b print memtable when almost full so we can see where the leak is. more spiders for ethan. do not try to get diffbot reply if page is already json. likely it is an injected diffbot json reply. Matt Wells 2014-03-04 18:19:50 -0800
  • 603cd67758 fix csv downloads some more Matt Wells 2014-03-04 12:07:46 -0800
  • 2ab9aaeeaa streaming csv fixes Matt Wells 2014-03-04 11:04:26 -0800
  • 866b09d25e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-04 10:46:28 -0800
  • b1381cc610 make csv streamable, faster and take almost no memory. Matt Wells 2014-03-04 10:45:57 -0800
  • 280dcb85cf fix for passing testUpdatedContent smoketest Matt Wells 2014-03-04 09:09:51 -0800
  • ab9f2b33c1 definition updates Matt Wells 2014-03-04 08:37:39 -0700
  • 1acb16b1ee tweak empty doledb priority logic. anchor it more to m_doleIpTable for more reliability. seems like it was causing some slowdowns during spidering. seems more continuous now. Matt Wells 2014-03-03 13:48:59 -0800
  • 48b5330d9c only skip checking to spider a url if its doleip table is empty Matt Wells 2014-03-03 13:22:27 -0800
  • 282dad6cef deal with no coll recs when getting link text using msg25. do not share g_lineTable between collections. Matt Wells 2014-03-03 08:04:24 -0800
  • ff8a0b4ef1 do not let all collections share the same line table in linkdb.cpp Matt Wells 2014-03-03 07:50:11 -0800
  • a82abe8260 added ^ operator to url crawl patterns. good for tmz crawl. Matt Wells 2014-03-02 14:57:59 -0800
  • 7fd6bbd7f5 added ^ support to url crawl expressions Matt Wells 2014-03-02 14:41:25 -0800
  • e4d425c18f fix coll being deleted when getting link text. Matt Wells 2014-03-02 14:24:49 -0800
  • bb5016e88b add the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned Daniel Steinberg 2014-03-01 11:16:17 -0800
  • aeb2833d20 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-28 11:46:44 -0800
  • 11efab9862 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-28 08:23:59 -0800
  • c596d38e60 fix core from getting title of json object Matt Wells 2014-02-28 08:18:09 -0800
  • 5f3aa24805 took out restrictDomain logic. now we always only follow links on the same domain as the seed UNLESS a url crawl pattern or a url crawl regex was specified. Matt Wells 2014-02-27 19:53:17 -0800
  • 42f254125e fix core in new link text logic. empty msg25 replies are ok if g_errno is set. Matt Wells 2014-02-27 13:56:32 -0800
  • 365fc16606 fix core in "wait in line" logic when getting link info in Linkdb.cpp. Matt Wells 2014-02-27 09:22:35 -0800
  • af9eb8fb73 need to allow clients to not restrict to seed domains. Matt Wells 2014-02-26 22:27:22 -0800
  • 927f4626ee Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-26 22:26:13 -0800
  • eaca38cbfd fix new result streaming logic some more Matt Wells 2014-02-26 21:42:43 -0800
  • 0933884191 fix super fast and mem efficient search results streaming code. Matt Wells 2014-02-26 21:18:08 -0800
  • f11e25024a Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-26 20:34:06 -0800
  • 1030e6ada8 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-26 20:30:20 -0800
  • b429f12346 add logic to save memory when streaming over 200 results back. should fix oom when streaming back hundreds of thousands of results. Matt Wells 2014-02-26 20:33:35 -0800
  • 8208178c79 remove "Initial crawl request" dups from the urls.csv. do not count fake firstip spider request attempts in xmldoc.cpp as crawlbot page download attempts since we just re-add that request with the correct firstip and bail. it basically doubles this count from what users would expect. Matt Wells 2014-02-26 15:48:52 -0800
  • b450bfc2a6 do not show html column in csv. libreoffice and excel flub it if a cell is over 32k or so. Matt Wells 2014-02-26 15:03:05 -0800
  • a0697e1bb5 do not allow custom crawls to spider the web any more. Matt Wells 2014-02-26 10:26:09 -0800
  • 6445b0572b fix gb.conf Matt Wells 2014-02-26 01:08:20 -0800
  • a6b7e088f5 take out tfndb, unused. fix core from diffbot url too long. Matt Wells 2014-02-26 01:07:13 -0800
  • 6716d8f21b remove entry from linetable for linkinfo lookup Matt Wells 2014-02-26 00:27:29 -0800
  • 8bb5d106db fixes for query reindex/delete. Matt Wells 2014-02-25 18:12:45 -0800
  • 33c8123288 more fixes for new link info code. Matt Wells 2014-02-25 13:53:41 -0800
  • 9c486c77ed Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-25 12:32:40 -0800
  • cf6695f625 speed up getNumTotalRecs() by caching it basically for 2 seconds since pingserver.cpp calls it all the time. Matt Wells 2014-02-25 12:14:51 -0800
  • b3ff7df904 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-25 11:05:46 -0800
  • b58d88c57f fix sections infinite loop bug. Matt Wells 2014-02-25 11:09:07 -0800
  • 94a55bf9a6 fixes for new link info code so it doesn't bottleneck. got EFENCE_SIZE working so we can use efence on large allocs only so we don't go oom using it. might help finding some of the out of bounds writing going on. Matt Wells 2014-02-25 10:55:05 -0800
  • ceb623bb8f do not dedup bulks. only respider urls if error is tmp. mess with msg1 in spider.cpp so niceness is MAX_NICENESS and not 0 because it was not able to trigger a doledb dump. Matt Wells 2014-02-23 20:04:46 -0800
  • 72f1312652 new linkdb code compiling. Matt Wells 2014-02-20 17:27:28 -0800
  • 9820f14066 checkpoint Matt Wells 2014-02-20 14:54:21 -0800
  • 88dfa20cbe docid based spider rec related fixes. Matt Wells 2014-02-20 08:46:00 -0800
  • e87b71caef fix query reindex core Matt Wells 2014-02-19 21:07:01 -0800
  • b37b19ea4a print comma before json item so we do not end in trailing comma ever Matt Wells 2014-02-19 10:04:49 -0800
  • dda7648333 try to fix problem of crawls stopping when they shouldn't. seems like it might be doing the trick. Matt Wells 2014-02-19 00:51:46 -0800
  • b48adc0542 try to fix crawls stopping too early Matt Wells 2014-02-18 10:28:48 -0800
  • ae2aed7066 try to fix a few cores from deleting collections. try to spider urls again if user changes certain crawling parms. like regex, patterns, etc. Matt Wells 2014-02-18 09:44:15 -0800
  • 117f0ca4e8 more bio updates Matt Wells 2014-02-17 15:32:45 -0700